Using zloop and virtme-ng for zoned btrfs development
This post is a quick rundown of the zoned btrfs development setup I use on my Linux laptop.
It mainly consists of two ingredients:
- The zoned loopback device (zloop.ko) driver in the kernel and
- virtme-ng, Andrea Righi’s fork of Andy Lutomirski’s virtme
On my workstation I use a similar setup, but instead of zloop, the emulated zoned block devices are created using tcmu-runner.
The way the zloop driver works is simple: it creates a number of equally sized files in a directory (by default ZLOOP_DEF_BASE_DIR). Each of these files represents a zone, and the current size of the file corresponds to the zone’s write pointer. Once a file reaches the zone size the zloop device was created with, the zone is full and needs to be reset. All in all, a very simple design approach.
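This file-per-zone mapping is easy to picture with a toy model. The sketch below is only an illustration of the idea, not zloop’s actual implementation, and the zone size is shrunk to 4 KiB:

```python
import os
import tempfile

ZONE_SIZE = 4096  # toy zone size in bytes (real zloop zones are MiB-sized)

class Zone:
    """Toy model of zloop's design: one file per zone, the file size is
    the write pointer, and the zone is full once size == zone size."""
    def __init__(self, path):
        self.path = path
        open(path, "wb").close()  # empty file == empty zone, wptr at 0

    @property
    def wptr(self):
        return os.path.getsize(self.path)  # write pointer == file size

    def append(self, data):
        if self.wptr + len(data) > ZONE_SIZE:
            raise IOError("zone full, reset required")
        with open(self.path, "ab") as f:  # sequential-only: append writes
            f.write(data)

    def reset(self):
        open(self.path, "wb").close()  # zone reset == truncate to zero

with tempfile.TemporaryDirectory() as d:
    z = Zone(os.path.join(d, "seq-000000"))
    z.append(b"x" * 1024)
    print(z.wptr)  # 1024
    z.reset()
    print(z.wptr)  # 0
```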
First of all, let’s see what ZLOOP_DEF_BASE_DIR is:
johannes@neo:~/src/linux (btrfs-for-next)$ git grep ZLOOP_DEF_BASE_DIR
drivers/block/zloop.c:#define ZLOOP_DEF_BASE_DIR "/var/local/zloop"
drivers/block/zloop.c: zlo->base_dir = kstrdup(ZLOOP_DEF_BASE_DIR, GFP_KERNEL);
johannes@neo:~/src/linux (btrfs-for-next)$
So far so good. Let’s go and mkdir that directory. Also, as my rootfs is btrfs as well (but on a non-zoned device), I want to skip CoW and do in-place updates. This can be done by setting the appropriate attribute on the directory:
From man chattr(1):
C A file with the 'C' attribute set will not be subject to copy-on-
write updates. This flag is only supported on file systems which
perform copy-on-write. (Note: For btrfs, the 'C' flag should be
set on new or empty files. If it is set on a file which already
has data blocks, it is undefined when the blocks assigned to the
file will be fully stable. If the 'C' flag is set on a direc‐
tory, it will have no effect on the directory, but new files cre‐
ated in that directory will have the No_COW attribute set. If the
'C' flag is set, then the 'c' flag cannot be set.)
So let’s go and create the directory and set the attributes:
root@neo:~# mkdir /var/local/zloop
root@neo:~# chattr +C /var/local/zloop
While we’re at it, load the zloop driver as well. If successful, this creates a zloop-control character device for us:
root@neo:~# modprobe zloop
root@neo:~# ls -l /dev/zloop-control
crw-------. 1 root root 10, 261 Sep 12 12:40 /dev/zloop-control
Reading from this device gives us a short usage summary:
root@neo:~# cat /dev/zloop-control
add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u,buffered_io
remove id=%d
Let’s go and create 3 devices, as I’m currently back to doing RAID5 development on zoned btrfs using the RAID stripe-tree.
root@neo:~# echo "add id=0,zone_size_mb=256,conv_zones=2" > /dev/zloop-control
bash: echo: write error: Invalid argument
root@neo:~# dmesg | tail
[957542.560077] wlp0s20f3: disconnect from AP 08:b6:57:68:18:bc for new auth to 04:b4:fe:64:5f:db
[957542.597484] wlp0s20f3: authenticate with 04:b4:fe:64:5f:db (local address=a2:de:b7:95:52:c4)
[957542.597870] wlp0s20f3: send auth to 04:b4:fe:64:5f:db (try 1/3)
[957542.600902] wlp0s20f3: authenticated
[957542.601568] wlp0s20f3: associate with 04:b4:fe:64:5f:db (try 1/3)
[957542.608099] wlp0s20f3: RX ReassocResp from 04:b4:fe:64:5f:db (capab=0x1411 status=0 aid=37)
[957542.609636] wlp0s20f3: associated
[957542.668975] wlp0s20f3: Limiting TX power to 20 (20 - 0) dBm as advertised by 04:b4:fe:64:5f:db
[959918.875201] zloop: Module loaded
[960196.424535] zloop: Failed to open directory /var/local/zloop/0 (err=-2)
root@neo:~# errno 2
ENOENT 2 No such file or directory
root@neo:~# ls /var/local/zloop/
root@neo:~# mkdir /var/local/zloop/0
root@neo:~# echo "add zone_size_mb=256,conv_zones=2" > /dev/zloop-control
root@neo:~# ls /dev/zloop*
zloop0 zloop-control
root@neo:~# ls /dev/zloop0
/dev/zloop0
root@neo:~# file /dev/zloop0
/dev/zloop0: block special (259/4)
OK, gotcha: you need to create the per-device directories under ZLOOP_DEF_BASE_DIR as well.
root@neo:~# lsattr /var/local/zloop/0/
---------------------- /var/local/zloop/0/cnv-000000
---------------------- /var/local/zloop/0/cnv-000001
---------------C------ /var/local/zloop/0/seq-000002
---------------C------ /var/local/zloop/0/seq-000003
---------------C------ /var/local/zloop/0/seq-000004
---------------C------ /var/local/zloop/0/seq-000005
---------------C------ /var/local/zloop/0/seq-000006
---------------C------ /var/local/zloop/0/seq-000007
---------------C------ /var/local/zloop/0/seq-000008
[...]
---------------C------ /var/local/zloop/0/seq-000060
---------------C------ /var/local/zloop/0/seq-000061
---------------C------ /var/local/zloop/0/seq-000062
---------------C------ /var/local/zloop/0/seq-000063
Then go ahead and create the other two zloop devices:
root@neo:~# echo "add zone_size_mb=256,conv_zones=2" > /dev/zloop-control
root@neo:~# echo "add zone_size_mb=256,conv_zones=2" > /dev/zloop-control
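Since forgetting the per-device directory bites easily, the mkdir and the add command can be combined in a tiny helper. This is a hypothetical convenience wrapper, not part of zloop; the base and control paths are parameters so it can be exercised against plain files:

```python
import os

def zloop_add(dev_id, zone_size_mb=256, conv_zones=2,
              base="/var/local/zloop", ctrl="/dev/zloop-control"):
    """Create the per-device directory zloop expects (it fails with
    ENOENT otherwise), then write the add command to the control node."""
    os.makedirs(os.path.join(base, str(dev_id)), exist_ok=True)
    cmd = f"add id={dev_id},zone_size_mb={zone_size_mb},conv_zones={conv_zones}"
    with open(ctrl, "w") as f:
        f.write(cmd)
    return cmd

# e.g. zloop_add(1) followed by zloop_add(2) for the two devices above
```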
Let’s also do a quick check what blkzone report has to say about the devices:
root@neo:~# blkzone report /dev/zloop0 | head -10
start: 0x000000000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000080000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000100000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000180000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000280000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000300000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000380000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000400000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000480000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
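All blkzone figures are in 512-byte sectors, so a quick bit of arithmetic confirms the geometry matches what we asked for:

```python
SECTOR = 512       # blkzone report units are 512-byte sectors
MIB = 1024 * 1024

zone_len = 0x080000                # 'len' field from the report above
print(zone_len * SECTOR // MIB)    # 256 -> our zone_size_mb=256

# 64 zone files (seq-000063 was the last one) at 256 MiB each:
print(64 * 256 // 1024)            # 16 -> a 16 GiB device in total
```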
Let’s do a quick check using virtme-ng:
root@neo:~# vng --run=/home/johannes/src/linux --disk=/dev/zloop0
WARNING: Image format was not specified for '/dev/zloop0' and probing guessed raw.
Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
Specify the 'raw' format explicitly to remove the restrictions.
_ _
__ _(_)_ __| |_ _ __ ___ ___ _ __ __ _
\ \ / / | __| __| _ _ \ / _ \_____| _ \ / _ |
\ V /| | | | |_| | | | | | __/_____| | | | (_| |
\_/ |_|_| \__|_| |_| |_|\___| |_| |_|\__ |
|___/
kernel version: 6.17.0-rc5+ x86_64
(CTRL+d to exit)
root@virtme-ng:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nullb0 253:0 0 250G 0 disk
vda 254:0 0 16G 0 disk
root@virtme-ng:~# blkzone report /dev/vda | head -3
blkzone: /dev/vda: unable to determine zone size
OK, that didn’t work the way I wanted it to. Let’s try adding the explicit qemu options then:
root@neo:~# vng --run=/home/johannes/src/linux --qemu-opts="-drive driver=host_device,file=/dev/zloop0,if=virtio,cache.direct=on"
_ _
__ _(_)_ __| |_ _ __ ___ ___ _ __ __ _
\ \ / / | __| __| _ _ \ / _ \_____| _ \ / _ |
\ V /| | | | |_| | | | | | __/_____| | | | (_| |
\_/ |_|_| \__|_| |_| |_|\___| |_| |_|\__ |
|___/
kernel version: 6.17.0-rc5+ x86_64
(CTRL+d to exit)
root@virtme-ng:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nullb0 253:0 0 250G 0 disk
vda 254:0 0 16G 0 disk
root@virtme-ng:~# blkzone report /dev/vda | head -5
start: 0x000000000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000080000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000100000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000180000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
root@virtme-ng:~#
Eureka! It worked! OK, then let’s chown(1) the devices to my user and see if we can still do block device operations on them in the VM. To ease things I’ll add my user to the disk group later.
johannes@neo:~$ ls -l /dev/zloop0
brw-rw----. 1 root disk 259, 4 Sep 12 12:48 /dev/zloop0
johannes@neo:~$ id
uid=1000(johannes) gid=1000(johannes) groups=1000(johannes),10(wheel) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
johannes@neo:~$ sudo chown johannes:johannes /dev/zloop0
[sudo] password for johannes:
johannes@neo:~$ ls -l /dev/zloop0
brw-rw----. 1 johannes johannes 259, 4 Sep 12 12:48 /dev/zloop0
johannes@neo:~$ vng --user=root --run=/home/johannes/src/linux --qemu-opts="-drive driver=host_device,file=/dev/zloop0,if=virtio,cache.direct=on"
_ _
__ _(_)_ __| |_ _ __ ___ ___ _ __ __ _
\ \ / / | __| __| _ _ \ / _ \_____| _ \ / _ |
\ V /| | | | |_| | | | | | __/_____| | | | (_| |
\_/ |_|_| \__|_| |_| |_|\___| |_| |_|\__ |
|___/
kernel version: 6.17.0-rc5+ x86_64
(CTRL+d to exit)
root@virtme-ng:/home/johannes# blkzone report /dev/vda | head -5
start: 0x000000000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000080000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
start: 0x000100000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000180000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
So far so good. Let’s do a mkfs.btrfs on the device and see if it works, or if I need to be real root for this.
root@virtme-ng:/home/johannes# mkfs.btrfs /dev/vda
btrfs-progs v6.16
See https://btrfs.readthedocs.io for more information.
zoned: /dev/vda: host-managed device detected, setting zoned feature
Resetting device zones /dev/vda (64 zones) ...
Label: (null)
UUID: d493558e-314f-4b48-885f-ec163b5ddc07
Node size: 16384
Sector size: 4096 (CPU page size: 4096)
Filesystem size: 16.00GiB
Block group profiles:
Data: single 512.00MiB
Metadata: DUP 256.00MiB
System: DUP 256.00MiB
SSD detected: no
Zoned device: yes
Zone size: 256.00MiB
Features: extref, skinny-metadata, no-holes, free-space-tree, zoned
Checksum: crc32c
Number of devices: 1
Devices:
ID SIZE ZONES PATH
1 16.00GiB 64 /dev/vda
root@virtme-ng:/home/johannes# mount /dev/vda /mnt
root@virtme-ng:/home/johannes# xfs_io -fc "pwrite 0 1M" -c fsync /mnt/test
wrote 1048576/1048576 bytes at offset 0
1 MiB, 256 ops; 0.0011 sec (907.441 MiB/sec and 232304.9002 ops/sec)
root@virtme-ng:/home/johannes# blkzone report /dev/vda | grep oi
start: 0x000180000, len 0x080000, cap 0x080000, wptr 0x000120 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x000800 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000300000, len 0x080000, cap 0x080000, wptr 0x000060 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000380000, len 0x080000, cap 0x080000, wptr 0x000060 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000400000, len 0x080000, cap 0x080000, wptr 0x0003c0 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
start: 0x000480000, len 0x080000, cap 0x080000, wptr 0x0003c0 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
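Again in 512-byte sectors, the write pointers line up with the I/O we did: one zone advanced by exactly the 1 MiB we wrote, while the smaller movements are, as I read the output, metadata updates:

```python
SECTOR = 512  # blkzone wptr values are in 512-byte sectors

print(0x000800 * SECTOR)           # 1048576 bytes = the 1 MiB pwrite
print(0x000120 * SECTOR // 1024)   # 144 (KiB), small metadata writes
```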
So it looks like we could do writes, and the write pointers moved as well. Let’s do two more small tests:
- Halt and power on the VM again to see if everything is still as we’d expect it, and
- Do a memory dump of the VM so we can check whether we can do post-mortem analysis of a crash using e.g. drgn (or crash, if you really want to) in case we need it some day.
root@virtme-ng:/home/johannes# exit
logout
johannes@neo:~$ vng --debug --user=root --run=/home/johannes/src/linux --qemu-opts="-drive driver=host_device,file=/dev/zloop0,if=virtio,cache.direct=on"
_ _
__ _(_)_ __| |_ _ __ ___ ___ _ __ __ _
\ \ / / | __| __| _ _ \ / _ \_____| _ \ / _ |
\ V /| | | | |_| | | | | | __/_____| | | | (_| |
\_/ |_|_| \__|_| |_| |_|\___| |_| |_|\__ |
|___/
kernel version: 6.17.0-rc5+ x86_64
(CTRL+d to exit)
root@virtme-ng:/home/johannes# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nullb0 253:0 0 250G 0 disk
vda 254:0 0 16G 0 disk
root@virtme-ng:/home/johannes# mount /dev/vda /mnt
root@virtme-ng:/home/johannes# ls /mnt
test
root@virtme-ng:/home/johannes# xfs_io -fc "pread 0 1M" /mnt/test
read 1048576/1048576 bytes at offset 0
1 MiB, 256 ops; 0.0101 sec (98.961 MiB/sec and 25333.9931 ops/sec)
OK, task number 1: check.
Now let’s check if we can do a memory dump:
johannes@neo:~/src/linux (btrfs-for-next)$ vng --dump vmcore
johannes@neo:~/src/linux (btrfs-for-next)$ file vmcore
vmcore: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), too many program headers (65535)
johannes@neo:~/src/linux (btrfs-for-next)$ drgn -c vmcore -s vmlinux
drgn 0.0.32 (using Python 3.13.7, elfutils 0.193, with debuginfod (dlopen), with libkdumpfile, with lzma)
For help, type help(drgn).
>>> import drgn
>>> from drgn import FaultError, NULL, Object, alignof, cast, container_of, execscript, implicit_convert, offsetof, reinterpret, sizeof, stack_trace
>>> from drgn.helpers.common import *
>>> from drgn.helpers.linux import *
>>> for m in for_each_mount(prog, fstype="btrfs"):
... print(f"fs_info is @ {m.mnt.mnt_sb.s_fs_info}")
...
fs_info is @ (void *)0xffff888004848000
>>> fs_info = Object(prog, "struct btrfs_fs_info *", 0xffff888004848000)
>>> fs_info.zone_size
(u64)268435456
>>> fs_info.max_zone_append_size
(u64)1040384
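The value drgn reads back is consistent with the device we built: fs_info.zone_size is exactly the 256 MiB we passed as zone_size_mb. A quick check:

```python
# fs_info.zone_size as printed in the drgn session above
zone_size = 268435456
print(zone_size // (1024 * 1024))  # 256 -> matches zone_size_mb=256
```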
So as you can see, the test/debug setup is working. The next step is setting up automated xfstests runs on top of it, but that exercise is left for another day.