005. Low-curse ZFS-on-root for new Debian installations on multi-disk systems
In a brief departure from microcomputing saprophagia, imagine you're an American acquiring a ThinkPad P15, sporting two two-terabyte drives, and, naturally, you'd like to have a single continuous volume spanning both of them to do whatever your heart desires.
After the shock at willingly staying in the American empire and paying it for the privilege wears off, two approaches reveal themselves:
Now, I also wanted to do a ZFS-on-root for my own nefarious uses, and so it was decided:
Good, do it before I buy the p15 and report back!
To minimise how cursed this is, the following restrixions are in place:
This means that all you need is an EFI-compatible multi-disk platform and some way to EFI boot it into d-i — the bootloader will be fixed, ZFS installed normally, and the rootfs dumped/restored thereonto after normally booting into the target system.
The test setup is QEMU -bios OVMF.fd
and two 8G drives, one of which is designated as primary.
Filesystem tuning is not covered, encryption is supported, SecureBoot is not covered because I haven't figured it out yet,
most-all gotchas are hopefully explained,
there's prior art that was of little help, and
zfs-{initramfs,dracut}
are full of problems that I try to work around here:
The restrixions to the original installation are all during the partitioning phase and as follows:
Either installing sid or enabling backports for buster is required,
since zfs-mount-generator(8)
appeared in 0.8.0;
if you don't install GRUB you'll have to boot
manually
from the EFI shell after copying the kernel and initrd from target/boot/
to target/boot/efi/
.
Using GRUB and a unified boot-on-root is, for better or for worse, not possible; of the zpool features that matter for reading a pool, GRUB supports only these:
/*
 * List of pool features that the grub implementation of ZFS supports for
 * read. Note that features that are only required for write do not need
 * to be listed here since grub opens pools in read-only mode.
 */
#define MAX_SUPPORTED_FEATURE_STRLEN 50
static const char *spa_feature_names[] = {
	"org.illumos:lz4_compress",
	"com.delphix:hole_birth",
	"com.delphix:embedded_data",
	"com.delphix:extensible_dataset",
	"org.open-zfs:large_blocks",
	NULL
};
Which means that these, as of OpenZFS 0.8.4, are the ones that it doesn't support:
nabijaczleweli@tarta:~$ man zpool-features | grep READ-ONLY.*no -B4 | sed 's/GUID//' | \
    awk '!/^$/ && !/COMPATIBLE/ && !/--/ \
         {if(last == "") last = $1; else {print last "\t" $1; last = ""}}' |
    grep -vE 'lz4_compress|hole_birth|embedded_data|extensible_dataset|large_blocks' | column -t
bookmark_v2 com.datto:bookmark_v2
device_removal com.delphix:device_removal
edonr org.illumos:edonr
encryption com.datto:encryption
large_dnode org.zfsonlinux:large_dnode
multi_vdev_crash_dump com.joyent:multi_vdev_crash_dump
sha512 org.illumos:sha512
skein org.illumos:skein
Confer zpool-features(5) for details as to why this is not acceptable (remember: you couldn't enable any of these even by accident on your root pool, and most of them can't ever be turned off).
Using a separate /boot
on ZFS required a lot of dicking around with the options for GRUB to recognise the filesystem at all
(confer prior art,
Step 2: Disk Formatting, 4. Create the boot pool, with its 12 lines of options for pool features alone)
and GRUB has difficulties generating the right root=
cmdline options for a normal pool anyway
(but it does work if you convince it hard enough and have no other options, like on EFI-less systems).
Using a separate /boot
on ext*/FAT didn't make much sense for booting on EFI, either,
so cutting GRUB out was a simple way to berid of a lot of bollocks.
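For illustration, a GRUB-compatible boot pool has to be created with every feature disabled and only a hand-picked subset re-enabled, one flag per feature; this is a hypothetical, abridged sketch after the prior art's recipe (the bpool name, the boot-pool partlabel, and the exact feature set are illustrative and vary by OpenZFS release), not something this guide uses:

```shell
# Sketch only: -d disables all pool features, then each GRUB-readable
# (or read-only-compatible) feature is re-enabled individually.
zpool create -d \
    -o feature@async_destroy=enabled \
    -o feature@embedded_data=enabled \
    -o feature@hole_birth=enabled \
    -o feature@lz4_compress=enabled \
    -O mountpoint=/boot -R /mnt \
    bpool /dev/disk/by-partlabel/boot-pool
```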
Boot into your freshly installed system, become root, and:
Get rid of GRUB, optionally keeping EFI tools. | root@zoot:~# apt-mark manual efibootmgr mokutil efibootmgr, mokutil set to manually installed. root@zoot:~# apt autopurge grub* The following packages will be REMOVED: grub-common* grub-efi-amd64* grub-efi-amd64-bin* grub-efi-amd64-signed* grub2-common* os-prober* shim-signed* gettext-base* libbrotli1* libfreetype6* libfuse2* libpng16-16* shim-helpers-amd64-signed* shim-signed-common* shim-unsigned* 0 upgraded, 0 newly installed, 15 to remove and 0 not upgraded. After this operation, 44.9 MB disk space will be freed. |
Verify that you only have GRUB here before cleaning it out! | root@zoot:~# tree /boot/efi/ /boot/efi/ ├── EFI │ └── debian │ ├── BOOTX64.CSV │ ├── fbx64.efi │ ├── grub.cfg │ ├── grubx64.efi │ ├── mmx64.efi │ └── shimx64.efi └── NvVars root@zoot:~# rm -rf /boot/efi/EFI/ |
Depending on the firmware state there may be a lot of boot entries, but the GRUB one is very likely the BootCurrent and called "debian". | root@zoot:~# efibootmgr -v BootCurrent: 0001 BootOrder: 0001,0000,0002,… Boot0000*: UiApp FvVol(GUID)/FvFile(GUID) Boot0001*: debian HD(1,GPT,GUID)/File(\EFI\debian\shimx64.efi) Boot0002*: UEFI QEMU DVD-ROM QM00003 PciRoot(0x0)/… root@zoot:~# efibootmgr -Bb 1 BootCurrent: 0001 Boot0000*: UiApp FvVol(GUID)/FvFile(GUID) Boot0002*: UEFI QEMU DVD-ROM QM00003 |
Install systemd-boot and enable a timeout. This might not be required on platforms that support the systemd Boot Loader Specification (are there any?). Since systemd 251.2-3 (Wed, 08 Jun 2022 23:56:04 +0100), systemd-boot resides in its own homonymous package. |
root@zoot:~# apt install systemd-boot The following NEW packages will be installed: systemd-boot systemd-boot-efi 0 upgraded, 2 newly installed, 0 to remove and 0 not upgraded. After this operation, 739 kB of additional disk space will be used. root@zoot:~# bootctl install Created "/boot/efi/EFI", other directories. Copied "/usr/lib/systemd/boot/efi/systemd-bootx64.efi" to "/boot/efi/EFI/{systemd/systemd-bootx64.efi,BOOT/BOOTX64.EFI}". Random seed file /boot/efi/loader/random-seed successfully written (512 bytes). Created EFI boot entry "Linux Boot Manager". root@zoot:~# sed -i 's/#timeout/timeout/' /boot/efi/loader/loader.conf |
Optionally, override 90-loaderentry.install so that entry titles carry the kernel version. | root@zoot:~# cp /{usr/lib,etc}/kernel/install.d/90-loaderentry.install root@zoot:~# diff -U2 /{usr/lib,etc}/kernel/install.d/90-loaderentry.install --- /usr/lib/kernel/install.d/90-loaderentry.install 2020-09-02 11:49:08.000000000 +0200 +++ /etc/kernel/install.d/90-loaderentry.install 2020-09-13 05:06:49.541120128 +0200 @@ -44,4 +44,6 @@ if ! [[ $PRETTY_NAME ]]; then PRETTY_NAME="Linux $KERNEL_VERSION" +else + PRETTY_NAME+=" with Linux $KERNEL_VERSION" fi |
Neuter the 85-initrd.install hook, which copies the initrd into \MID\VER\initrd
despite that being handled by 90-loaderentry.install , which copies it into \MID\VER\BASENAME ,
thereby duplicating it. This will not be required in the future. This hook was fixed in 247.1-4 (Fri, 11 Dec 2020 20:48:44 +0000); this step is detrimental to modern systems. |
root@zoot:~# ln -s /dev/null /etc/kernel/install.d/85-initrd.install |
Hook kernel-install into the kernel maintainer scripts. GRUB installs its hook as zz-update-grub , but kernel-install sorts later than anything else I've seen in there myself.
I hope to integrate this and make it not required in the future. Integrated into base and fixed in 251.5-1 (Sun, 02 Oct 2022 21:23:49 +0200). |
root@zoot:~# cat > /etc/kernel/postinst.d/kernel-install #!/bin/sh bootctl is-installed > /dev/null || exit 0 exec kernel-install add "$1" "/boot/vmlinuz-$1" "/boot/initrd.img-$1" ^D root@zoot:~# cat > /etc/kernel/postrm.d/kernel-install #!/bin/sh bootctl is-installed > /dev/null || exit 0 exec kernel-install remove "$1" ^D root@zoot:~# chmod +x /etc/kernel/post{inst,rm}.d/kernel-install |
Install the kernel. The initial run takes a long time, hence the -v ; 62dd03a4928c412180b3024ac6c03a90 is this machine's ID.sd-boot wouldn't make \<MID> for some time – if you don't get the "Installing" and "Creating" lines on a systemd pre-v250 system, you'll need to mkdir /boot/efi/Default (or "/boot/efi/$(< /etc/machine-id)") manually and kernel-install again. The current cmdline will be used for the boot entry, overridable with /etc/kernel/cmdline . |
root@zoot:~# kernel-install -v add $(uname -r) /boot/vml<TAB> /boot/ini<TAB> Running depmod -a 5.8.0-1-amd64 Installing /boot/efi/62dd03a4928c412180b3024ac6c03a90/5.8.0-1-amd64/vmlinuz-5.8.0-1-amd64 Creating /boot/efi/loader/entries/62dd03a4928c412180b3024ac6c03a90-5.8.0-1-amd64.conf root@zoot:~# tree /boot/efi/ /boot/efi/ ├── 62dd03a4928c412180b3024ac6c03a90 │ └── 5.8.0-1-amd64 │ ├── initrd-5.8.0-1-amd64 │ └── linux ├── EFI │ ├── BOOT │ │ └── BOOTX64.EFI │ ├── Linux │ └── systemd │ └── systemd-bootx64.efi ├── loader │ ├── entries │ │ └── 62dd03a4928c412180b3024ac6c03a90-5.8.0-1-amd64.conf │ ├── loader.conf │ └── random-seed └── NvVars 8 directories, 9 files |
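For reference, the entry file that 90-loaderentry.install generates follows the systemd Boot Loader Specification; it looks roughly like this (the title and options lines are illustrative — the options line simply mirrors whatever cmdline was in effect):

```
title      Debian GNU/Linux bullseye/sid with Linux 5.8.0-1-amd64
version    5.8.0-1-amd64
machine-id 62dd03a4928c412180b3024ac6c03a90
options    root=/dev/sda2 ro quiet
linux      /62dd03a4928c412180b3024ac6c03a90/5.8.0-1-amd64/linux
initrd     /62dd03a4928c412180b3024ac6c03a90/5.8.0-1-amd64/initrd-5.8.0-1-amd64
```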
I'd recommend rebooting now to verify that this works, which should look like this:
If not, and sd-boot
shows errors or doesn't start at all: boot into the EFI shell,
fs0:, and \<MID>\<VER>\linux initrd=\<MID>\<VER>\initrd.img-<VER> root=/dev/sda2
(the shell should support tab-completion, you might need to add a space before completing the initrd)
(the root=
option assumes you installed to the second partition of the first SCSI drive, as I did; adjust to taste),
then write me an e-mail or a DM
or whatever else is listed here so I can issue a correxion; thanks in advance, &c.:
Now we can boot without dealing with GRUB. On to the ZFS bit:
Install the prerequisites, remember to match the headers to your kernel! | root@zoot:~# apt install --no-install-recommends linux-headers-amd64 build-essential The following NEW packages will be installed: binutils binutils-common binutils-x86-64-linux-gnu build-essential dpkg-dev g++ g++-10 gcc gcc-10 libasan6 libatomic1 libbinutils libc-dev-bin libc6-dev libcc1-0 libcrypt-dev libctf-nobfd0 libctf0 libdpkg-perl libgcc-10-dev libgdbm-compat4 libgdbm6 libgomp1 libitm1 liblsan0 libperl5.30 libquadmath0 libstdc++-10-dev libtsan0 libubsan1 linux-compiler-gcc-10-x86 linux-headers-5.8.0-1-amd64 linux-headers-5.8.0-1-common linux-headers-amd64 linux-kbuild-5.8 linux-libc-dev make patch perl perl-modules-5.30 0 upgraded, 40 newly installed, 0 to remove and 0 not upgraded. After this operation, 286 MB of additional disk space will be used. |
And ZFS. Depending on dependency ordering, starting the services sometimes fails; simply re-running the command will fix it. |
root@zoot:~# apt install zfs-dkms The following NEW packages will be installed: distro-info-data dkms fakeroot libfakeroot libnvpair1linux libuutil1linux libzfs2linux libzpool2linux lsb-release python3-distutils python3-lib2to3 zfs-dkms zfs-zed zfsutils-linux After this operation, 22.3 MB of additional disk space will be used. Building for 5.8.0-1-amd64 Building initial module for 5.8.0-1-amd64 Done. zfs.ko: Running module version sanity check. - Original module - No original module exists within this kernel - Installation - Installing to /lib/modules/5.8.0-1-amd64/updates/dkms/ &c. DKMS: install completed. root@zoot:~# zpool list no pools available |
Here's the thing: zfs-initramfs is much more broken than zfs-dracut ;
I've made sure that what follows works with both of them, but I'd recommend using dracut anyway.If you want to be really sure after this, you can reboot, add break or rd.break to the cmdline,
and run a ZFS command in the initrd. |
root@zoot:~# apt install zfs-initramfs The following NEW packages will be installed: zfs-initramfs After this operation, 108 kB of additional disk space will be used. root@zoot:~# apt install --no-install-recommends dracut zfs-dracut && apt autopurge initramfs-tools The following packages will be REMOVED: initramfs-tools* initramfs-tools-core* klibc-utils* libklibc* The following NEW packages will be installed: dracut dracut-core kpartx libglib2.0-0 pkg-config zfs-dracut After this operation, 4,644 kB of additional disk space will be used. root@zoot:~# /etc/kernel/postinst.d/kernel-install $(uname -r) |
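The break-shell verification mentioned above can be as simple as this hypothetical transcript, typed at the emergency prompt (break=premount for initramfs-tools, rd.break for dracut):

```shell
# In the initramfs emergency shell:
modprobe zfs     # should load cleanly from the freshly built initrd
zpool import     # with no arguments, lists importable pools without importing
exit             # resume booting
```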
And now a brief interlude on the layout of ZFS datasets in the VFS.
Prior art, Step 3: System Installation issues a lot of commands and doesn't really explain why.
The final mount tree is as such:
zoot/
/ zoot/root
├── boot zoot/boot
├── home zoot/home
│ └── nab zoot/home/nab
├── root zoot/home/root
├── opt zoot/opt
├── srv zoot/srv
│ zoot/usr
├── usr/local zoot/usr/local
│ zoot/var
├── var/cache zoot/var/cache
│ zoot/var/lib
├── var/log zoot/var/log
└── var/tmp zoot/var/tmp
Why?
- The root= parameter requires a dataset to mount, and the scripts don't behave too well when the detected bootfs is the root; zoot/root is filling that role instead;
- /var, /var/lib, and /usr are all required for the system to boot (I didn't try a split-/usr system, but don't have much hope either way), and the initrd only mounts the root filesystem — things don't go well when these aren't there, but the descendant filesystems are only used after the real init remounts everything;
- /var/log needs POSIX ACLs for sd-journald, and /var/tmp needs to be sticky a+rwx.
And so:
Move /tmp to tmpfs.
The Debian default is to keep it on / ; if you, for some reason, prefer this, you can treat it like /var/tmp later. |
root@zoot:~# ln -s /usr/share/systemd/tmp.mount /etc/systemd/system/ root@zoot:~# systemctl enable tmp.mount Created symlink /etc/systemd/system/local-fs.target.wants/tmp.mount → /usr/share/systemd/tmp.mount. root@zoot:~# mv /tmp{,_} && systemctl start tmp.mount && mv /tmp{_,} |
Now the pool on the second, heretofore unused, disk
(note that virtio devices don't have standard SCSI IDs, but are otherwise stably ordered). As promised, filesystem tuning is not included, so I'm not speccing -o ashift=12 -O relatime=on -O compress=zstd and whatever else, but encryption can be added with -O encryption=on -O keyformat=passphrase. |
root@zoot:~# ls -l /dev/disk/by-id/ ata-QEMU_DVD-ROM_QM00003 -> ../../sr0 ata-QEMU_HARDDISK_QM00001 -> ../../sda ata-QEMU_HARDDISK_QM00001-part1 -> ../../sda1 ata-QEMU_HARDDISK_QM00001-part2 -> ../../sda2 ata-QEMU_HARDDISK_QM00002 -> ../../sdb root@zoot:~# zpool create -O mountpoint=/ -O canmount=off -R /mnt zoot ata-QEMU_HARDDISK_QM00002 root@zoot:~# zpool list NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT zoot 7.50G 564K 7.50G - - 0% 0% 1.00x ONLINE /mnt root@zoot:~# zfs list NAME USED AVAIL REFER MOUNTPOINT zoot 492K 7.27G 192K /mnt |
Enable zfs-mount-generator(8)
via zed(8) for mount ordering; this makes systemd aware of,
i.a., /boot/efi depending on /boot, and of mounting /var/log before starting journald there,
making both mounts (a) behave as expected and (b) work. history_event-zfs-list-cacher.sh is enabled by default since zfs-zed 2.0.1-1. |
root@zoot:~# mkdir -p /etc/zfs/zfs-list.cache root@zoot:~# touch /etc/zfs/zfs-list.cache/zoot root@zoot:~# systemctl restart zfs-zed |
And the filesystems, as discussed above. Prior art, Step 3: System Installation, 3. Create datasets uses the com.sun:auto-snapshot property,
but as far as I can tell it's used only by third-party tools. |
root@zoot:~# zfs create -o mountpoint=/ zoot/root root@zoot:~# for f in zoot/{home{,/nab},boot,var{,/lib,/log,/cache,/tmp},opt,srv,usr{,/local}}; do zfs create $f; done root@zoot:~# zfs create -o mountpoint=/root zoot/home/root root@zoot:~# zfs set canmount=off zoot/{usr,var{,/lib}} root@zoot:~# zfs set com.sun:auto-snapshot=false zoot/var/{cache,tmp} root@zoot:~# zfs set acltype=posixacl xattr=sa zoot/var/log root@zoot:~# chmod 1777 /mnt/var/tmp root@zoot:~# zfs list -o name,mountpoint,canmount NAME MOUNTPOINT CANMOUNT zoot /mnt off zoot/boot /mnt/boot on zoot/home /mnt/home on zoot/home/nab /mnt/home/nab on zoot/home/root /mnt/root on zoot/opt /mnt/opt on zoot/root /mnt on zoot/srv /mnt/srv on zoot/usr /mnt/usr off zoot/usr/local /mnt/usr/local on zoot/var /mnt/var off zoot/var/cache /mnt/var/cache on zoot/var/lib /mnt/var/lib off zoot/var/log /mnt/var/log on zoot/var/tmp /mnt/var/tmp on root@zoot:~# systemctl stop zfs-zed root@zoot:~# sed -Ei 's;/mnt/?;/;' /etc/zfs/zfs-list.cache/zoot |
These or
these
are all the valid forms. If bootfs= is set, root=zfs:AUTO becomes valid, but an explicit dataset can always be specified.
The documentation mentions no root= at all,
but dracut hangs waiting for /dev/gpt-auto-root if one isn't specified. |
root@zoot:~# zpool set bootfs=zoot/root zoot root@zoot:~# cp /proc/cmdline /etc/kernel/ root@zoot:~# # Then trim out initrd= and add root=ZFS=zoot/root or root=zfs:AUTO to taste, and/or root@zoot:~# echo 'root=zfs:zoot/root' > /etc/kernel/cmdline |
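Condensed, the cmdline forms in play here are (zoot/root being this pool's root dataset):

```
root=ZFS=zoot/root   # explicit dataset
root=zfs:zoot/root   # explicit dataset, alternate spelling
root=zfs:AUTO        # use the pool's bootfs= property (requires it to be set)
```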
Comment out the old rootfs to prevent re-mounting it on top of the new one and regenerate+reinstall initrds. | root@zoot:~# sed -i 's;.* / .*ext.*;#&;' /etc/fstab root@zoot:~# run-parts --arg={,/boot/vmlinuz-}$(uname -r) /etc/kernel/postinst.d/ |
And now copy the system to the pool; this is the cursed bit. Note the CWD and ignore the few "file exists" errors for the directories — those are by design. |
root@zoot:~# apt install dump The following NEW packages will be installed: dump liblzo2-2 After this operation, 539 kB of additional disk space will be used. root@zoot:/mnt# dump -f - / | restore rf - DUMP: Date of this level 0 dump: Sat Sep 12 02:11:03 2020 DUMP: Dumping /dev/sda2 (/) to standard output DUMP: Label: zoot-root restore: ./boot, ./var, ./&c.: File exists DUMP: 1357080 blocks (1325.27MB) DUMP: finished in 50 seconds, throughput 27141 kBytes/sec DUMP: Date of this level 0 dump: Sat Sep 12 02:11:03 2020 DUMP: Date this dump completed: Sat Sep 12 02:11:53 2020 DUMP: DUMP IS DONE root@zoot:/mnt# rm restoresymtable root@zoot:/mnt# reboot |
If all went well, the system should now prompt for a password:
dracut might try to use a stored mount option like errors=remount-ro
for the rootfs;
in that case mount -t zfs -o zfsutil zoot/root /sysroot and regenerating the initrd (then committing it to the ESP!) will help.
If it says something to the effect of
[FAILED] Failed to start Import ZFS pools by cache file.
See 'systemctl status zfs-import-cache.service' for details.
instead, it's likely that /etc/zfs/zpool.cache
exists and is zero-length, and was copied like this from the real root,
which is like this for god-knows-why.
zpool import zoot and the same mount invocation will let it boot,
then removing the file and regenerating the initrd should fix the problem permanently.
This was fixed in OpenZFS 2.0.3 (Fri, 12 Feb 2021 16:30:01 +0800).
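Both failure modes end at the same place; condensed into one hypothetical emergency-shell transcript (the initrd regeneration and ESP commit happen after rebooting, as described above):

```shell
# From the dracut emergency shell:
zpool import zoot                            # only needed in the zero-length zpool.cache case
mount -t zfs -o zfsutil zoot/root /sysroot   # mount the rootfs where dracut expects it
exit                                         # continue booting
```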
Post-login, the mounts should now be all ZFS:
Or:
nab@zoot:~$ findmnt | grep zfs
/ zoot/root zfs rw,relatime,xattr,noacl
├─/home zoot/home zfs rw,relatime,xattr,noacl
│ └─/home/nab zoot/home/nab zfs rw,relatime,xattr,noacl
├─/opt zoot/opt zfs rw,relatime,xattr,noacl
├─/boot zoot/boot zfs rw,relatime,xattr,noacl
├─/srv zoot/srv zfs rw,relatime,xattr,noacl
├─/var/cache zoot/var/cache zfs rw,relatime,xattr,noacl
├─/root zoot/home/root zfs rw,relatime,xattr,noacl
├─/var/log zoot/var/log zfs rw,relatime,xattr,posixacl
├─/usr/local zoot/usr/local zfs rw,relatime,xattr,noacl
└─/var/tmp zoot/var/tmp zfs rw,relatime,xattr,noacl
And, yes, the chrome of the windows changed from Windows 10 to i3, this is a goddamn mess of a post.
One last part now, to actually use both disks:
I'm specifying the primary disk's rootfs partition by its partlabel,
but ata-QEMU_HARDDISK_QM00001-part2 would also work in this case. |
root@zoot:~# gdisk -l /dev/disk/by-id/ata-QEMU_HARDDISK_QM00001 Disk /dev/disk/by-id/ata-QEMU_HARDDISK_QM00001: 16777216 sectors, 8.0 GiB Disk identifier (GUID): 0D47FC01-7947-4DDE-9506-3BBBCFF572FF Number Start (sector) End (sector) Size Code Name 1 2048 487423 237.0 MiB EF00 zoot-EFI 2 487424 16775167 7.8 GiB 8300 zoot-root root@zoot:~# zpool add zoot zoot-root invalid vdev specification use '-f' to override the following errors: /dev/disk/by-partlabel/zoot-root contains a filesystem of type 'ext4' root@zoot:~# dd if=/dev/zero of=/dev/disk/by-partlabel/zoot-root count=16 root@zoot:~# zpool add zoot zoot-root |
And that's it. There ain't much fanfare to it, since this took only the very best part of a week to get right.
root@zoot:~# zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zoot 15G 777M 14.2G - - 0% 5% 1.00x ONLINE -
root@zoot:~# zpool status
pool: zoot
state: ONLINE
scan: none requested
config:
	NAME                         STATE     READ WRITE CKSUM
	zoot                         ONLINE       0     0     0
	  ata-QEMU_HARDDISK_QM00002  ONLINE       0     0     0
	  zoot-root                  ONLINE       0     0     0
errors: No known data errors
Or, as was succinctly put by the instigator of all this mess:
On Tue, 29 Sep 2020 22:16 -0800, after minor hiccups, the deed has finally been done, with a recommendation of the highest order:
Nit-pick? Correction? Improvement? Annoying? Cute? Anything? Don't hesitate to post
or open an issue!