005. Low-curse ZFS-on-root for new Debian installations on multi-disk systems

Mon, 14 Sep 2020 00:20:20 +0200, updated Tue, 15 Sep 2020 20:08:12 +0200

In a brief departure from microcomputing saprophagia, imagine you're an American acquiring a ThinkPad P15 sporting two two-terabyte drives, and, naturally, you'd like a single continuous volume spanning both of them to do whatever your heart desires.

After the shock at willingly staying in the American empire and paying it for the privilege wears off, two approaches reveal themselves:

  1. LVM, with its native debian-installer support, and
  2. ZFS, without it, but with the ability to do zfs(8)-send backups, live-mirroring to an external enclosure with a [mirror[internal, external], mirror[internal, external]] topology, and me egging you on to do it.

Now, I also wanted to do a ZFS-on-root for my own nefarious uses, and so it was decided:

Good, do it before I buy the p15 and report back :noel:

# Set-up

To minimise how cursed this is, some restrixions are in place: all you need is an EFI-compatible multi-disk platform and some way to EFI-boot it into d-i — the bootloader will be fixed, ZFS installed normally, and the rootfs dumped/restored thereonto after normally booting into the target system.

The test setup is QEMU -bios OVMF.fd and two 8G drives, one of which is designated as primary. Filesystem tuning is not covered, encryption is supported, SecureBoot is not covered because I haven't figured it out yet, most-all gotchas are hopefully explained, there's prior art that was of little help, and zfs-{initramfs,dracut} are full of problems that I try to work around here:

Excerpt from zfs-initramfs, function load_module_initrd that starts with 'if [ $variable > 0 ]'
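For the avoidance of doubt, here's why that line is broken; a minimal repro, nothing ZFS-specific about it:

```shell
# inside [ ], > is not a numeric comparison: the shell parses it as an
# output redirection, so the test degenerates to `[ "$variable" ]`
# (true for any non-empty string) and a stray file named "0" appears in the CWD
cd "$(mktemp -d)"
variable=0
if [ $variable > 0 ]; then echo "taken, despite 0 not being greater than 0"; fi
ls                                    # lists the stray file "0"
if [ "$variable" -gt 0 ]; then echo "greater"; else echo "-gt agrees: not greater"; fi
```

The correct spellings are `[ "$variable" -gt 0 ]` in sh, or `(( variable > 0 ))` in bash.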

# Installation

The restrixions to the original installation are all during the partitioning phase and as follows:

  1. Stick to just the primary drive,
  2. pick an EFI partition large enough for a few kernels and initrds, and
  3. stuff everything onto the rootfs:
d-i partitioning screen showing one of two disks used with a 250MB EFI System Partition marked 'zoot-EFI' and rest for the root FS, marked 'zoot-root'

Either installing sid or enabling backports for buster is required, since zfs-mount-generator(8) appeared in 0.8.0; if you don't install GRUB you'll have to boot manually from the EFI shell after copying the kernel and initrd from target/boot/ to target/boot/efi/.

# Bootloader

Using GRUB and a unified boot-on-root is, for better or for worse, not possible; these are the only zpool features GRUB supports reading:

/*
 * List of pool features that the grub implementation of ZFS supports for
 * read. Note that features that are only required for write do not need
 * to be listed here since grub opens pools in read-only mode.
 */
#define MAX_SUPPORTED_FEATURE_STRLEN 50
static const char *spa_feature_names[] = {
  "org.illumos:lz4_compress",
  "com.delphix:hole_birth",
  "com.delphix:embedded_data",
  "com.delphix:extensible_dataset",
  "org.open-zfs:large_blocks",
  NULL
};

Which means that these, as of OpenZFS 0.8.4, are the ones that it doesn't support:

nabijaczleweli@tarta:~$ man zpool-features | grep -B4 'READ-ONLY.*no' | sed 's/GUID//' | \
                        awk '!/^$/ && !/COMPATIBLE/ && !/--/ \
                             {if(last == "") last = $1; else {print last "\t" $1; last = ""}}' |
                        grep -vE 'lz4_compress|hole_birth|embedded_data|extensible_dataset|large_blocks' | column -t
bookmark_v2            com.datto:bookmark_v2
device_removal         com.delphix:device_removal
edonr                  org.illumos:edonr
encryption             com.datto:encryption
large_dnode            org.zfsonlinux:large_dnode
multi_vdev_crash_dump  com.joyent:multi_vdev_crash_dump
sha512                 org.illumos:sha512
skein                  org.illumos:skein

Confer zpool-features(5) for details as to why this is not acceptable (remember: you couldn't enable any of these even by accident on your root pool, and most of them can never be turned off).
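If you want to audit an existing pool against that list, something like this hypothetical helper would do (the whitelist mirrors GRUB's spa_feature_names above; feed it `zpool get all <pool>` output):

```shell
# print enabled/active features that GRUB cannot read; no output = GRUB-readable
grub_readable='lz4_compress|hole_birth|embedded_data|extensible_dataset|large_blocks'
grub_unreadable() {
    # `zpool get all` columns are NAME PROPERTY VALUE SOURCE;
    # keep non-disabled feature@* properties, strip the prefix
    awk '$2 ~ /^feature@/ && $3 != "disabled" { sub(/^feature@/, "", $2); print $2 }' |
        grep -vE "^(${grub_readable})$"
}
```

Usage: `zpool get all zoot | grub_unreadable` — any line printed means GRUB would refuse the pool.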

Using a separate /boot on ZFS required a lot of dicking around with the options for GRUB to recognise the filesystem at all (confer prior art, Step 2: Disk Formatting, 4. Create the boot pool, with its 12 lines of options for pool features alone) and GRUB had difficulties generating the right root= cmdline options for a normal pool anyway.

Using a separate /boot on ext*/FAT didn't make much sense for booting on EFI, either, so cutting GRUB out was a simple way to be rid of a lot of bollocks.

Boot into your freshly installed system, become root, and:

Get rid of GRUB, optionally keeping EFI tools. root@zoot:~# apt-mark manual efibootmgr mokutil
efibootmgr, mokutil set to manually installed.
root@zoot:~# apt purge grub* && apt autopurge
The following packages will be REMOVED:
  grub-common* grub-efi-amd64* grub-efi-amd64-bin* grub-efi-amd64-signed*
  grub2-common* os-prober* shim-signed* gettext-base* libbrotli1* libfreetype6*
  libfuse2* libpng16-16* shim-helpers-amd64-signed* shim-signed-common* shim-unsigned*
0 upgraded, 0 newly installed, 15 to remove and 0 not upgraded.
After this operation, 44.9 MB disk space will be freed.
Verify that you only have GRUB here before cleaning it out! root@zoot:~# tree /boot/efi/
/boot/efi/
├── EFI
│   └── debian
│       ├── BOOTX64.CSV
│       ├── fbx64.efi
│       ├── grub.cfg
│       ├── grubx64.efi
│       ├── mmx64.efi
│       └── shimx64.efi
└── NvVars
root@zoot:~# rm -rf /boot/efi/EFI/
Install systemd-boot and enable a timeout.
This might not be required on platforms that support the systemd Boot Loader Specification (are there any?); in that case, you'll need to mkdir "/boot/efi/$(cat /etc/machine-id)" instead, since kernel-install won't make it by itself.
root@zoot:~# bootctl install
Created "/boot/efi/EFI", other directories.
Copied "/usr/lib/systemd/boot/efi/systemd-bootx64.efi" to "/boot/efi/EFI/{systemd/systemd-bootx64.efi,BOOT/BOOTX64.EFI}".
Random seed file /boot/efi/loader/random-seed successfully written (512 bytes).
Created EFI boot entry "Linux Boot Manager".
root@zoot:~# sed -i 's/#timeout/timeout/' /boot/efi/loader/loader.conf
Add kernel version to loader entry. root@zoot:~# cp /{usr/lib,etc}/kernel/install.d/90-loaderentry.install
root@zoot:~# diff -U2 /{usr/lib,etc}/kernel/install.d/90-loaderentry.install
--- /usr/lib/kernel/install.d/90-loaderentry.install    2020-09-02 11:49:08.000000000 +0200
+++ /etc/kernel/install.d/90-loaderentry.install        2020-09-13 05:06:49.541120128 +0200
@@ -44,4 +44,6 @@
 if ! [[ $PRETTY_NAME ]]; then
     PRETTY_NAME="Linux $KERNEL_VERSION"
+else
+    PRETTY_NAME+=" with Linux $KERNEL_VERSION"
 fi
Disable a Debian .install hook, which copies the initrd into \MID\VER\initrd despite that being handled by 90-loaderentry.install, which copies it into \MID\VER\BASENAME, thereby duplicating it. This will not be required in the future. root@zoot:~# ln -s /dev/null /etc/kernel/install.d/85-initrd.install
Add sd-boot hooks, for integration with the normal kernel installation/removal flow.
GRUB installs its hooks as zz-update-grub, but kernel-install sorts after anything else I've seen in these directories. I hope to integrate this and make it not required in the future.
root@zoot:~# cat > /etc/kernel/postinst.d/kernel-install
#!/bin/sh
bootctl is-installed > /dev/null || exit 0
exec kernel-install add "$1" "/boot/vmlinuz-$1" "/boot/initrd.img-$1"
^D
root@zoot:~# cat > /etc/kernel/postrm.d/kernel-install
#!/bin/sh
bootctl is-installed > /dev/null || exit 0
exec kernel-install remove "$1"
^D
root@zoot:~# chmod +x /etc/kernel/post{inst,rm}.d/kernel-install
Install the kernel.
The initial run takes a long time, hence the -v; 62dd03a4928c412180b3024ac6c03a90 is this machine's ID.
The current cmdline will be used for the boot entry, overridable with /etc/kernel/cmdline.
root@zoot:~# kernel-install -v add $(uname -r) /boot/vml<TAB> /boot/ini<TAB>
Running depmod -a 5.8.0-1-amd64
Installing /boot/efi/62dd03a4928c412180b3024ac6c03a90/5.8.0-1-amd64/vmlinuz-5.8.0-1-amd64
Creating /boot/efi/loader/entries/62dd03a4928c412180b3024ac6c03a90-5.8.0-1-amd64.conf
root@zoot:~# tree /boot/efi/
/boot/efi/
├── 62dd03a4928c412180b3024ac6c03a90
│   └── 5.8.0-1-amd64
│       ├── initrd
│       ├── linux
│       └── vmlinuz-5.8.0-1-amd64
├── EFI
│   ├── BOOT
│   │   └── BOOTX64.EFI
│   ├── Linux
│   └── systemd
│       └── systemd-bootx64.efi
├── loader
│   ├── entries
│   │   └── 62dd03a4928c412180b3024ac6c03a90-5.8.0-1-amd64.conf
│   ├── loader.conf
│   └── random-seed
└── NvVars

8 directories, 9 files

I'd recommend rebooting now to verify that this works, which should look like this:

text-mode QEMU window with two centered lines, the first one, selected, saying 'Debian GNU/Linux bullseye/sid with Linux 5.8.0-1-amd64', the second saying 'Reboot Into Firmware Interface'

If not, and sd-boot shows errors or doesn't start at all: boot into the EFI shell, run fs0:, then \<MID>\<VER>\linux initrd=\<MID>\<VER>\initrd root=/dev/sda2 (the shell should support tab-completion, though you might need to add a space before completing the initrd; the root= option assumes you installed to the second partition of the first SCSI drive, as I did, so adjust to taste). Then write me an e-mail or a DM or whatever else is listed here so I can issue a correxion; thanks in advance, &c.:

EFI shell, demonstrating the above command

Now we can boot without dealing with GRUB. On to the ZFS bit:

# The ZFS bit

Install the prerequisites, remember to match the headers to your kernel! root@zoot:~# apt install --no-install-recommends linux-headers-amd64 build-essential
The following NEW packages will be installed:
  binutils binutils-common binutils-x86-64-linux-gnu build-essential dpkg-dev
  g++ g++-10 gcc gcc-10 libasan6 libatomic1 libbinutils libc-dev-bin libc6-dev
  libcc1-0 libcrypt-dev libctf-nobfd0 libctf0 libdpkg-perl libgcc-10-dev
  libgdbm-compat4 libgdbm6 libgomp1 libitm1 liblsan0 libperl5.30 libquadmath0
  libstdc++-10-dev libtsan0 libubsan1 linux-compiler-gcc-10-x86
  linux-headers-5.8.0-1-amd64 linux-headers-5.8.0-1-common linux-headers-amd64
  linux-kbuild-5.8 linux-libc-dev make patch perl perl-modules-5.30
0 upgraded, 40 newly installed, 0 to remove and 0 not upgraded.
After this operation, 286 MB of additional disk space will be used.
And ZFS.
Depending on dependency ordering, starting the services sometimes fails; simply re-running the command will fix it.
root@zoot:~# apt install zfs-dkms
The following NEW packages will be installed:
  distro-info-data dkms fakeroot libfakeroot libnvpair1linux libuutil1linux
  libzfs2linux libzpool2linux lsb-release python3-distutils python3-lib2to3
  zfs-dkms zfs-zed zfsutils-linux
After this operation, 22.3 MB of additional disk space will be used.
Building for 5.8.0-1-amd64
Building initial module for 5.8.0-1-amd64
Done.

zfs.ko:
Running module version sanity check.
- Original module
  - No original module exists within this kernel
- Installation
  - Installing to /lib/modules/5.8.0-1-amd64/updates/dkms/
&c.

DKMS: install completed.
root@zoot:~# zpool list
no pools available
Here's the thing: zfs-initramfs is much more broken than zfs-dracut; I've made sure that what follows works with both of them, but I'd recommend using dracut anyway.
If you want to be really sure after this, you can reboot, add break or rd.break to the cmdline, and run a ZFS command in the initrd.
root@zoot:~# apt install zfs-initramfs
The following NEW packages will be installed:
  zfs-initramfs
After this operation, 108 kB of additional disk space will be used.
root@zoot:~# apt install --no-install-recommends dracut zfs-dracut && apt autopurge initramfs-tools
The following packages will be REMOVED:
  initramfs-tools* initramfs-tools-core* klibc-utils* libklibc*
The following NEW packages will be installed:
  dracut dracut-core kpartx libglib2.0-0 pkg-config zfs-dracut
After this operation, 4,644 kB of additional disk space will be used.
root@zoot:~# /etc/kernel/postinst.d/kernel-install $(uname -r)

And now a brief interlude on the layout of ZFS datasets in the VFS.

# Layout

Prior art, Step 3: System Installation issues a lot of commands and doesn't really explain why.

The final mount tree is as such:

               zoot/
/              zoot/root
├── boot       zoot/boot
├── home       zoot/home
│   └── nab    zoot/home/nab
├── root       zoot/home/root
├── opt        zoot/opt
├── srv        zoot/srv
│              zoot/usr
├── usr/local  zoot/usr/local
│              zoot/var
├── var/cache  zoot/var/cache
│              zoot/var/lib
├── var/log    zoot/var/log
└── var/tmp    zoot/var/tmp

Why?

And so:

Move /tmp to tmpfs. The Debian default is to keep it on /; if you, for some reason, prefer this, you can treat it like /var/tmp later. root@zoot:~# ln -s /usr/share/systemd/tmp.mount /etc/systemd/system/
root@zoot:~# systemctl enable tmp.mount
Created symlink /etc/systemd/system/local-fs.target.wants/tmp.mount → /usr/share/systemd/tmp.mount.
root@zoot:~# mv /tmp{,_} && systemctl start tmp.mount && mv /tmp{_,}
Now the pool on the second, heretofore unused, disk.

As promised, filesystem tuning is not included, so I'm not speccing -o ashift=12 -O relatime=on -O compress=lz4 and whatever else, but I am adding encryption with -O encryption=on -O keyformat=passphrase.
root@zoot:~# ls -l /dev/disk/by-id/
ata-QEMU_DVD-ROM_QM00003 -> ../../sr0
ata-QEMU_HARDDISK_QM00001 -> ../../sda
ata-QEMU_HARDDISK_QM00001-part1 -> ../../sda1
ata-QEMU_HARDDISK_QM00001-part2 -> ../../sda2
ata-QEMU_HARDDISK_QM00002 -> ../../sdb
root@zoot:~# zpool create -O mountpoint=/ -O canmount=off -R /mnt zoot ata-QEMU_HARDDISK_QM00002
root@zoot:~# zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zoot  7.50G   564K  7.50G        -         -     0%     0%  1.00x    ONLINE  /mnt
root@zoot:~# zfs list
NAME   USED  AVAIL     REFER  MOUNTPOINT
zoot   492K  7.27G      192K  /mnt
Enable zfs-mount-generator(8) via zed(8) for mount ordering; this makes systemd aware of, i.a., /boot/efi depending on /boot, and makes it mount /var/log before starting journald there, making both mounts (a) behave as expected and (b) work. root@zoot:~# ln -s /usr/lib/zfs-linux/zed.d/history_event-zfs-list-cacher.sh /etc/zfs/zed.d/
root@zoot:~# mkdir -p /etc/zfs/zfs-list.cache
root@zoot:~# touch /etc/zfs/zfs-list.cache/zoot
root@zoot:~# systemctl restart zfs-zed
And the filesystems, as discussed above.
Prior art, Step 3: System Installation, 3. Create datasets uses the com.sun:auto-snapshot property, but as far as I can tell it's used only by third-party tools.
root@zoot:~# zfs create -o mountpoint=/ zoot/root
root@zoot:~# for f in zoot/{home{,/nab},boot,var{,/lib,/log,/cache,/tmp},opt,srv,usr{,/local}}; do zfs create $f; done
root@zoot:~# zfs create -o mountpoint=/root zoot/home/root
root@zoot:~# zfs set canmount=off zoot/{usr,var{,/lib}}
root@zoot:~# zfs set com.sun:auto-snapshot=false zoot/var/{cache,tmp}
root@zoot:~# zfs set acltype=posixacl xattr=sa zoot/var/log
root@zoot:~# chmod 1777 /mnt/var/tmp
root@zoot:~# zfs list -o name,mountpoint,canmount
NAME            MOUNTPOINT      CANMOUNT
zoot            /mnt                 off
zoot/boot       /mnt/boot             on
zoot/home       /mnt/home             on
zoot/home/nab   /mnt/home/nab         on
zoot/home/root  /mnt/root             on
zoot/opt        /mnt/opt              on
zoot/root       /mnt                  on
zoot/srv        /mnt/srv              on
zoot/usr        /mnt/usr             off
zoot/usr/local  /mnt/usr/local        on
zoot/var        /mnt/var             off
zoot/var/cache  /mnt/var/cache        on
zoot/var/lib    /mnt/var/lib         off
zoot/var/log    /mnt/var/log          on
zoot/var/tmp    /mnt/var/tmp          on
root@zoot:~# systemctl stop zfs-zed
root@zoot:~# sed -Ei 's;/mnt/?;/;' /etc/zfs/zfs-list.cache/zoot
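To illustrate what that sed does (the cache file is tab-separated, mountpoint in the second column; these sample lines are abbreviated, the real file has more columns):

```shell
# the altroot prefix /mnt (with or without a trailing slash) becomes /
printf 'zoot\t/mnt\toff\nzoot/root\t/mnt\ton\nzoot/var/log\t/mnt/var/log\ton\n' |
    sed -E 's;/mnt/?;/;'
```

which prints `zoot	/	off`, `zoot/root	/	on`, and `zoot/var/log	/var/log	on`: exactly the mountpoints the generator should produce post-reboot.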
Both the root=zfs:… and root=ZFS=… spellings are valid forms.
If bootfs= is set, root=zfs:AUTO becomes valid, but an explicit pool can always be specified. The documentation mentions no root= at all, but dracut hangs waiting for /dev/gpt-auto-root if one isn't specified.
root@zoot:~# zpool set bootfs=zoot/root zoot
root@zoot:~# cat /proc/cmdline > /etc/kernel/cmdline
root@zoot:~# echo 'root=zfs:zoot/root' > /etc/kernel/cmdline
root@zoot:~# # Trim out initrd= and add root=ZFS=zoot/root or root=zfs:AUTO to taste.
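For reference, assuming the paths from the transcript above, a plausible /etc/kernel/cmdline and the Boot Loader Specification entry kernel-install generates from it look like this (a sketch, not verbatim output):

```
# /etc/kernel/cmdline
root=zfs:zoot/root ro quiet

# /boot/efi/loader/entries/62dd03a4928c412180b3024ac6c03a90-5.8.0-1-amd64.conf
title      Debian GNU/Linux bullseye/sid with Linux 5.8.0-1-amd64
version    5.8.0-1-amd64
machine-id 62dd03a4928c412180b3024ac6c03a90
options    root=zfs:zoot/root ro quiet
linux      /62dd03a4928c412180b3024ac6c03a90/5.8.0-1-amd64/linux
initrd     /62dd03a4928c412180b3024ac6c03a90/5.8.0-1-amd64/initrd
```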
Comment out old rootfs to prevent re-mounting it on top of the new one and regenerate+reinstall initrds. root@zoot:~# sed -i 's;.* / .*ext.*;#&;' /etc/fstab
root@zoot:~# run-parts --arg=$(uname -r) /etc/kernel/postinst.d/
And now copy the system to the pool; this is the cursed bit.
Note the CWD and ignore the few "file exists" errors for the directories — those are by design.
root@zoot:~# apt install dump
The following NEW packages will be installed:
  dump liblzo2-2
After this operation, 539 kB of additional disk space will be used.
root@zoot:/mnt# dump -f - / | restore rf -
  DUMP: Date of this level 0 dump: Sat Sep 12 02:11:03 2020
  DUMP: Dumping /dev/sda2 (/) to standard output
  DUMP: Label: zoot-root
restore: ./boot, ./var, ./&c.: File exists
  DUMP: 1357080 blocks (1325.27MB)
  DUMP: finished in 50 seconds, throughput 27141 kBytes/sec
  DUMP: Date of this level 0 dump: Sat Sep 12 02:11:03 2020
  DUMP: Date this dump completed: Sat Sep 12 02:11:53 2020
  DUMP: DUMP IS DONE
root@zoot:/mnt# rm restoresymtable
root@zoot:/mnt# reboot

If all went well, the system should now prompt for a password:

'Encrypted ZFS password for zoot/root' prompt after boot with many messages

dracut might try to use a stored mount option like errors=remount-ro for the rootfs; in that case mount -t zfs -o zfsutil zoot/root /sysroot and regenerating the initrd will help.

If it says something to the effect of

[FAILED] Failed to start Import ZFS pools by cache file.
See 'systemctl status zfs-import-cache.service' for details.

instead, it's likely that /etc/zfs/zpool.cache exists and is zero-length, copied as-is from the real root, where it's zero-length for god-knows-why. zpool import zoot and the same mount invocation will let it boot; then removing the file and regenerating the initrd should fix the problem permanently.

Post-login, the mounts should now be all ZFS:

findmnt output, showing that all filesystems that should be ZFS (/, /home, &c.) are

Or:

nab@zoot:~$ findmnt | grep zfs
/                              zoot/root      zfs        rw,relatime,xattr,noacl
├─/home                        zoot/home      zfs        rw,relatime,xattr,noacl
│ └─/home/nab                  zoot/home/nab  zfs        rw,relatime,xattr,noacl
├─/opt                         zoot/opt       zfs        rw,relatime,xattr,noacl
├─/boot                        zoot/boot      zfs        rw,relatime,xattr,noacl
├─/srv                         zoot/srv       zfs        rw,relatime,xattr,noacl
├─/var/cache                   zoot/var/cache zfs        rw,relatime,xattr,noacl
├─/root                        zoot/home/root zfs        rw,relatime,xattr,noacl
├─/var/log                     zoot/var/log   zfs        rw,relatime,xattr,posixacl
├─/usr/local                   zoot/usr/local zfs        rw,relatime,xattr,noacl
└─/var/tmp                     zoot/var/tmp   zfs        rw,relatime,xattr,noacl

And, yes, the window chrome changed from Windows 10 to i3; this is a goddamn mess of a post.

One last part now, to actually use both disks:

I'm specifying the primary disk's rootfs partition by its partlabel, but ata-QEMU_HARDDISK_QM00001-part2 would also work in this case. root@zoot:~# gdisk -l /dev/disk/by-id/ata-QEMU_HARDDISK_QM00001
Disk /dev/disk/by-id/ata-QEMU_HARDDISK_QM00001: 16777216 sectors, 8.0 GiB
Disk identifier (GUID): 0D47FC01-7947-4DDE-9506-3BBBCFF572FF

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048          487423   237.0 MiB   EF00  zoot-EFI
   2          487424        16775167   7.8 GiB     8300  zoot-root
root@zoot:~# zpool add zoot zoot-root
invalid vdev specification
use '-f' to override the following errors:
/dev/disk/by-partlabel/zoot-root contains a filesystem of type 'ext4'
root@zoot:~# dd if=/dev/zero of=/dev/disk/by-partlabel/zoot-root count=16
root@zoot:~# zpool add zoot zoot-root

And that's it. There ain't much fan-fare to it, since this took only the very best part of a week to get right.

root@zoot:~# zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zoot    15G   777M  14.2G        -         -     0%     5%  1.00x    ONLINE  -
root@zoot:~# zpool status
  pool: zoot
 state: ONLINE
  scan: none requested
config:

        NAME                         STATE     READ WRITE CKSUM
        zoot                         ONLINE       0     0     0
          ata-QEMU_HARDDISK_QM00002  ONLINE       0     0     0
          zoot-root                  ONLINE       0     0     0

errors: No known data errors

Or, as was succinctly put by the instigator of all this mess:

[01:58] Griwes: okay, *to be fair* this looks less cursed than I thought it would be, so kudos :noel:


Nit-pick? Correction? Improvement? Annoying? Cute? Anything? Don't hesitate to post or open an issue!


Creative text licensed under CC-BY-SA 4.0, code licensed under The MIT License.
This page is open-source, you can find it at GitHub, and contribute and/or yell at me there.
Like what you see? Consider giving me a follow over at the social medias listed here, or maybe even sending a buck or two my way on Patreon if my software helped you in some significant way?
Automatically generated with GCC 5.4.0's C preprocessor on 15.09.2020 18:13:04 UTC from src/blogn_t/005-low-curse-zfs-on-root.html.pp.
See job on TravisCI.
RSS feed