011. pipe exclusion with splice() under Linux

Fri, 07 Jul 2023 01:42:34 +0200

Contents

    1. how?
    2. I'm not at a teletype‽
    3. Bisexion
    4. why'd I care?
    5. timeline

Since apparently I'm the only splice(2) user, herein I demonstrate a fun Linux BKL moment (but the BKL is on a pipe).

If you use a splicing cat, you can do this right now from your teletype: just

$ cat | whatever

and whatever will sleep forever on reads from its standard input stream, even if it set O_NONBLOCK on it.

That's boring tho, since anonymous pipes are, well, anonymous. What about

$ mkfifo fifo
$ whatever < fifo &
$ cat > fifo

? The same applies! Even better:

$ > fifo

from another teletype sleeps forever as well. So does the < direxion. And O_NONBLOCK.

And any operation on that pipe. Try sending a deadly signal to any of the afflicted (non-cat) processes, too!
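For contrast, a minimal sketch of what whatever is entitled to expect: with O_NONBLOCK set, a read() from an empty pipe that still has a writer attached ought to fail with EAGAIN straight away instead of sleeping. set_nonblock() and try_read() are names made up for this example and don't come from any library.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: set O_NONBLOCK on fd, keeping its other status flags. */
static int set_nonblock(int fd) {
  int fl = fcntl(fd, F_GETFL);
  if (fl == -1)
    return -1;
  return fcntl(fd, F_SETFL, fl | O_NONBLOCK);
}

/* Hypothetical helper: a single read attempt.
 * Returns the byte count, 0 on EOF, or -errno
 * (-EAGAIN when the pipe is empty but a writer is still attached). */
static ssize_t try_read(int fd, void *buf, size_t len) {
  ssize_t rd = read(fd, buf, len);
  return rd == -1 ? -errno : rd;
}
```

On an unaffected pipe, try_read() after set_nonblock() comes back with -EAGAIN immediately; on a pipe that a splicing cat is sat on, it doesn't come back at all.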

If you don't, then

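#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
int main() {
  ssize_t rd, acc = 0;
  // move up to 128MiB per call from stdin to stdout (one of which must be a pipe), until EOF (0) or error (-1)
  while ((rd = splice(0, 0, 1, 0, 128 * 1024 * 1024, 0)) > 0)
    acc += rd;
  // %m is the glibc spelling of strerror(errno)
  fprintf(stderr, "sp=%zd: %m\n", acc);
}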

can function as the proverbial cat. You can also replace the splice(…) with a sendfile(1, 0, 0, 128 * 1024 * 1024), since 5.12 special-cases sendfile(any→pipe) as legal and equivalent to splice() of the same, even though otherwise it only allows seekable→any.
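Spelled out, the sendfile() flavour might look like the sketch below. sendfile_cat() is a name made up for this example; the fds are parameters instead of the hard-coded 0 and 1 only so that it can be exercised on something other than a teletype, and returning -1 on error is my choice, not anything the original cat does.

```c
#define _GNU_SOURCE
#include <sys/sendfile.h>
#include <sys/types.h>

/* Hypothetical helper: the same loop as the splice() cat, but via sendfile().
 * A null offset pointer means "read from in's current file position".
 * Returns the total moved, or -1 if sendfile() failed (errno is left set). */
static ssize_t sendfile_cat(int out, int in) {
  ssize_t rd, acc = 0;
  while ((rd = sendfile(out, in, 0, 128 * 1024 * 1024)) > 0)
    acc += rd;
  return rd == -1 ? -1 : acc;
}
```

sendfile_cat(1, 0) with a teletype on 0 and a pipe on 1 is then the any→pipe case from above, so it only works on 5.12 and later; with a regular file on the input side it's the ordinary seekable→any case.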

But

# how?

Quite easily – splice_file_to_pipe(), which, shockingly, runs when splicing from a non-pipe to a pipe, locks the output pipe, then does I/O, then unlocks it. Locking the pipe naturally excludes concurrent open()s, read()s, write()s, and final close()s (incl. implicit ones on death).

Usually you wouldn't think this to be a huge issue, since most I/O completes within some reasonably-bounded time, but teletype I/O, by design, never does until a newline/eof/eol/eol2. And, thus, QED.

Additionally, for splicing from a pipe to a socket, if the write to the socket would block, that sleep is taken with the pipe lock held. The same applies to the generic implementation, which is only an issue for teletypes and FUSE.
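The shape of the problem (lock, sleep in I/O, unlock) can be imitated entirely in userspace. The sketch below is an analogy and not the kernel code: an fcntl() record lock on a temporary file stands in for the pipe lock, and a read() from a second pipe stands in for the teletype read that never finishes. lock_op() and lock_held_across_io() are names made up for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take (F_WRLCK) or drop (F_UNLCK) a whole-file record lock, never waiting. */
static int lock_op(int fd, short type) {
  struct flock fl = {.l_type = type, .l_whence = SEEK_SET};
  return fcntl(fd, F_SETLK, &fl);
}

/* 1 if the lock was unobtainable while the child slept in its read()
 * and obtainable once that read() finished; 0 if not; -1 if setup failed. */
static int lock_held_across_io(void) {
  FILE *f = tmpfile();
  int ready[2], slow[2];
  char c = 0;
  if (!f || pipe(ready) || pipe(slow))
    return -1;
  int lk = fileno(f);

  pid_t child = fork();
  if (child == -1)
    return -1;
  if (child == 0) {
    close(ready[0]);
    close(slow[1]);
    if (lock_op(lk, F_WRLCK) ||       /* stands in for locking the pipe */
        write(ready[1], &c, 1) != 1)  /* tell the parent the lock is held */
      _exit(1);
    if (read(slow[0], &c, 1) < 0)     /* the "I/O": asleep until the parent lets go */
      _exit(1);
    lock_op(lk, F_UNLCK);             /* stands in for unlocking the pipe */
    _exit(0);
  }
  close(ready[1]);
  close(slow[0]);

  int excluded = 0;
  if (read(ready[0], &c, 1) == 1)     /* wait until the child holds the lock */
    excluded = lock_op(lk, F_WRLCK) == -1 && (errno == EAGAIN || errno == EACCES);
  close(slow[1]);                     /* EOF on the child's read: the "I/O" completes */
  waitpid(child, 0, 0);
  int free_after = lock_op(lk, F_WRLCK) == 0;

  close(ready[0]);
  fclose(f);
  return excluded && free_after;
}
```

The difference from the real thing is that here the parent asks without waiting (F_SETLK) and so gets told no, whereas everyone queueing on the pipe lock just sleeps until the splicer's I/O is done.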

But

# I'm not at a teletype‽

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
int main() {
  // a fresh pty whose master side is never written to…
  int pt = posix_openpt(O_RDWR);
  grantpt(pt);
  unlockpt(pt);
  // …so reads from the slave side never complete
  int cl = open(ptsname(pt), O_RDONLY);
  for(;;)
    splice(cl, 0, 1, 0, 128 * 1024 * 1024, 0);
}

or even

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/socket.h>
int main() {
  int sp[2];
  socketpair(AF_UNIX, SOCK_STREAM, 0, sp);
  // nothing is ever written to sp[1], so reads from sp[0] never complete
  for(;;)
    splice(sp[0], 0, 1, 0, 128 * 1024 * 1024, 0);
}

# Bisexion

By rough-bisecting off snapshot.d.o kernel packages – since 4.0, and even 5.0, don't build on bookworm – to between 4.8.15-2 and 4.9.1-1~exp1, then manually bisecting between v4.8 and v4.9 – in a stretch chroot, naturally, since images built on buster hard-rebooted QEMU in a tight loop just after the decompressor and ELF parsing; strapping the chroot took two hours of baby-sitting due to the current state of s.d.o, and most revisions only build with an ubuntu patch; so much for never breaking fucking userspace –

commit 8924feff66f35fe22ce77aafe3f21eb8e5cff881 ("splice: lift pipe_lock out of splice_to_pipe()")

is the first bad commit.

(The smoketest is:

./v > fifo &
read -r _ < fifo &
echo zupa > fifo

good is it completes; bad is it hangs.)
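A smoketest whose bad outcome is "hangs" is awkward to script, so below is a sketch of a probe with a timeout, leaning on the observation above that even an O_NONBLOCK open() of an afflicted FIFO sleeps. open_completes() is a name made up for this example; the 10ms polling interval is arbitrary.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical probe: 1 if a non-blocking open() of path returns within
 * timeout_ms ("good"), 0 if it's still asleep by then ("bad"), -1 on error. */
static int open_completes(const char *path, int timeout_ms) {
  pid_t child = fork();
  if (child == -1)
    return -1;
  if (child == 0)
    /* On an unaffected FIFO this returns straight away, writer or no writer. */
    _exit(open(path, O_RDONLY | O_NONBLOCK) == -1);

  for (int waited = 0; waited <= timeout_ms; waited += 10) {
    if (waitpid(child, 0, WNOHANG) == child)
      return 1;
    nanosleep(&(struct timespec){.tv_nsec = 10 * 1000 * 1000}, 0);
  }
  /* Still stuck: the signal only lands once the lock is released,
   * so don't block waiting for the child here. */
  kill(child, SIGKILL);
  return 0;
}
```

The open() happens in a child because a process stuck there can't be interrupted; the parent can at least give up and report. The stuck child is left unreaped on the bad path.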

This aligns with the origin of the modern pipe_lock() placement I got by recursive blame.

But

# why'd I care?

Depends if you're running, like, nullmailer, in which case ./v > /var/spool/nullmailer/trigger makes it ⇒ any subsequent MUA ⇒ any subsequent sender (if wait()ing synchronously) enter the signal-impervious mutex-sleeping state, which can only be recovered from by killing the splicing process. Good luck finding that, since this affects any ptracing process as well.

Or any other message or log collection system where – especially unprivileged – users write stuff to a pipe, since they've now been granted a total exclusion thereon.

Even in innocuous situations like QEMU with -chardev pipe,id=pipe,path=$HOME/uwu/q -serial chardev:pipe, catting to ~/uwu/q.in (besides only waking up every second line, which is just business as usual) excludes emulation.

I've always wanted to have a

# timeline

sexion.

2023-06-26T00:17:33
I confusion-post that splice() breaks O_NONBLOCK (in tail -f).
2023-06-26T03:12:10
This is re-posted to linux-fsdevel@ with a "rudimentary analysis".
2023-06-26T13:59:09
In reply to Christian I discover that this does actually fully exclude read()s/write()s/open()s.
2023-07-05T23:19:22
I mail security@ with, effectively, the contents of this post (sans the sendfile tidbit).
2023-07-06T19:18:13
In reply to Christian I discover the sendfile thing; variously-broken and -untested patches are circulating.
2023-07-06T23:56:45
Linus posts the "slightly tested version".
2023-07-08T22:06:56
After a few more rounds of patches, Linus concedes that potentially "we need to just bite the bullet and say »copy_splice_read() needs to use a non-blocking kiocb for the IO«.".
2023-07-09T03:03:22
I post a summary diff that does that.
2023-10-16T22:35:28
I post a re-based and itemised patchset that does the same thing, and also find similar conditions in the pipe → file direxion for sockets.
2023-12-12T11:12:28
I RESEND.
2023-12-14T19:44:42
After consternation w.r.t. "who's gonna take this?" (how would I know) and "I do wish the CC list had been setup a bit more deliberately" (Documentation/process/submitting-patches.rst says "paste get_maintainer.pl into Cc:", in as many words, which is probably better-suited for "here's 5 patches for the memfrob driver" instead of "here's an identical diff against 11 modules"; too bad I keep finding issues that are the latter), I RERESEND.
2023-12-19T23:24:14
I post an extracted removal of fully-dead-and-no-one-noticed-since-2016 relay_file_splice_read() (-162!).
2023-12-21T04:08:41
I post a fully-tested v2 with reproducers and a fixed (via disabling) teletype handling, new FUSE code, and most importantly a re-imagined framing in terms of a security model derived from only the real root being allowed to mount filesystems and an attack being possible if any other files sleep.
2023-12-24T06:01:49
I append a fix for splice(pipe→tty) and splice(pipe→FUSE).




Creative text licensed under CC-BY-SA 4.0, code licensed under The MIT License.
Compiled with Clang 19's C preprocessor on 04.09.2025 20:51:39 UTC from src/blogn_t/011-linux-splice-exclusion.html.pp.