023,b. A J. DeFelice polemick on V1 unix buffering schemes, novel results regarding V1 unix mounts, on the conceptualisation of the file system and block device partitioning

Fri, 23 Jan 2026 21:28:16 +0100

Contents

      1. proc. status
      2. Buffered block I/O
      3. I-node cache
      4. And now, a vicious take-down of a retiree who's made a minor error 55 years ago

If we cast our minds back and eyeballs down to write (II) ⇒ syswrite ⇒ writei ⇒ dskw (write routine for non-special files), lines 212-255:

dskw: / write routine for non-special files
	mov	(sp),r1 / get an i-node number from the stack into r1
	jsr	r0,iget / write i-node out (if modified), read i-node 'r1'
		        / into i-node area of core
	mov	 *u.fofp,r2 / put the file offset [(u.off) or the offset in
		            / the fsp entry for this file] in r2
	add	 u.count,r2 / no. of bytes to be written + file offset is
		            / put in r2
	cmp	 r2,i.size / is this greater than the present size of
		           / the file?
	blos	 1f / no, branch
	 mov	r2,i.size / yes, increase the file size to file offset +
		           / no. of data bytes
	 jsr	r0,setimod / set imod=1 (i.e., core inode has been
		           / modified), stuff time of modification into
		           / core image of i-node
1:
	jsr	r0,mget / get the block no. in which to write the next data
		        / byte
	bit	*u.fofp,$777 / test the lower 9 bits of the file offset
	bne	2f / if its non-zero, branch; if zero, file offset = 0,
		   / 512, 1024,...(i.e., start of new block)
	cmp	u.count,$512. / if zero, is there enough data to fill an
		              / entire block? (i.e., no. of
	bhis	3f / bytes to be written greater than 512.? Yes, branch.
		   / Don't have to read block
2: / in as no past info. is to be saved (the entire block will be
   / overwritten).
	jsr	r0,dskrd / no, must retain old info.. Hence, read block 'r1'
		         / into an I/O buffer
3:
	jsr	r0,wslot / set write and inhibit bits in I/O queue, proc.
		         / status=0, r5 points to 1st word of data
	jsr	r0,sioreg / r3 = no. of bytes of data, r1 = address of data,
		          / r2 points to location in buffer in which to
		          / start writing data
2:
	movb	(r1 )+,(r2)+ / transfer a byte of data to the I/O buffer
	dec	r3 / decrement no. of bytes to be written
	bne	2b / have all bytes been transferred? No, branch
	jsr	r0,dskwr / yes, write the block and the i-node
	tst	u.count / any more data to write?
	bne	1b / yes, branch
	jmp	ret / no, return to the caller via 'ret'
Heed emphasised fragments
(labels colour-coded; note that 123 is octal and 123. is decimal; _-prefixed names expositional)

which can be translated to

extern r1, r2, r3, cdev;
dskw(_ino) /* write routine for non-special files */
{
	r1 = _ino; iget(); /* write i-node out (if modified), read i-node 'r1' on 'cdev'
	                      into i-node area of core */
	r2 = *u.fofp + u.count; /* file offset [(u.off) or the offset in
	                           the fsp entry for this file] +
	                           no. of bytes to be written */
	if(r2 > i.size) {
		i.size = r2;
		setimod();
	}

	while(u.count) {
		mget(); /* get the block no. in which to write the next data byte */

		/* if lower 9 bits of file offset are 0,
		   file offset = 0, 512, 1024,...(i.e., start of new block): */
		if(*u.fofp & 511 || u.count < 512) {
			dskrd();  /* if there is not enough data to fill an entire block, */
		}	          /* read block 'r1' on 'cdev' into an I/O buffer */

		wslot(); /* set write and inhibit bits in I/O queue, proc. status=0,
		            r5 points to 1st word of data */
		sioreg(); /* r3 = no. of bytes of data,
		             r1 = address of data,
		             r2 points to location in buffer in which to start writing data */
		_memcpy(r2, r1, r3); /* transfer a byte of data to the I/O buffer */

		dskwr(); /* yes, write the block and the i-node */
	}
}

we observe the following: accd'g to DeFelice's commentary, the final dskwr call will schedule I/O such that the (implied current) i-node and the freshly-written data block are sent to disk. This is not true: the old i-node (if any) was already enqueued, the current i-node is not affected by this call, and the only buffer scheduled here is the new data block.

This is all the more interesting because the behaviour of the iget call is labelled correctly, and would contradict the supposed dskwr behaviour.

# proc. status

is the processor status, which contains arithmetic status flags (overflow, carry, zero, negative) and the interrupt priority:

During the interrupt acknowledge and priority arbitration phase the LSI-11/23 processor will acknowledge interrupts under the following conditions:
  1. The device interrupt priority is higher than the current PS<7:5>.
  2. The processor has completed instruction execution and no additional bus cycles are pending.

which can themselves go up to 7 (and micro/pdp-11's smallest interrupt priority seems to be 4), so setting it to 7 blocks all interrupts. The V1 kernel uses these values:

Instruction      Value                         Priority   Occurrences   Filter
clr *$ps           0₈ = 0000 0000 0000 0000    000 = 0    13+1          All interrupts allowed
mov $240,*$ps    240₈ = 0000 0000 1010 0000    101 = 5    11+1          Priority 6, 7 interrupts allowed
mov $300,*$ps    300₈ = 0000 0000 1100 0000    110 = 6    2+0           Priority 7 interrupts allowed
mov $340,*$ps    340₈ = 0000 0000 1110 0000    111 = 7    2+0           All interrupts blocked

Priority 7 is used to guard the I/O buffer queue: set by bufaloc — which finds, locks, and returns a dynamic I/O buffer slot or sleeps in I/O wait until one is available — and in ppoke which wraps poke in priority=7 — used by static sb0/sb1/swp I/O buffer slots. poke enqueues dirty buffers onto their respective device I/O queues (so it requires priority=7 on entry — this matches "must call with these locks held" interface contracts in modern Linux).
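
A minimal sketch of that pattern, for concreteness (0177776 is the conventional PDP-11 processor status word address; the wrapper's exact shape is my reconstruction, not a transcription of ppoke):

#define PS (*(volatile unsigned *)0177776)   /* processor status word */

extern void poke(void);   /* enqueues dirty buffers onto the device I/O queues */

void _ppoke(void)
{
	PS = 0340;   /* priority 7: no interrupt handler can touch the queue now */
	poke();
	PS = 0;      /* and back down (the post notes V1 simply clears priority) */
}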

Priority 6 (swap, sleep) is used to guard the run queues (runq, p.link). Their precise structure is thankfully not relevant here.

Priority 5 is used by the teletype driver to protect tty state (thus why its use sprawls quite so).

Priority is cleared when not in a critical section (or during lock inversion while in I/O sleep in idle).

One would be remiss to not note that this interface, obviously, is not very good: it's less of an issue for a system with barely any memory management, but overwriting other flags (well, the non-ephemeral ones to the left of the priority) combined with a text-search-defeating non-zero offset both suck ass. Thus, subsequent systems grow an spl instruction and C unixes have (V4 has) spl[014567]() functions, which use this instruction on the pdp-11/45 (and open-code it on the pdp-11/40, emulating atomicity by ORing to get to priority 7, then clearing the unneeded bits). (Note also how pdp-11/40 spl4() falls through to spl5().)
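
For comparison, the 11/40 trick described above reads roughly like this (a hedged sketch, not the V4 source; the two-step BIS/BIC dance is the point):

#define PS (*(volatile unsigned *)0177776)

void _spl5(void)   /* go to priority 5 without ever exposing a lower level */
{
	PS |= 0340;    /* BIS: force the priority field up to 111 = 7 first    */
	PS &= ~0100;   /* BIC: drop the middle bit, 111 -> 101 = priority 5    */
}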

# Buffered block I/O

was touched upon in the last post, but only insofar as the number of dynamic buffers.

_-prefixed names were invented for exposition; others match source.
Upper-case comments are DeFelice's; others from me.
This is covered by the final file with variables and the top of on-entry initialisation.
struct _iobuf_hdr {
	char  _dev;       	// 0=rf0/1=rk0/≥2=tap0-7 (dectape)
	char  _flags;
	int   _block_number;
	int   _word_count;	// negative
	char *_buf;
};
struct _iobuf_hdr * bufp[nbuf + 3];  // priority = reverse index

struct {
	char _ttybufs[140][ntty];
	struct _dynamic_iobuf {
		struct _iobuf_hdr _hdr;
		char              _buf[512];
	} _iobufs[nbuf];
} buffer;

bufp[0..nbuf] = &buffer._iobufs[0..nbuf]._hdr;
bufp[0..nbuf]->_buf        = buffer._iobufs[0..nbuf]._buf;
bufp[0..nbuf]->_word_count = -256;  // 512 bytes


struct _iobuf_hdr sb0;  // I/O queue entry drum
struct _iobuf_hdr sb1;  // I/O queue entry disk (mounted device)
struct _iobuf_hdr swp;  // I/O queue entry core image being swapped

struct /* 218 bytes */ systm;  // rootfs superblock
struct inode           inode;
bufp[nbuf]      = &sb0;
sb0._buf        = &systm;
sb0._word_count = (&systm - &inode) / 2;

char mount[1024];              // second filesystem superblock
bufp[nbuf + 1]  = &sb1;
sb1._buf        = &mount;
sb1._word_count = -512;  // 1024 bytes

union {
	struct /*...*/ u;             // process-specific state
	char           user[64];
} [[pin(core - 64)]];          // (precedes start of userspace memory)
bufp[nbuf + 2] = &swp;
swp._buf       = &user;
// swp._word_count set when swapping process in/out

Herein we observe that:

  1. this assembler uses ! for bit inversion (what C later called ~; insert Rust retvrn gag)
  2. unix supports exactly 1 or 2 filesystems being mounted at any time
  3. the super-block of the second filesystem, if any, is 1k, which matches file system (V): "Every file system storage volume (e.g. RF disk, RK disk, DECtape reel) has a common format […] divided into […] 512 byte) blocks. Blocks 0 and 1 are collectively known as the super-block."
  4. the mandatory file-system resides on rf0 and its super-block is but 218 bytes!
  5. this is because the V1 file system is self-describing unstructured slop with its two fields (free-block bitmap, free-i-node bitmap) laid out length-prefixed, unpadded, in order, and both parsers effectively take an int *
  6. because of this, the root filesystem's size is encoded in the kernel — 128 * 8 blocks, 64 * 8 i-nodes (also mirrored in the installer) — and changing this would render kernels configured with differently-sized root filesystems unbootable
  7. (but if you copy a rootfs to a different device, it's still usable generically)
  8. these sizes mean that, by definition, the rootfs is 1024 blocks long (512k, matching the documented RK05 size), and can contain up to 512 files
  9. (it's actually less, because the blocks at the end of the filesystem are marked as used and data is stored there at well-known addresses: the last 32k of the disk is used for the bootloader and kernels, then 136k for swapped-out process images (8k+512 per process (corresponding to the userspace memory size + 1 block for kernel user area), nproc = 16); i-node storage takes up an additional 32 blocks, so there are 654 data blocks (327k) on the rootfs; this arithmetic is checked in the sketch after this list)
  10. ((if you added more processes to your unix (or if you added more memory so you're swapping out more memory per process) but didn't re-create the rootfs, then the system will boot but the bottom of the area the new unix will swap to will still be marked as free, so if you happen to allocate blocks in that range (i.e. when you fill your rootfs) then those blocks will overwrite parts of the swapped-out image for the last process in the process tables))
  11. the rootfs's privileged position means you can just stuff debug data after the filesystem stuff in the super-block and it's chill, so long as you keep it under 512 bytes — they advertise this as tm (I) — system-wide accounting of time spent in the kernel (s.syst, which tm calls ovh, which is rather telling of the attitudes of the time) vs. in what we'd call iowait (s.wait/dsk) vs. idle (no process to run) (s.idlet/idl) vs. in userspace (s.chrgt/usr) + error count (s.drerr/der; the source doesn't use this symbol at all, but it's got fully functional logs in the manual, so perhaps DeFelice got an early (or late) version with this stubbed out? unclear.) — but the real current time is also kept there (the clock updating the time(s) doesn't cause the super-block to be written to disk, of course, but it will come along with the next update)
  12. an additional side-effect of this is that on each rootfs superblock update (i.e. whenever the block or i-node bitmaps are updated, i.e. on file creation/destruction and file growth/shrinkage), less than half a block is written to disk; conversely, every time the optional file-system is updated (as above), the entire kilobyte is sent to disk/tape
  13. similarly, this frees appx. two dynamic I/O buffers' worth of space in the kernel, which, as proven before, is very much worth it, and gaining two I/O buffers (4 → 6 (shipped value)) makes the system ~30% faster in an emulator
  14. the theoretically-maximal size of the optional file-system is (see the arithmetic sketch after this list)
    • 8160 blocks (4M-16k with 0 i-nodes); or
    • 7664 i-nodes (with 496 blocks (248k), of which 15 (7.5k) available for data); or
    • 4080 blocks (2M-8k) * 4080 i-nodes (leaving 3823 blocks (1911k) for data)
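
Since the block and i-node counts in items 9 and 14 take a bit of mental math, here is the arithmetic spelled out (my computation, not from the source; 512-byte blocks, 32-byte i-nodes, a 2-block super-block whose payload is the two length-prefixed bitmaps):

#include <stdio.h>

int main(void)
{
	/* item 9: the rootfs (128*8 = 1024 blocks, 64*8 = 512 i-nodes) */
	int boot = (32 * 1024) / 512;              /* bootloader + kernels: 64 blocks */
	int swap = 16 * ((8 * 1024 + 512) / 512);  /* nproc=16 images of 8k+512: 272  */
	int inds = 512 * 32 / 512;                 /* i-node storage: 32 blocks       */
	printf("rootfs data blocks: %d\n", 1024 - 2 - boot - swap - inds);   /* 654 */

	/* item 14: the mounted super-block is 1024 bytes; minus the two length
	   words, 1020 bytes are split between the block and i-node bitmaps */
	int pairs[][2] = { {1020, 0}, {62, 958}, {510, 510} };   /* bitmap bytes */
	for (int k = 0; k < 3; k++) {
		int blocks = pairs[k][0] * 8, inodes = pairs[k][1] * 8;
		int meta   = 2 + (inodes * 32 + 511) / 512;   /* super-block + i-node blocks */
		printf("%4d blocks, %4d i-nodes, %4d left for data\n",
		       blocks, inodes, blocks > meta ? blocks - meta : 0);
	}
	return 0;   /* 654; then 8160/0, 496/7664/15, 4080/4080/3823 */
}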

One would also be remiss not to note this error in the analysis:

DeFelice calls base 256 base 8.
imap: / get the byte that has the allocation bit for the i-number contained
      / in r1
	mov	$1,mq / put 1 in the mq
	mov	r1,r2 / r2 now has i-number whose byte in the map we
 		      / must find
	sub	$41.,r2 / r2 has i-41
	mov	r2,r3 / r3 has i-41
	bic	$!7,r3 / r3 has (i-41) mod 8 to get the bit position
	mov	r3,lsh / move the 1 over (i-41) mod 8 positions to the left
		       / to mask the correct bit
	asr	r2
	asr	r2
	asr	r2 / r2 has (i-41) base 8 of the byte no. from the start of
		   / the map
	mov	r2,-(sp) / put (i-41) base 8 on the stack
	mov	$systm,r2 / r2 points to the in-core image of the super
			  / block for drum
	tst	cdev / is the device the disk
	beq	1f / yes
	add	$mount-systm,r2 / for mounted device, r2 points to 1st word
				/ of its super block
1:
	add	(r2)+,(sp) / get byte address of allocation bit
	add	(sp)+,r2 / ?
	add	$2,r2 / ?
	rts	r0

which can be translated as

_-prefixed names were invented for exposition; others match source.
DeFelice's comments; my notes marked with [].
extern cdev;  // [device containing i-node referred to by any current i-node number]
extern struct {
	int _free_block_bitmap_size;
	char _free_block_bitmap[._free_block_bitmap_size];
	int _free_inode_bitmap_size;
	char _free_inode_bitmap[._free_inode_bitmap_size];
} systm, mount;

extern r1, r2, r3, mq;
imap() {  // get the byte that has the allocation bit for the i-number contained in r1
	r2 = r1 - 41; 
	mq = 1 << (r2 % 8); // move the 1 over (i-41) mod 8 positions to the left
	                    // to mask the correct bit

	int _off = r2 >> 3;  // (i-41) base [256] of the byte no. from the start of the map
	r2 = cdev ? &mount : &systm;

	_off += (r2 += 2)->_free_block_bitmap_size;
	r2 += _off; // ?
	r2 += 2;    // ?
	r2 = &r2->_free_inode_bitmap[_off];

	return;  // [r2 points to the byte, mq has the mask]
}

It's a very ugly way to skip the first array and index into the second, but that's clearly what this does. Charitably, one could say that the get byte address of allocation bit comment applies to the whole section.
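
As a quick sanity check of that address computation (my arithmetic, not source): i-number 167 (which shows up as /x later in this post) sits 126 i-numbers past 41, so imap should land on byte 15 of the free-i-node bitmap with mask 0100:

#include <stdio.h>

int main(void)
{
	int i = 167;
	printf("byte %d, mask 0%o\n", (i - 41) >> 3, 1 << ((i - 41) & 7));   /* byte 15, mask 0100 */
	return 0;
}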

Notice also how each system call can affect only one filesystem at once (the only one that would even have the capacity to would be link (II) and that's illegal), hence why cdev works at all as a global variable.

  1. the I/O queue consists of a list of I/O requests ordered by priority: 3 special-purpose with dedicated buffers, then nbuf general-purpose dynamic ones with 512-byte buffers
  2. I/O requests are bound to device and starting 512-byte-block number, but may have any even length
  3. the internal structure of the _flags byte need not concern us, but it supports lock, "want to read", "want to write", "dispatched read request", "dispatched write request" bits, and 0 means "buffer unclaimed"
  4. I can't really tell why _word_count is negative, except that the disks and tape just take that format (presumably the devices do while(left++) *dst++ = *src++;)
  5. since all but one buffer have a constant size, this doesn't really factor in
  6. when looking for a block on a block device, the dynamic buffer list is searched for, in order of preference:
    1. an unclaimed buffer for which (_dev, _block_number) matches (cdev, block number); a "buffer warm" flag is set in this case (lowest-priority result returned)
    2. any unclaimed buffer (highest-priority result returned)
    3. (sleep in iowait and run the search again)
    the returned buffer is moved to be the lowest priority
  7. in theory this wants to form an LRU global coherent shared buffer of nbufs blocks (and under no load this does work); in reality, due to the ordering of the constraints and the buffer _flags lifecycle, performance is degraded significantly: _flags are 0 only after the I/O has completed, while a warm-but-currently-pending request will have one of the "dispatched" flags set; so, during repeated access to block 32 on rf0, the queue may look like this (in priority order):
    t=0:
    1. (d, i) unclaimed (flags: none)
    t=1:
    1. (0, 32) flags: want to write, dispatched write request
    2. (d, i) unclaimed (flags: none)
    t=2:
    1. (0, 32) flags: want to write, dispatched write request
    2. (0, 32) flags: want to read
    3. (d, i) unclaimed (flags: none)
    t=3:
    1. (0, 32) flags: want to write, dispatched write request
    2. (0, 32) flags: want to read
    3. (0, 32) flags: want to read
    4. (d, i) unclaimed (flags: none)
    t=4:
    1. (0, 32) unclaimed (flags: none)
    2. (0, 32) flags: want to read, dispatched read request
    3. (0, 32) flags: want to read
    4. (d, i) unclaimed (flags: none)
    at this point the first request has succeeded (and requesting (0, 32) will return the buffer with priority 0), so now there's an I/O request in flight (and one in the queue which will get sent) for data we've just gotten
  8. additionally, given:
    t=1:
    1. (0, 32) flags: want to write, dispatched write request
    2. (d, i) unclaimed (flags: none)
    t=2:
    1. (0, 32) flags: want to write, dispatched write request
    2. (0, 32) flags: want to write
    3. (d, i) unclaimed (flags: none)
    t=3:
    1. (0, 32) flags: unclaimed (flags: none)
    2. (0, 32) flags: want to write, dispatched write request
    3. (0, 32) flags: want to write
    4. (d, i) unclaimed (flags: none)
    t=4:
    1. (0, 32) flags: unclaimed (flags: none)
    2. (0, 32) flags: unclaimed (flags: none)
    3. (0, 32) flags: want to write, dispatched write request
    4. (d, i) unclaimed (flags: none)
    requesting (0, 32) will return the buffer with priority 0 (as above), which has stale data from the write issued after t=0, even though a newer write has already been issued and committed (and another one is pending, and will happen, and will render the returned data even more stale)
  9. this would've been prevented if the discovery was more like
    1. a buffer for which (_dev, _block_number) matches (cdev, block number); a "buffer warm" flag is set in this case
    which would mean that there's at most 1 entry in the cache for any (device, block), making staleness impossible. I'm not sure how well the scheduler would cope with this, however (well. you'd write the scheduler to cope with it if you implemented it like this).
  10. the most common functions used to interface with the block buffering system — dskrd and wslot — mirror each other in a way that doesn't really make sense given the shipped buffer allocation model which can only return unclaimed buffers:
    dskrd requests a buffer for its device/block and returns it instantly if it was hot; if it wasn't, it sets its "want to read" flag, enqueues it, then idles until the "want to read" and "dispatched read request" flags are clear and returns it.
    Similarly, wslot requests a buffer for its device/block, then idle-loops until the "want to read" and "dispatched read request" flags are clear (which they always are since the returned buffer always has no flags set), then sets "want to write" and lock flags, and returns it
  11. (this, to me, reads like a remnant of a buffering scheme that could return warm-but-currently-pending buffers, especially since there are clear signs of a half-assed patch (the br 1f, only executed on a warm buffer, is clearly supposed to point to the next 1: label); perhaps the scheduler or whatever did not in fact cope well with this; or I'm wrong and they just blindly re-wrote dskrd!)
  12. dynamic buffers, like the rest of memory, are zeroed, except for the fields indicated: this means that the initial state of the cache isn't empty (or full of invalid entries), but it actually consists of nbufs entries that say that block 0 on rf0 consists of 512 zero bytes
  13. (so, if not enough churn has happened (or I/O is serviced infinitely fast(?)), od /dev/rf0 may return an all-zero block; I haven't been able to reproduce this in single-user mode with nbufs of 6; attempting to reduce the churn even more by using /bin/sh as /etc/init irreversibly destroyed my rootfs somehow)

# I-node cache

All names match source; rootdir comment DeFelice's, others mine.
Covered by the final file with variables and on-entry process tables initialisation.
union {
	struct {
		int  flgs;
		char nlks;
		char uid;
		int  size;
		int  dskp[8];
		int  ctim[2];
		int  mtim[2];
	}    i;
	char inode[32];  // (2 free bytes)
};
int  idev, ii;    // i-node (device, number)
bool imod;        // modified?

int rootdir;
rootdir = 41;     // set to 41 and never changed
_-prefixed names were invented for exposition; others match source.
Lower-case // comments DeFelice's, others mine.
int mdev; // Device containing second filesystem

extern r1;
iget() {  // r1 = i-number of current file
	if(r1 == ii && idev == cdev)
		return;

	if(imod) {        // has i-node of current file been modified
		imod = 0; // if it has, we must write the new i-node out on disk

		int _ino = _std::exchange(r1,   ii);
		int _dev = _std::exchange(cdev, idev);
		icalc(1);
		cdev = _dev;
		r1   = _ino;
	}

	if(r1 == 0) {
		r1 = ii;
		return;
	}

	if(cdev == 0 && r1 == mnti) {  // On rootfs and opening mounted-over file
		cdev = mntd;
		r1   = rootdir;
	}

	ii = r1;
	idev = cdev;
	icalc(0);  // read in i-node ii

	return;    // r1 has i-node number
}

icalc(_wr) { // i-node i is located in block (i+31.)/16.
	r1 += 31;   // and begins 32.*(i+31)mod16 bytes from its start
	int _pos_in_block = r1 % 16;
	r1 /= 16; // r1 contains block number of block in which
		  // i-node exists

	/*r5 =*/ dskrd(); // read in block containing i-node i.
	if(_wr)
		/*r5 =*/ wslot(); // set up data buffer for write
			          // (will be same buffer as dskrd got)

	r5 += _pos_in_block * 32; // r5 points to first word in i-node i.
	if(_wr) {
		_memcpy(r5, inode, 32);
		dskwr();
	} else
		_memcpy(inode, r5, 32);
}
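
It's worth running the block/offset formula once (my arithmetic, not source): the absolute byte address of i-node i on its device is 512*((i+31)/16) + 32*((i+31)%16), which for i-nodes 41 (/) and 167 (/x on rk0) gives exactly the 0x0900 and 0x18C0 offsets used in the copy-32-bytes experiment further down.

#include <stdio.h>

int main(void)
{
	int inos[] = { 41, 167 };
	for (int k = 0; k < 2; k++) {
		int i = inos[k];
		printf("i-node %3d: byte 0x%04X\n", i,
		       512 * ((i + 31) / 16) + 32 * ((i + 31) % 16));
	}
	return 0;   /* prints 0x0900 and 0x18C0 */
}
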
  1. since the smallest valid i-node is 1, the i-node cache starts empty (structurally, i-node 0 on rf0, all 0 bytes)
  2. in the installer kernel, it's initialised during rootfs prep
  3. in a regular kernel, it's initialised in the user-space trampoline, which has featured before on this blog; summa summarum: initialised by exec("/etc/init", …) (⇒ namei("/etc/init") ⇒ iget(rootdir))
  4. it's never cleared again, so in V1, at all times, there is exactly one currently-opened file
  5. asking for i-node 0 is overloaded to be equivalent to what we'd call fsync() of the currently-open i-node: if i-node modified, update the block it's contained in and enqueue writing it
  6. this is used exactly once, when returning from kernel- to user-space, which, before deciding how to return,
    • r1 = 0; iget()
    • if rootfs superblock modified: enqueues writing sb0
    • if second filesystem's superblock modified: enqueues writing sb1
  7. this explains why unix72 prescribes ^E, q (halt emulator, exit) to stop: while in userspace, all caches (i-node, block device) have been enqueued to be written, so the system state is coherent once the disk is quiescent
  8. in principle, this also means that, after init, all unix does is service interrupts (from userspace, disk/tape, clock, other devices); the only persistent inter- and supra-process state is the process tables themselves; this differs significantly from subsequent unixes, which have a kernel (scheduler) process
  9. however, unix72 is wrong to say that "You do not need to sync the system before shutdown.", because you do — by waiting until all userspace processes are in userspace (or sleeping on non-filesystem & non-block-device I/O) and disk/tape I/O has stopped (on a modern system with disk/tape backed by a VFS file this is below a human time-scale, but there definitely is a sync-equivalent procedure, and you can run afoul of it if you're unlucky)
  10. it's worth pointing out that this betrays what to a modern reader appears to be a layering violation: i-nodes appear on devices, not file-systems
  11. rather, perhaps, these haven't fully divorced yet. The unix Programmer's Manual is permeated with an unstated almost-equivalency of the two — a modelling of the file system as little more than a slightly-more-convenient way to address the disk (tape), or a relatively thin wrapper over the same
  12. this reflects the rejection of the file management attitude of the systems pre-dating unix that so pervades contemporary (DRAFT: The UNIX Time-Sharing System, D. M. Ritchie, mid-1971
    1. Introduction
    […] UNIX contains a number of features very seldom
    offered even by larger systems, including
    1. A versatile, convenient file system with complete integration between disk files and I/O devices;
    3. The File System The most important role of UNIX is to provide a file system. […] 3.1 Ordinary Files A file contains whatever information the user places there, for example symbolic or binary (object) programs. No particular structuring is expected by the system. […] A few user programs generate and expect files with more structure; […] however, the structure of files is controlled solely by the programs which use them, not by the system. 3.5 System I/O Calls […] There is no distinction between "random" and sequential I/O, nor is any logical or physical record size imposed by the system. The size of a file on the disk is determined by the location of the last piece of information written on it; no predetermination of the size of a file is necessary.
    (that's 2.5 times by page 7!)) and subsequent (The UNIX™ System: Making Computers More Productive, 1982, Bell Laboratories — dmr, 11:32-12:03:
    unix system has many features which make it easier for the programmer to write programs. These include formatless files, the hierarchical directory structure, the ability to pipeline the output of one command as the input of another, device independent I/O: all of these things make programming considerably easier than on most other systems.
    bwk, 12:03-12:55:
    The heart of the system is really the file system – the ability to store information for extended periods of time. And the reason— one of the reasons the system works as well as it does is that the file system is well-designed; and, many systems, you have to say an awful lot about a file before you can do anything with it. You have to say where it is, and how big it is, and what kind of information that's going to— that's going to be in it. All kinds of things that are basically utterly, completely irrelevant. Here, you don't have to do any of that: a file is as big as it is, it doesn't matter where it is as long as you know what it's called… and so, you basically don't have to think of any of those complexities that you have in other systems. When you want information in a file, you put it there; when you want it back, you get it out again, and you don't have to think about size, or number of records and number of fields, or anything like that, unless it's really germane to your program. For most purposes, it's utterly irrelevant.
    ken, 12:56-13:17:
    A file is simply a sequence of bytes. Its main attribute is its size. By contrast, in more conventional systems, a file has a dozen or so attributes. To specify or create a file, it takes endless amounts of chit-chat. If you want a unix system file, you simply ask for a file. And you can use it interchangeably wherever you want a file.
    (this is all consecutive. it's so serious for them, and yet we cannot imagine a world where this would be prescient information.)) marketing
  13. there is a view that is puritanical in a different direction than how a modern reader and system conceptualises strict filesystem-on-blockdev layering: the file system's purpose is to let you name (groups of) blocks on the underlying device, as fast and easily as possible, and anything that makes that harder or makes it be more structured is heresy
  14. you can cut out basically any part of the data area of the file-system by marking it used (and remembering which blocks you marked used): they do this to the 32k area at the end of the disk with a bootloader and a few kernels — what we'd call /boot — and the nproc 8k+512-byte-sized areas for swapped-out process images
  15. this makes rf0 implicitly (non-self-descriptively) partitioned into 3 (2+nproc) sections — the file system (V) per se, the boot area, and the swap area (viz. nproc individually-allocated swap areas) — but the non-self-descriptiveness poses issues (though idk whether they'd noticed this at the time, but it seems pretty obvious)
  16. later unixes do away with the split-/boot model and read the kernel from a file-system
  17. at a later point in time, in descendant systems, "the swap area of the root file system/block device" reifies into "the swap" and the file system gains a length field (and a bootloader hole in block 0), and the root device becomes more explicitly partitioned into {[bootloader+]filesystem, rest of device}, with rest of device being used for swapping (rather, the swap), but could be for anything, and all readers/writers will understand this without knowing anything about the structure of the file-system beyond its length
  18. later systems still grow actual partitions (poorly), wherein 81200-block (39M+664k) RP(IV) disks have
    file       start   length   size
    rp0,8,16       0   40600    19M + 844k
    rp1,9,17   40600   40600    19M + 844k
    rp2,10,18      0    9200    4M + 504k
    rp3,11,19   7200    9200    4M + 504k
    rp4,12,20      0   65535    32M
    rp5,13,21  15600   65535    32M
    with a recommended configuration of rootfs on rp2, /usr on rp5, and swap space in the unused blocks 9500 to 15600 of rp0 (or, equivalently, rp4)(?!); all the while warning that "It is unwise for all of these files to be present in one installation, since there is overlap in addresses and protection becomes a sticky matter."
  19. (this is by no means universal in V5: depending on the number of RF(IV)-compatible disks you have (context for reading the manual: RF11 is the controller, RS11 is the disk), you're supposed to make /dev/rf0 with the minor corresponding to how long they all are in 512k increments minus 1 (conveniently, this corresponds to how many disks you have minus one, since they're 512k long), and you get a continuous block device spanning them all in order, so if you have 3, you mknod /dev/rf0 b n 2 and you get a 1.5M device; rf.c backs this up, and shows that the controller exposes all the disks in order like this, and this is just protection (Digital Equipment Corporation, RF11/RS11 DECdisk system manual, 6th Printing, May 1973, DEC-11-HRFD-D, p. 1-4))
  20. ((conversely, RK(IV) has a three-liner implementing opt-in any-width RAID0 striping; but so does V4, and it's even better there; they were kinda just doing whatever back then))
  21. at UCB, by 4BSD, every supported disk type has partitions, and they're all in every installation, with the unwise sentence conspicuously disappearing (this doesn't exist in linkable form; cf. usr/man/man4/*.4 in 3bsd/4.0 on CSRG CD 1); from experience (sending mail to/from 4.2BSD) this model fucking sucks, especially as disks get bigger and their lay-outs more convoluted, as they had by 4.2BSD
  22. but this implies a different solution to this problem — an alternate-reality unix — where the swap areas for each process ended up properly as files on the filesystem as well; this may've been our reality if the performance eval looked different
  23. (this does happen a bit: ext4 keeps some metadata as special unnameable files, and linux lets you have "swap files" which are actually just like V1 swap but the swap area is named as part of an allocated i-node instead of blindly blocked off, but no system du jour swaps to addressable filesystem files, separately or together; all Windowses do this though)
  24. sys mount (II) claims that
    Almost always, name should be a directory so that
    an entire file system, not just one file, may
    exist on the removable device 
    but this is evidently false: every attempt to get the i-node that was the target of mount will return the root directory of the mounted file system; since (unlike in later and modern systems) the file type is not cached in the directory entry, it is impossible to name the mounted-over file, or at all distinguish it from the root of the mounted file system
  25. unlike in comparatively modern systems with a name-based VFS, the file system is mounted on the i-node, which means it's actually mounted on every link the mounted-over file has:
# chdir /tmp
# echo gaming >file1
# ln file1 file2
# ls -l
total    8
 47 s-r-r-  1 sys    1664 Jan  1 00:00:00 etma
119 s-rwrw  2 root      8 Jan  1 00:00:00 file1
119 s-rwrw  2 root      8 Jan  1 00:00:00 file2
 46 s-rwr-  1 root     26 Jan  1 00:00:00 ttmp
 45 s-rwr-  1 root    142 Jan  1 00:00:00 utmp
# cat file1 file2
gaming
gaming
# cat >mt.s
mount = 21.
umount = 22.
sys     umount; rk0
sys     mount; rk0; file1
sys     exit
rk0:    </dev/rk0\0>
file1:  <file1\0>
# as mt.s
I
II
# a.out
# ls -l
total   10
124 sxrwrw  1 root     96 Jan  1 00:00:00 a.out
 47 s-r-r-  1 sys    1664 Jan  1 00:00:00 etma
 41 sdrwr- 10 root    120 Jan  1 00:00:00 file1
 41 sdrwr- 10 root    120 Jan  1 00:00:00 file2
120 s-rwrw  1 root    107 Jan  1 00:00:00 mt.s
 46 s-rwr-  1 root     26 Jan  1 00:00:00 ttmp
 45 s-rwr-  1 root    142 Jan  1 00:00:00 utmp
# ls -l file1 file2
file1:
total   82
202 sdrwr-  2 root    140 Jan  1 00:00:00 boot
197 sdrwr-  2 root     60 Jan  1 00:00:00 fort
194 sdrwr-  2 root     40 Jan  1 00:00:00 jack
192 sdrwr-  2 root     30 Jan  1 00:00:00 ken
183 sdrwr-  2 root    100 Jan  1 00:00:00 lib
209 sdrwrw  2 root    110 Jan  1 00:00:00 nab-test
 42 sdrwr-  5 root     60 Jan  1 00:00:00 src
168 sdrwr-  2 root    360 Jan  1 00:00:00 sys
217 lxrwrw  2 root  36432 Jan  1 00:00:00 u
167 sxrwrw  1 root     54 Jan  1 00:00:00 x

file2:
total   82
202 sdrwr-  2 root    140 Jan  1 00:00:00 boot
197 sdrwr-  2 root     60 Jan  1 00:00:00 fort
194 sdrwr-  2 root     40 Jan  1 00:00:00 jack
192 sdrwr-  2 root     30 Jan  1 00:00:00 ken
183 sdrwr-  2 root    100 Jan  1 00:00:00 lib
209 sdrwrw  2 root    110 Jan  1 00:00:00 nab-test
 42 sdrwr-  5 root     60 Jan  1 00:00:00 src
168 sdrwr-  2 root    360 Jan  1 00:00:00 sys
217 lxrwrw  2 root  36432 Jan  1 00:00:00 u
167 sxrwrw  1 root     54 Jan  1 00:00:00 x
# cat file1/x file2/x
tap x\
./fort/fc1\
./fort/fc2\
./fort/fc3\
./fort/fc4
tap x\
./fort/fc1\
./fort/fc2\
./fort/fc3\
./fort/fc4

I've never seen this mentioned or hinted at by anyone, ever. Presumably because directories can only have 1 non-.. link, and people mount over directories generally? No modern system lets you do this.

  1. in principle, if the i-node on the mounted file system that was supposed to contain the root directory actually contained another type of file that would also Just Work (actually we can just test this too: after copying 32 bytes from after 0x18C0 to after 0x0900 — i-node 167 (/x) to 41 (/) — on rk0 we observe
:login: root
root
# ls -l
total    6
 58 sdrwr-  2 root    620 Jan  1 00:00:00 bin
 42 sdrwr-  2 root    250 Jan  1 00:00:00 dev
 48 sdrwr-  2 root    110 Jan  1 00:00:00 etc
 44 sdrwr-  2 root    120 Jan  1 00:00:00 tmp
 41 sxrwrw  1 root     54 Jan  1 00:00:00 usr
# cat usr
tap x\
./fort/fc1\
./fort/fc2\
./fort/fc3\
./fort/fc4
  1. – has anyone done this?)
  2. this is the final buffer we'll look at in this post, so: knowing that the memory map consists of kernel memory in 0-16k, then 8k of userspace memory, and buffers total 6164=6k+20 bytes (systm+inode+mount+clist+buffer), buffers are 37.6% of kernel and 25.1% of total memory (arithmetic below).
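
The percentages in the last item, worked out (my arithmetic):

#include <stdio.h>

int main(void)
{
	int bufs = 6164;   /* systm + inode + mount + clist + buffer */
	printf("%.1f%% of kernel\n", 100.0 * bufs / (16 * 1024));         /* 37.6 */
	printf("%.1f%% of total\n",  100.0 * bufs / ((16 + 8) * 1024));   /* 25.1 */
	return 0;
}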

# And now, a vicious take-down of a retiree who's made a minor error 55 years ago

with a final look at emphasised fragments:

extern r1, r2, r3, cdev;
dskw(_ino) /* write routine for non-special files */
{
	r1 = _ino; iget(); /* write i-node out (if modified), read i-node 'r1' on 'cdev'
	                      into i-node area of core */
	r2 = *u.fofp + u.count; /* file offset [(u.off) or the offset in
	                           the fsp entry for this file] +
	                           no. of bytes to be written */
	if(r2 > i.size) {
		i.size = r2;
		setimod();
	}

	while(u.count) {
		mget(); /* get the block no. in which to write the next data byte */

		/* if lower 9 bits of file offset are 0,
		   file offset = 0, 512, 1024,...(i.e., start of new block): */
		if(*u.fofp & 511 || u.count < 512) {
			dskrd();  /* if there is not enough data to fill an entire block, */
		}	          /* read block 'r1' on 'cdev' into an I/O buffer */

		wslot(); /* set write and inhibit bits in I/O queue, proc. status=0,
		            r5 points to 1st word of data */
		sioreg(); /* r3 = no. of bytes of data,
		             r1 = address of data,
		             r2 points to location in buffer in which to start writing data */
		_memcpy(r2, r1, r3); /* transfer a byte of data to the I/O buffer */

		dskwr(); /* yes, write the block and the i-node */
	}
}

where we observe (assuming no other processes are running that would influence the make-up of the block device I/O queue), if, say, writing to a small file with i-node 167 on rk0 that remains small throughout (which lets us discard mget cache effects from indirect blocks):

  1. on l. 214, the file being written to is loaded into the i-node cache
  2. (if it wasn't already in there, the block containing the i-node is looked up in the block cache (and, if missing, enqueued to be read, and waited for), parsed)
  3. ((if it had to look up in the block cache, it now contains
    1. (1, 12) flags: unclaimed (flags: none)
    2. (d, i) unclaimed (flags: none)))
    or
    1. (1, 12) flags: unclaimed (flags: none)
    2. (0, 4) flags: unclaimed (flags: none)
    3. (d, i) unclaimed (flags: none)))
    if the previously-open i-node was modified (in this example: 0,4 has the i-node for /); this case folds into the former
  4. if writing partial or non-block-aligned data (l. 240), the data block at the file cursor is looked up in the block cache (l. 241) (and, if missing, enqueued to be read, and waited for)
  5. (the block cache now contains
    1. (1, 222) flags: unclaimed (flags: none)
    2. (1, 12) flags: unclaimed (flags: none)
    3. (d, i) unclaimed (flags: none))
    (if the block cache contained the data block and had to look up the i-node in the i-node cache) or
    1. (1, 222) flags: unclaimed (flags: none)
    2. (d, i) unclaimed (flags: none)
    (in all other cases)
  6. the data block at the file cursor is looked up in the block cache (l. 243) and locked for writing
  7. (the block cache now contains
    1. (1, 222) flags: lock, want to write; data=(current contents of block (1,222) if condition above was met, otherwise stale data from some other block that was reclaimed for this)
    2. (1, 12) flags: unclaimed (flags: none)
    3. (d, i) unclaimed (flags: none))
    or
    1. (1, 222) flags: lock, want to write; data=(current contents of block (1,222) if condition above was met, otherwise stale data from some other block that was reclaimed for this)
    2. (d, i) unclaimed (flags: none)
    resp.
  8. after the open-coded _memcpy, the buffer contains the new contents of the block
  9. on l. 252, the buffer with priority 0 is unlocked, and enqueued if there's no contending I/O (dskwr is bufp[0]._flags &= ~lock; ppoke();)
  10. (the block cache now contains
    1. (1, 222) flags: want to write (+ dispatched write request if successfully enqueued)
    2. (1, 12) flags: unclaimed (flags: none)
    3. (d, i) unclaimed (flags: none))
    or
    1. (1, 222) flags: want to write (+ dispatched write request if successfully enqueued)
    2. (d, i) unclaimed (flags: none)
    resp.
  11. the sequence above repeats until all affected data blocks were written and enqueued; the I/O queue grows like
    1. (1, 223) flags: want to write (+ dispatched write request if successfully enqueued)
    2. (1, 222) flags: want to write (+ dispatched write request if successfully enqueued)
    3. (1, 12) flags: unclaimed (flags: none)
    4. (d, i) unclaimed (flags: none))
  12. after returning to one of the callers syswrite, wdir (sysunlink), badsys, at some point control ends up returning from kernel- to user-space at which point the (block containing the) i-node is flushed (if it was modified because the file grew)
  13. (dskw is only called by as-if by tail recursion from writei, so returning from dskw returns to writei's caller)
  14. ((it's also called by the installer kernel to write the installer i-nodes which flushes everything together afterward))
  15. it follows trivially from the above that the "yes, write the block and the i-node" comment is bogus: at most, it writes the block; most likely, it just returns because there's already I/O happening to the device containing the block
  16. an attentive reader will notice that if the process times out (a clock interrupt happens ⇒ user quantum expired ⇒ process swapped out), another process can re-arrange the I/O queue arbitrarily; this is expected, except this function behaves as-if it held a lock on the buffer between ll. 240-252, but it doesn't
  17. thus, if the process is interrupted between proposition 58 (l. 241) and proposition 60 (l. 243), and is not writing a full block (i.e. the original data is needed), then if the returned block is pushed out of the cache, wslot will fail to find the buffer with the original data, and will instead update a "random" block found in the cache with partial data
  18. if the process is interrupted between proposition 60 (l. 243) and proposition 63 (l. 252) and the swapped-in process does any dynamic buffer I/O, the buffer obtained in proposition 60 (l. 243) will be lost forever, since it will have been moved to priority >0, and thus never unlocked
  19. (this would be fixed if dskwr did container_of(r5, struct _dynamic_iobuf, _buf)->_flags &= ~lock; ppoke();, updating the used buffer instead of the one with priority 0; sketched below)

at least, I think this is possible. The second half of clock is impenetrable to me.
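
For concreteness, item 19's one-liner spelled out as a sketch (container_of is the usual modern macro and _LOCK is a stand-in for the lock bit; neither name is from the source, while _dynamic_iobuf and the fact that r5 points at the data buffer are from the exposition above):

#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

extern r5;

dskwr()
{
	struct _dynamic_iobuf *_b = container_of(r5, struct _dynamic_iobuf, _buf);
	_b->_hdr._flags &= ~_LOCK;   /* unlock the buffer we actually filled */
	ppoke();                     /* then poke the queue as before        */
}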

There's no "satisfying" way to end this post. All novel results were presented in-line, and pedantry that disproves well-established consensus and unix historiosophy were rewards unto themselves. Spin File, View, Edit maybe? Comment for engagement? Call to action 3, 4, 5…?

