If we cast our minds back and eyeballs down to write (II) → syswrite → writeinode → dskw (write routine for non-special files):
dskw: / write routine for non-special files
	mov	(sp),r1 / get an i-node number from the stack into r1
	jsr	r0,iget / write i-node out (if modified), read i-node 'r1'
		/ into i-node area of core
	mov	*u.fofp,r2 / put the file offset [(u.off) or the offset in
		/ the fsp entry for this file] in r2
	add	u.count,r2 / no. of bytes to be written + file offset is
		/ put in r2
	cmp	r2,i.size / is this greater than the present size of
		/ the file?
	blos	1f / no, branch
	mov	r2,i.size / yes, increase the file size to file offset +
		/ no. of data bytes
	jsr	r0,setimod / set imod=1 (i.e., core inode has been
		/ modified), stuff time of modification into
		/ core image of i-node
1:
	jsr	r0,mget / get the block no. in which to write the next data
		/ byte
	bit	*u.fofp,$777 / test the lower 9 bits of the file offset
	bne	2f / if its non-zero, branch; if zero, file offset = 0,
		/ 512, 1024,...(i.e., start of new block)
	cmp	u.count,$512. / if zero, is there enough data to fill an
		/ entire block? (i.e., no. of
	bhis	3f / bytes to be written greater than 512.? Yes, branch.
		/ Don't have to read block
2: / in as no past info. is to be saved (the entire block will be
		/ overwritten).
	jsr	r0,dskrd / no, must retain old info.. Hence, read block 'r1'
		/ into an I/O buffer
3:
	jsr	r0,wslot / set write and inhibit bits in I/O queue, proc.
		/ status=0, r5 points to 1st word of data
	jsr	r0,sioreg / r3 = no. of bytes of data, r1 = address of data,
		/ r2 points to location in buffer in which to
		/ start writing data
2:
	movb	(r1)+,(r2)+ / transfer a byte of data to the I/O buffer
	dec	r3 / decrement no. of bytes to be written
	bne	2b / have all bytes been transferred? No, branch
	jsr	r0,dskwr / yes, write the block and the i-node
	tst	u.count / any more data to write?
	bne	1b / yes, branch
	jmp	ret / no, return to the caller via 'ret'
which can be translated to
extern r1, r2, r3, cdev;

dskw(_ino) /* write routine for non-special files */
{
	r1 = _ino;
	iget(); /* write i-node out (if modified), read i-node 'r1' on 'cdev' into i-node area of core */
	r2 = *u.fofp + u.count; /* file offset [(u.off) or the offset in the fsp entry for this file] + no. of bytes to be written */
	if(r2 > i.size) {
		i.size = r2;
		setimod();
	}

	while(u.count) {
		mget(); /* get the block no. in which to write the next data byte */
		/* if lower 9 bits of file offset are 0, file offset = 0, 512, 1024,...(i.e., start of new block): */
		if(*u.fofp & 511 || u.count < 512) {
			dskrd(); /* if there is not enough data to fill an entire block, */
		}        /* read block 'r1' on 'cdev' into an I/O buffer */
		wslot(); /* set write and inhibit bits in I/O queue, proc. status=0, r5 points to 1st word of data */
		sioreg(); /* r3 = no. of bytes of data, r1 = address of data, r2 points to location in buffer in which to start writing data */
		_memcpy(r2, r1, r3); /* transfer a byte of data to the I/O buffer */
		dskwr(); /* yes, write the block and the i-node */
	}
}
we observe the following: according to DeFelice's commentary, the final dskwr call will schedule I/O such that the (implied current) i-node and the freshly-written data block are sent to disk. This is not true: the old i-node (if any) was already enqueued, the current i-node is not affected by this call, and the only buffer scheduled here is the new data block. This is all the more interesting because the behaviour of the iget call is labelled correctly, and would contradict the supposed dskwr behaviour.
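The read-before-write decision buried in the middle of dskw is easy to lose among the branch labels, so here it is pulled out as a predicate. This is a minimal sketch, not anything from the V1 source: `needs_read_before_write` and `BLOCK_SIZE` are names made up for illustration, and the offset and count are passed in as plain parameters instead of living in `*u.fofp` and `u.count`.

```c
#include <stdbool.h>

#define BLOCK_SIZE 512u /* assumed name; the source just writes 512. and $777 */

/* Hypothetical helper: true when the old block contents must be read in
 * (dskrd) before the new data is copied over part of it. */
static bool needs_read_before_write(unsigned offset, unsigned count)
{
	/* Either the write starts mid-block (lower 9 bits of the offset set),
	 * or there is less than one full block left to write. */
	return (offset & (BLOCK_SIZE - 1u)) != 0 || count < BLOCK_SIZE;
}
```

Only a block-aligned write of at least 512 bytes skips the read, which is exactly the `bne 2f` / `bhis 3f` pair in the assembly.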
proc. status is the processor status, which contains arithmetic status flags (overflow, carry, zero, negative) and the interrupt priority:
During the interrupt acknowledge and priority arbitration phase the LSI-11/23 processor will acknowledge interrupts under the following conditions:
- The device interrupt priority is higher than the current PS<7:5>.
- The processor has completed instruction execution and no additional bus cycles are pending.
which can themselves go up to 7 (and micro/pdp-11's smallest interrupt priority seems to be 4), so setting it to 7 blocks all interrupts. The V1 kernel uses these values:
| Instruction | Value (octal) | Value (binary) | Priority (binary) | Priority | Occurrences | Filter |
|---|---|---|---|---|---|---|
| clr *$ps | 0₈ | 0000 0000 0000 0000 | 000 | 0 | 13 | All interrupts allowed |
| mov $240,*$ps | 240₈ | 0000 0000 1010 0000 | 101 | 5 | 11 | Priority 6, 7 interrupts allowed |
| mov $300,*$ps | 300₈ | 0000 0000 1100 0000 | 110 | 6 | 2 | Priority 7 interrupts allowed |
| mov $340,*$ps | 340₈ | 0000 0000 1110 0000 | 111 | 7 | 2 | All interrupts blocked |
Priority 7 is used to guard the I/O buffer queue. It is set by bufaloc (which finds, locks, and returns a dynamic I/O buffer slot, or sleeps in I/O wait until one is available) and by ppoke (which wraps poke in priority=7, and is used for the static sb0/sb1/swp I/O buffer slots). poke enqueues dirty buffers onto their respective device I/O queues, so it requires priority=7 on entry; this matches "must call with these locks held" interface contracts in modern Linux.
Priority 6 (swap, sleep) is used to guard the run queues (runq, p.link). Their precise structure is thankfully not relevant here.
Priority 5 is used by the teletype driver to protect tty state (thus why its use sprawls quite so).
Priority is cleared when not in a critical section (or during lock inversion while in I/O sleep in idle).
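For the avoidance of doubt about where the priority sits in those octal constants, a small sketch of the decoding. It assumes only what the quoted manual says (priority in PS<7:5>, a device interrupts when its priority is strictly higher); `ps_priority` and `interrupt_allowed` are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* The interrupt priority occupies bits 7..5 of the processor status word. */
static unsigned ps_priority(uint16_t ps)
{
	return (ps >> 5) & 7u;
}

/* Per the quoted acknowledge rule: a device at `level` gets through only if
 * its priority is strictly higher than the current PS<7:5>. */
static int interrupt_allowed(uint16_t ps, unsigned level)
{
	assert(level <= 7u);
	return level > ps_priority(ps);
}
```

With this, `ps_priority(0240)` is 5, `ps_priority(0300)` is 6 and `ps_priority(0340)` is 7, matching the table rows.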
One would be remiss to not note that this interface, obviously, is not very good: it's less of an issue for a system with barely any memory management, but overwriting other flags (well, the non-ephemeral ones to the left of the priority) combined with a text-search-defeating non-zero offset both suck ass. Thus, subsequent systems grow an spl instruction and C unixes have (V4 has) spl[014567]() functions, which use this instruction on the pdp-11/45 (and open-code it on the pdp-11/40, emulating atomicity by ORing to get to priority 7, then clearing the unneeded bits). (Note also how pdp-11/40 spl4() falls through to spl5().)
was touched upon in the last post, but only insofar as the number of dynamic buffers.
struct _iobuf_hdr {
char _dev; // 0=rf0/1=rk0/≥2=tap0-7 (dectape)
char _flags;
int _block_number;
int _word_count; // negative
char *_buf;
};
struct _iobuf_hdr * bufp[nbuf + 3]; // priority = reverse index
struct {
char _ttybufs[140][ntty];
struct _dynamic_iobuf {
struct _iobuf_hdr _hdr;
char _buf[512];
} _iobufs[nbuf];
} buffer;
bufp[0..nbuf] = &buffer._iobufs[0..nbuf]._hdr;
bufp[0..nbuf]->_buf = buffer._iobufs[0..nbuf]._buf;
bufp[0..nbuf]->_word_count = -256; // 512 bytes
struct _iobuf_hdr sb0; // I/O queue entry drum
struct _iobuf_hdr sb1; // I/O queue entry disk (mounted device)
struct _iobuf_hdr swp; // I/O queue entry core image being swapped
struct /* 218 bytes */ systm; // rootfs superblock
struct inode inode;
bufp[nbuf] = &sb0;
sb0._buf = &systm;
sb0._word_count = (&systm - &inode) / 2;
char mount[1024]; // second filesystem superblock
bufp[nbuf + 1] = &sb1;
sb1._buf = &mount;
sb1._word_count = -512; // 1024 bytes
union {
struct /*...*/ u; // process-specific state
char user[64];
} [[pin(core - 64)]]; // (precedes start of userspace memory)
bufp[nbuf + 2] = &swp;
swp._buf = &user;
// swp._word_count set when swapping process in/out
Herein we observe that:
Every file system storage volume (e.g. RF disk, RK disk, DECtape reel) has a common format[…]divided into[…]512 byte)blocks. Blocks 0 and 1 are collectively known as the super-block.
ovh, which is rather telling of the attitudes of the time) vs. in what we'd call iowait (s.wait/dsk) vs. idle (no process to run) (s.idlet/idl) vs. in userspace (s.chrgt/usr) + error count (s.drerr/der; the source doesn't use this symbol at all, but it's got fully functional logs in the manual, so perhaps DeFelice got an early (or late) version with this stubbed out? unclear.); but the real current time is also kept there (the clock updating the time(s) doesn't cause the super-block to be written to disk, of course, but it will come along with the next update)
One would also be remiss not to note this error in the analysis:
imap: / get the byte that has the allocation bit for the i-number contained
		/ in r1
	mov	$1,mq / put 1 in the mq
	mov	r1,r2 / r2 now has i-number whose byte in the map we
		/ must find
	sub	$41.,r2 / r2 has i-41
	mov	r2,r3 / r3 has i-41
	bic	$!7,r3 / r3 has (i-41) mod 8 to get the bit position
	mov	r3,lsh / move the 1 over (i-41) mod 8 positions to the left
		/ to mask the correct bit
	asr	r2
	asr	r2
	asr	r2 / r2 has (i-41) base 8 of the byte no. from the start of
		/ the map
	mov	r2,-(sp) / put (i-41) base 8 on the stack
	mov	$systm,r2 / r2 points to the in-core image of the super
		/ block for drum
	tst	cdev / is the device the disk
	beq	1f / yes
	add	$mount-systm,r2 / for mounted device, r2 points to 1st word
		/ of its super block
1:
	add	(r2)+,(sp) / get byte address of allocation bit
	add	(sp)+,r2 / ?
	add	$2,r2 / ?
	rts	r0
which can be translated as
extern cdev; // [device containing i-node referred to by any current i-node number]
extern struct {
	int  _free_block_bitmap_size;
	char _free_block_bitmap[._free_block_bitmap_size];
	int  _free_inode_bitmap_size;
	char _free_inode_bitmap[._free_inode_bitmap_size];
} systm, mount;
extern r1, r2, r3, mq;

imap() { // get the byte that has the allocation bit for the i-number contained in r1
	r2 = r1 - 41;
	mq = 1 << (r2 % 8); // move the 1 over (i-41) mod 8 positions to the left
	                    // to mask the correct bit
	int _off = r2 >> 3; // (i-41) base [256] of the byte no. from the start of the map
	r2 = cdev ? &mount : &systm;
	_off += (r2 += 2)->_free_block_bitmap_size;
	r2 += _off; // ?
	r2 += 2;    // ?
	r2 = &r2->_free_inode_bitmap[_off]; // ⇔ the three statements above
	return; // [r2 points to the byte, mq has the mask]
}
It's a very ugly way to skip the first array and index into the second, but that's clearly what this does.
Charitably, one could say that the "get byte address of allocation bit" comment applies to the whole section.
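To make the "skip the first array, index into the second" reading concrete, here is the same arithmetic done on a flat byte image of a super-block. This is a sketch under assumptions: `imap_locate` and `struct imap_pos` are invented names, the super-block is taken to begin with the free-block bitmap size as a 16-bit little-endian word (the pdp-11 byte order), and the i-number is taken to be at least 41, since that is what the subtraction implies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct imap_pos {
	size_t  byte_offset; /* from the start of the super-block image */
	uint8_t mask;        /* bit within that byte */
};

/* Hypothetical helper mirroring imap: sb points at the in-core super-block,
 * ino is the i-number whose allocation bit we want. */
static struct imap_pos imap_locate(const uint8_t *sb, unsigned ino)
{
	assert(ino >= 41u);
	unsigned rel = ino - 41u;
	size_t block_map_size = (size_t)sb[0] | ((size_t)sb[1] << 8);

	struct imap_pos pos;
	/* size word (2) + free-block bitmap + i-node bitmap size word (2) */
	pos.byte_offset = 2u + block_map_size + 2u + (rel >> 3);
	pos.mask = (uint8_t)(1u << (rel & 7u));
	return pos;
}
```

With a made-up free-block bitmap of 128 bytes, i-node 41 lands on byte 132 with mask 1, and i-node 50 on byte 133 with mask 2. The two `/ ?` instructions are just the `+ block_map_size + off` and `+ 2` terms.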
Notice also how each system call can affect only one filesystem at once (the only one that would even have the capacity to would be link (II) and that's illegal), hence why cdev works at all as a global variable.
od /dev/rf0 may return an all-zero block; I haven't been able to reproduce this in single-user mode with nbufs of 6; attempting to reduce the churn even more by using /bin/sh as /etc/init irreversibly destroyed my rootfs somehow)
union {
struct {
int flgs;
char nlks;
char uid;
int size;
int dskp[8];
int ctim[2];
int mtim[2];
} i;
char inode[32]; // (2 free bytes)
};
int idev, ii; // i-node (device, number)
bool imod; // modified?
int rootdir;
rootdir = 41; // set to 41 and never changed
int mdev; // Device containing second filesystem
extern r1;

iget() { // r1 = i-number of current file
	if(r1 == ii && idev == cdev)
		return;

	if(imod) { // has i-node of current file been modified
		imod = 0; // if it has, we must write the new i-node out on disk
		int _ino = _std::exchange(r1, ii);
		int _dev = _std::exchange(cdev, idev);
		icalc(1);
		cdev = _dev;
		r1 = _ino;
	}

	if(r1 == 0) {
		r1 = ii;
		return;
	}

	if(cdev == 0 && r1 == mnti) { // On rootfs and opening mounted-over file
		cdev = mntd;
		r1 = rootdir;
	}

	ii = r1;
	idev = cdev;
	icalc(0); // read in i-node ii
	return; // r1 has i-node number
}

icalc(_wr) { // i-node i is located in block (i+31.)/16.
	r1 += 31; // and begins 32.*(i+31)mod16 bytes from its start
	int _pos_in_block = r1 % 16;
	r1 /= 16; // r1 contains block number of block in which
	          // i-node exists
	/*r5 =*/dskrd(); // read in block containing i-node i.
	if(_wr)
		/*r5 =*/wslot(); // set up data buffer for write
		                 // (will be same buffer as dskrd got)
	r5 += _pos_in_block * 32; // r5 points to first word in i-node i.
	if(_wr) {
		_memcpy(r5, inode, 32);
		dskwr();
	} else
		_memcpy(inode, r5, 32);
}
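The i-node placement formula from the icalc comments, as a standalone sketch. `inode_locate` and `struct inode_pos` are names made up here; the constants (31, 16 i-nodes per block, 32 bytes per i-node) are the ones in the listing.

```c
struct inode_pos {
	unsigned block;       /* block number on the device */
	unsigned byte_offset; /* offset of the i-node within that block */
};

/* i-node i lives in block (i+31)/16, at 32*((i+31) mod 16) bytes in. */
static struct inode_pos inode_locate(unsigned ino)
{
	unsigned n = ino + 31u;
	struct inode_pos p;
	p.block = n / 16u;
	p.byte_offset = (n % 16u) * 32u;
	return p;
}
```

So i-node 1 sits at the very start of block 2 (right after the two super-block blocks), the root directory (41) is 256 bytes into block 4, and the i-node 167 used as an example later is 192 bytes into block 12.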
"You do not need to sync the system before shutdown.", because you do: by waiting until all userspace processes are in userspace (or sleeping on non-filesystem & non-block-device I/O) and disk/tape I/O has stopped (on a modern system with disk/tape backed by a VFS file this is below a human time-scale, but there definitely is a sync-equivalent procedure, and you can run afoul of it if you're unlucky)
mid-1971
1. Introduction
[…]
UNIX contains a number of features very seldom offered even by larger systems, including
- A versatile, convenient file system with complete integration between disk files and I/O devices;
[…]
3. The File System
The most important role of UNIX is to provide a file system.
[…]
3.1 Ordinary Files
A file contains whatever information the user places there, for example symbolic or binary (object) programs. No particular structuring is expected by the system.[…]A few user programs generate and expect files with more structure;[…]however, the structure of files is controlled solely by the programs which use them, not by the system.
3.5 System I/O Calls
[…]There is no distinction between "random" and sequential I/O, nor is any logical or physical record size imposed by the system. The size of a file on the disk is determined by the location of the last piece of information written on it; no predetermination of the size of a file is necessary.
(that's 2.5 times by page 7!)) and subsequent (The UNIX™ System: Making Computers More Productive, 1982, Bell Laboratories:
dmr, 11:32-12:03:
unix system has many features which make it easier for the programmer to write programs. These include formatless files, the hierarchical directory structure, the ability to pipeline the output of one command as the input of another, device independent I/O: all of these things make programming considerably easier than on most other systems.
bwk, 12:03-12:55:
The heart of the system is really the file system – the ability to store information for extended periods of time. And the reason – one of the reasons the system works as well as it does is that the file system is well-designed; and, many systems, you have to say an awful lot about a file before you can do anything with it. You have to say where it is, and how big it is, and what kind of information that's going to – that's going to be in it. All kinds of things that are basically utterly, completely irrelevant. Here, you don't have to do any of that: a file is as big as it is, it doesn't matter where it is as long as you know what it's called… and so, you basically don't have to think of any of those complexities that you have in other systems. When you want information in a file, you put it there; when you want it back, you get it out again, and you don't have to think about size, or number of records and number of fields, or anything like that, unless it's really germane to your program. For most purposes, it's utterly irrelevant.
ken, 12:56-13:17:
A file is simply a sequence of bytes. Its main attribute is its size. By contrast, in more conventional systems, a file has a dozen or so attributes. To specify or create a file, it takes endless amounts of chit-chat. If you want a unix system file, you simply ask for a file. And you can use it interchangeably wherever you want a file.
(this is all consecutive. it's so serious for them, and yet we cannot imagine a world where this would be prescient information.)) marketing
| file | start | length | size |
|---|---|---|---|
| rp0,8,16 | 0 | 40600 | 19M+844k |
| rp1,9,17 | 40600 | 40600 | 19M+844k |
| rp2,10,18 | 0 | 9200 | 4M+504k |
| rp3,11,19 | 7200 | 9200 | 4M+504k |
| rp4,12,20 | 0 | 65535 | 32M |
| rp5,13,21 | 15600 | 65535 | 32M |
in the unused blocks 9500 to 15600 of rp0 (or, equivalently, rp4)(?!); all the while warning that
"It is unwise for all of these files to be present in one installation, since there is overlap in addresses and protection becomes a sticky matter."
"unwise" sentence conspicuously disappearing (this doesn't exist in linkable form; cf. usr/man/man4/*.4 in 3bsd/4.0 on CSRG CD 1); from experience (sending mail to/from 4.2BSD) this model fucking sucks, especially as disks get bigger and their lay-outs more convoluted, as they had by 4.2BSD
but this is evidently false: every attempt to get the i-node that was the target of mount will return the root directory of the mounted file system; since (unlike in later and modern systems) the file type is not cached in the directory entry, it is impossible to name the mounted-over file, or at all distinguish it from the root of the mounted file system.
"Almost always, name should be a directory so that an entire file system, not just one file, may exist on the removable device"
# chdir /tmp
# echo gaming >file1
# ln file1 file2
# ls -l
total 8
47 s-r-r- 1 sys 1664 Jan 1 00:00:00 etma
119 s-rwrw 2 root 8 Jan 1 00:00:00 file1
119 s-rwrw 2 root 8 Jan 1 00:00:00 file2
46 s-rwr- 1 root 26 Jan 1 00:00:00 ttmp
45 s-rwr- 1 root 142 Jan 1 00:00:00 utmp
# cat file1 file2
gaming
gaming
# cat >mt.s
mount = 21.
umount = 22.
sys umount; rk0
sys mount; rk0; file1
sys exit
rk0: </dev/rk0\0>
file1: <file1\0>
# as mt.s
I
II
# a.out
# ls -l
total 10
124 sxrwrw 1 root 96 Jan 1 00:00:00 a.out
47 s-r-r- 1 sys 1664 Jan 1 00:00:00 etma
41 sdrwr- 10 root 120 Jan 1 00:00:00 file1
41 sdrwr- 10 root 120 Jan 1 00:00:00 file2
120 s-rwrw 1 root 107 Jan 1 00:00:00 mt.s
46 s-rwr- 1 root 26 Jan 1 00:00:00 ttmp
45 s-rwr- 1 root 142 Jan 1 00:00:00 utmp
# ls -l file1 file2
file1:
total 82
202 sdrwr- 2 root 140 Jan 1 00:00:00 boot
197 sdrwr- 2 root 60 Jan 1 00:00:00 fort
194 sdrwr- 2 root 40 Jan 1 00:00:00 jack
192 sdrwr- 2 root 30 Jan 1 00:00:00 ken
183 sdrwr- 2 root 100 Jan 1 00:00:00 lib
209 sdrwrw 2 root 110 Jan 1 00:00:00 nab-test
42 sdrwr- 5 root 60 Jan 1 00:00:00 src
168 sdrwr- 2 root 360 Jan 1 00:00:00 sys
217 lxrwrw 2 root 36432 Jan 1 00:00:00 u
167 sxrwrw 1 root 54 Jan 1 00:00:00 x
file2:
total 82
202 sdrwr- 2 root 140 Jan 1 00:00:00 boot
197 sdrwr- 2 root 60 Jan 1 00:00:00 fort
194 sdrwr- 2 root 40 Jan 1 00:00:00 jack
192 sdrwr- 2 root 30 Jan 1 00:00:00 ken
183 sdrwr- 2 root 100 Jan 1 00:00:00 lib
209 sdrwrw 2 root 110 Jan 1 00:00:00 nab-test
42 sdrwr- 5 root 60 Jan 1 00:00:00 src
168 sdrwr- 2 root 360 Jan 1 00:00:00 sys
217 lxrwrw 2 root 36432 Jan 1 00:00:00 u
167 sxrwrw 1 root 54 Jan 1 00:00:00 x
# cat file1/x file2/x
tap x\
./fort/fc1\
./fort/fc2\
./fort/fc3\
./fort/fc4
tap x\
./fort/fc1\
./fort/fc2\
./fort/fc3\
./fort/fc4
I've never seen this mentioned or hinted at by anyone, ever. Presumably because directories can only have 1 non-.. link, and people mount over directories generally?
No modern system lets you do this.
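Why both hard links turn into the mounted root falls straight out of the iget redirect, which keys on the i-number and nothing else. A minimal sketch of just that step, assuming the device numbering from the buffer header comment (0=rf0, 1=rk0); `resolve` and `struct inode_ref` are hypothetical names, and the early `r1 == 0` return in iget is deliberately left out.

```c
enum { ROOTDIR = 41 }; /* rootdir: set to 41 and never changed */

struct inode_ref {
	int dev; /* 0 = rootfs device */
	int ino;
};

/* mnti = i-number that was mounted over, mntd = device that was mounted.
 * Any lookup of (rootfs, mnti) is silently replaced with the mounted root,
 * whichever directory entry the i-number was reached through. */
static struct inode_ref resolve(struct inode_ref want, int mnti, int mntd)
{
	if (want.dev == 0 && want.ino == mnti) {
		want.dev = mntd;
		want.ino = ROOTDIR;
	}
	return want;
}
```

file1 and file2 both name i-node 119, so after mounting over file1 both resolve to i-node 41 on rk0, while mt.s (i-node 120) is left alone.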
:login: root
root
# ls -l
total 6
58 sdrwr- 2 root 620 Jan 1 00:00:00 bin
42 sdrwr- 2 root 250 Jan 1 00:00:00 dev
48 sdrwr- 2 root 110 Jan 1 00:00:00 etc
44 sdrwr- 2 root 120 Jan 1 00:00:00 tmp
41 sxrwrw 1 root 54 Jan 1 00:00:00 usr
# cat usr
tap x\
./fort/fc1\
./fort/fc2\
./fort/fc3\
./fort/fc4
with a final look at emphasised fragments:
extern r1, r2, r3, cdev;

dskw(_ino) /* write routine for non-special files */
{
	r1 = _ino;
	iget(); /* write i-node out (if modified), read i-node 'r1' on 'cdev' into i-node area of core */
	r2 = *u.fofp + u.count; /* file offset [(u.off) or the offset in the fsp entry for this file] + no. of bytes to be written */
	if(r2 > i.size) {
		i.size = r2;
		setimod();
	}

	while(u.count) {
		mget(); /* get the block no. in which to write the next data byte */
		/* if lower 9 bits of file offset are 0, file offset = 0, 512, 1024,...(i.e., start of new block): */
		if(*u.fofp & 511 || u.count < 512) {
			dskrd(); /* if there is not enough data to fill an entire block, */
		}        /* read block 'r1' on 'cdev' into an I/O buffer */
		wslot(); /* set write and inhibit bits in I/O queue, proc. status=0, r5 points to 1st word of data */
		sioreg(); /* r3 = no. of bytes of data, r1 = address of data, r2 points to location in buffer in which to start writing data */
		_memcpy(r2, r1, r3); /* transfer a byte of data to the I/O buffer */
		dskwr(); /* yes, write the block and the i-node */
	}
}
where we observe (assuming no other processes are running that would influence the make-up of the block device I/O queue), if, say, writing to a small file with i-node 167 on rk0 that remains small throughout (which lets us discard mget cache effects from indirect blocks):
the "yes, write the block and the i-node" comment is bogus: at most, it writes the block; most likely, it just returns because there's already I/O happening to the device containing the block
at least, I think this is possible. The second half of clock is impenetrable to me.
There's no "satisfying" way to end this post. All novel results were presented in-line, and pedantry that disproves well-established consensus and unix historiosophy were rewards unto themselves. Spin File, View, Edit maybe? Comment for engagement? Call to action 3, 4, 5…?