Hi Dan,
Thanks for the considered response. I was beginning to fear that my
musing was of moronically minimal merit.
At 2024-05-15T10:42:33-0400, Dan Cross wrote:
> On Tue, May 14, 2024 at 7:10 AM G. Branden Robinson
> <g.branden.robinson(a)gmail.com> wrote:

[snip]

> > Viewpoint 1: Perspective from Pike's Peak

> Clever.
If Rob's never heard _that_ one before, I am deeply disappointed.
> > Elementary
> >
> > Unix commands should be elementary. Unix is a kernel.
> > Programs that do simple things with system calls should remain
> > simple. This practice makes the system (the kernel interface)
> > easier to learn, and to motivate and justify to others. Programs
> > therefore test the simplicity and utility of, and can reveal flaws
> > in, the set of primitives that the kernel exposes. This is valuable
> > stuff for a research organization. "Research" was right there in
> > the CSRC's name.
> I believe this is at once making a more complex argument than was
> proffered, and at the same time misses the contextual essence that
> Unix was created in.
My understanding of that context is, "a pleasant environment for
software development" (McIlroy)[0]. My notion of software development
entails (when not under managerial pressure to bang something together
for the exploitation of "market advantage") analysis and reanalysis of
software components to make them more efficient and more composable.
As a response to the perceived bloat of Multics, the development of the
Unix kernel absolutely involved much critical reappraisal of what
_needed_ to be in a kernel, and of which services were so essential that
they must be offered.
As a microkernel Kool-Aid drinker, I tend to view Unix's origin in that
light, which was reinforced by the severe limitations of the PDP-7 where
it was born. Possibly many of the decisions about where to draw the
kernel service/userspace service line were made by instinct or seasoned
judgment, but the CSRC being a research organization, I'd be surprised
if matters of empirical measurement were far from top of mind.
It's a shame we don't have more insight into Thompson's development
process, especially in those early days. I think we have a tendency to
conceive of Unix as having sprung from his fingers already crystallized,
like a mineral Athena from the forehead of Zeus. I would wager (and
welcome correction if he has the patience) that he made and reversed
decisions based on the experience of using the system. Some episodes in
McIlroy's "A Research Unix Reader" illustrate that this was a recurring
feature of its _later_ development, so why not in the incubation period?
That, too, is empirical measurement, even if informal. Many revisions
are made in software because we find in testing that something is "too
damn slow", or runs the system out of memory too often.
So to summarize, I want to push back on your counter here. Making
little things to measure system features is a salutary practice in OS
development. Stevens's _Advanced Programming in the Unix Environment_
is, shall we say, tricked out with exhibits along these lines. The
author's dedication to _measurement_ as opposed to partisan opinion is,
I think, a major factor in its status as a landmark work and as
nigh-essential reading for the serious Unix developer to this day.
Put differently, why would anyone _care_ about making cat(1) simple if
one didn't have these objectives in mind?
> > Viewpoint 2:
> > "I Just Want to Serve 5 Terabytes"[1]
> >
> > cat(1)'s man page did not advertise the traits in the foregoing
> > viewpoint as objectives, and never did.[2] Its avowed purpose was
> > to copy, without interruption or separation, 1..n files from storage
> > to an output channel or stream (which might be redirected).
> > I don't need to convince you that this is a worthwhile application.
> > But when we think about the many possible ways--and destinations--a
> > person might have in mind for that I/O channel, we have to face the
> > necessity of buffering or performance goes through the floor.
> > It is 1978. Some VMS
> I don't know about that; VMS IO is notably slower than Unix IO by
> default. Unlike VMS, Unix uses the buffer cache to serialize access to
> the underlying storage device(s).
I must confess I have little experience with VMS (and none more recent
than 30 years ago) and offered it as an example mainly because it was
actually around in 1978 (if still fresh from the foundry).
My personal backstory is much more along the lines of my other example,
CP/M on toy computers (8-bit data bus pffffffft, right?).
> Ironically, caching here is a major win, not just for speed, but to
> make it relatively easy to reason about the state of a block, since
> that state is removed from the minutiae of the underlying storage
> device and instead handled in the bio layer. Treating the block cache
> as a fixed-size pool yields a relatively simple state machine for
> synchronizing between the in-memory and on-disk representations of
> data.
I entirely agree with this. I contemplated following up Bakul Shah's
post with a mention of Jim Gettys's work on bufferbloat.[1] So let me
do that here, and venture the opinion that a "buffer" as popularly
conceived and implemented (more or less just a hunk of memory to house
data) is too damn dumb a data structure for many of the uses to which it
is put.
If/when people address these problems, they do what the Unix buffer
cache did; they elaborate it with state. This is a repeated design
pattern: see SIGURG for example.
Off the top of my head I perceive three circumstances that buffers often
need to manage; a toy sketch follows the list.
1. Avoidance of underrun. Such were the joys of CD-R burning. But
also important in streaming or other real-time applications to avoid
interruption. Essentially you want to be able to say, "I'm running
out of data at the current rate, please supply more ASAP".
2. Avoidance of overrun. The problems of modem-like flow control are
familiar to most. An important insight here, reinforced if not
pioneered by Gettys, is that "just making the buffer bigger", the
brogrammer solution, is not always the wise choice.
3. Cancellation. Familiar to all as SIGPIPE. Sometimes all of the
data in the buffer is invalidated. The sender needs to stop
transmitting ASAP, and the receiver can discard whatever it has.
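In concrete terms, I imagine something like this (a toy sketch; every
name here is invented, and no real system is implied):

    /* Toy sketch: a fixed-capacity buffer carrying enough state to
       signal all three of the conditions above.  All names invented. */
    #include <stddef.h>

    enum buf_event { BUF_OK, BUF_WANT_MORE, BUF_WANT_LESS, BUF_CANCELLED };

    struct statebuf {
        unsigned char *data;
        size_t cap;        /* fixed capacity */
        size_t len;        /* bytes currently queued */
        size_t lo_water;   /* below this, underrun looms (case 1) */
        size_t hi_water;   /* above this, sender should slow down (case 2) */
        int cancelled;     /* queued data is void; sender must stop (case 3) */
    };

    static enum buf_event buf_assess(const struct statebuf *b)
    {
        if (b->cancelled)
            return BUF_CANCELLED;
        if (b->len < b->lo_water)
            return BUF_WANT_MORE;
        if (b->len > b->hi_water)
            return BUF_WANT_LESS;
        return BUF_OK;
    }

The point is only that the buffer itself, not some layer bolted on
later, knows which of the three conditions obtains.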
I apologize for the armchair approach. I have no doubt that much
literature exists that has covered this stuff far more rigorously. And
yet much of that knowledge has not made its way down the mountain into
practice. That, I think, was at least part of Doug's point. Academics
may have considered the topic adequately, but practitioners are too
often solving problems as if it's 1972.
[snip]

> > And this, as we all know, is one of the reasons the standard I/O
> > library came into existence. Mike Lesk, I surmise, understood that
> > the "applications programmer" having knowledge of kernel internals
> > was in general neither necessary nor desirable.
> I'm not sure about that. I suspect that the justification _may_ have
> been more along the lines of noting that many programs implemented
> their own, largely similar buffering strategies, and that it was
> preferable to centralize those into a single library, and also noting
> that building some kinds of programs was inconvenient using raw system
> calls. For instance, something like `gets` is handy,
An interesting choice given its notoriety as a nuclear landmine of
insecurity. ;-)
> but is _annoying_ to write using just read(2). It can obviously be
> done, but if I don't have to, I'd prefer not to.
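Nor would I. For the record, "annoying" looks about like this (a
sketch; the helper name is hypothetical and, faithful to the genuine
article, it never checks the caller's buffer length):

    #include <unistd.h>

    /* A gets(3)-alike over bare read(2): one system call per byte.
       Correct-ish, clumsy, and slow -- the itch stdio scratches. */
    static char *rawgets(int fd, char *buf)
    {
        char *p = buf;
        char c;
        ssize_t n;

        while ((n = read(fd, &c, 1)) == 1 && c != '\n')
            *p++ = c;   /* no bounds check, just like gets(3) */
        *p = '\0';
        return (n <= 0 && p == buf) ? NULL : buf;   /* NULL on EOF/error
                                                       with nothing read */
    }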
I think you are justifying why stdio was written _as a library_, as your
points seem to be pretty typical examples of why we move code thither
from applications. My emphasis is a little different: why was buffered
I/O in particular (when it could so easily have been string handling)
the nucleus of what would become a large standard library with its
toes in many waters, so huge that projects like uclibc and musl arose
for the purpose of (in part) chopping back out the stuff they felt they
didn't need?
My _claim_ is that stdio.h was the first piece of the library to walk
upright because the need for it was most intense. More so than with
strings; in fact we've learned that Nelson's original C string library
was tricky to use well and was often elaborated by others in unfortunate
ways.[7]
But there was no I/O at all without going through the kernel, and while
there were many ways to get that job done, the best leveraged knowledge
of what the kernel had to work with. And yet, the kernel might get
redesigned.
Could stdio itself have been done better? Korn and Vo tried.[8]
> Here's where I think this misses the mark: this focuses too much on
> the idea that simple programs exist to be tests for, and exemplars
> of, the kernel system call interface, but what evidence do you have
> for that?
A little bit of experience, long after the 1970s, of working with
automated tests for the seL4 microkernel.
> A simpler explanation is that simple programs are easier to write,
> easier to read, easier to reason about, test, and examine for
> correctness.
All certainly true. But these things are just as true of programs that
don't directly make system calls at all. cat(1), as ideally envisioned
by Pike (if I understand the Platonic ideal of his position correctly),
not only makes system calls, but dirties its hands with the standard
library as little as possible (if you recognize no options, you need
neither call nor reimplement getopt(3)) and certainly not for the
central task.
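If I have his ideal right, the whole program is more or less this (my
sketch, emphatically not his code):

    /* cat(1) as I imagine the Platonic ideal: no flags, no stdio on
       the central path, just open/read/write/close. */
    #include <fcntl.h>
    #include <unistd.h>

    static void copyfd(int fd)
    {
        char buf[8192];
        ssize_t n;

        while ((n = read(fd, buf, sizeof buf)) > 0)
            write(1, buf, n);   /* error handling elided for brevity */
    }

    int main(int argc, char *argv[])
    {
        int i, fd;

        if (argc == 1)
            copyfd(0);
        for (i = 1; i < argc; i++) {
            if ((fd = open(argv[i], O_RDONLY)) < 0)
                return 1;
            copyfd(fd);
            close(fd);
        }
        return 0;
    }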
Again I think we are not so much disagreeing as much as I'm finding out
that I didn't adequately emphasize the distinctions I was making.
> Unix amplified this with Doug's "garden hoses of data" idea and the
> advent of pipes; here, it was found that small, simple programs could
> be combined in often surprisingly unanticipated ways.
Agreed; but given that pipes-as-a-service are supplied by the _kernel_,
we are once again talking about system calls.
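Consider what the canonical plumbing exercise costs in kernel
crossings; a sketch of how a shell might wire up "echo hello | tr a-z
A-Z":

    /* pipe, fork, dup2, exec, wait: system calls, every one. */
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int pfd[2];

        pipe(pfd);
        if (fork() == 0) {              /* writer: "echo hello" */
            dup2(pfd[1], 1);
            close(pfd[0]);
            close(pfd[1]);
            execlp("echo", "echo", "hello", (char *)0);
            _exit(127);
        }
        if (fork() == 0) {              /* reader: "tr a-z A-Z" */
            dup2(pfd[0], 0);
            close(pfd[0]);
            close(pfd[1]);
            execlp("tr", "tr", "a-z", "A-Z", (char *)0);
            _exit(127);
        }
        close(pfd[0]);
        close(pfd[1]);
        while (wait(NULL) > 0)
            ;
        return 0;
    }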
One of the projects I never got off the ground with seL4 was a
reconsideration from first principles of what sorts of more or less
POSIXish buffering and piping mechanisms should be offered (in userland
of course). For those who are scandalized that a microkernel doesn't
offer pipes itself, see this Heiser piece on "IPC" in that system.[2]
> Unix built up a philosophy about _how_ to write programs that was
> rooted in the problems that were interesting when Unix was first
> created. Something we often forget is that research systems are built
> to address problems that are interesting _to the researchers who build
> them_.
I agree.
> This context can shape a system, and we see that with Unix: a
> highly synchronous system call interface, because overly elaborate
> async interfaces were hard to program;
And still are, apparently even without the qualifier "overly elaborate".
...though Go (and JavaScript?) fans may disagree.
> a simple file abstraction that was easy to use
> (open/creat/read/write/close/seek/stat) because files on other
> contemporary systems were baroque things that were difficult to use;
Absolutely. It's a truism in the Unix community that it's possible to
simulate record-oriented storage and retrieval on top of a byte stream,
but hard to do the converse.
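The easy direction fits in a handful of lines (fixed-length records;
RECSZ is an arbitrary assumption of mine):

    #include <unistd.h>

    #define RECSZ 128   /* arbitrary record size, for illustration */

    /* Fetch record n from a plain byte-stream file: the "easy"
       direction of the truism.  Returns 1 on success. */
    static int readrec(int fd, long n, char rec[RECSZ])
    {
        if (lseek(fd, (off_t)n * RECSZ, SEEK_SET) < 0)
            return 0;
        return read(fd, rec, RECSZ) == RECSZ;
    }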
Though, being a truism, it might be worthwhile to critically reconsider
it and more rigorously establish how we know what we think we know.
That's another reason I endorse the microkernel mission. Let's lower
the cost of experimentation on parts of the system that of themselves
don't demand privilege. It's a highly concurrent, NUMA world out there.
> a simple primitive for the creation of processes because, again, on
> other systems processes were very heavy, complicated things that were
> difficult to use.
It is with some dismay that I look at what they are, _on Unix_, today.
https://github.com/torvalds/linux/blob/1b294a1f35616977caddaddf3e9d28e576a1…
https://github.com/openbsd/src/blob/master/sys/sys/proc.h#L138
Contrast:
https://github.com/jeffallen/xv6/blob/master/proc.h#L65
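And yet the primitive as the user sees it remains tiny, however much
the structures behind it have swollen:

    /* The user's-eye view of process creation is still small,
       whatever struct proc has become. */
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();

        if (pid == 0) {                     /* child */
            execlp("date", "date", (char *)0);
            _exit(127);                     /* exec failed */
        }
        waitpid(pid, NULL, 0);              /* parent */
        return 0;
    }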
> Unix took problems related to IO and processes and made them easy. By
> the 80s, these were pretty well understood, so focus shifted to other
> things (languages, networking, etc).
True, but beside my point. Pike's point about cat and its flags was, I
think, a call to reconsider more fundamental things. To question what
we thought we knew--about how best to design core components of the
system, for example. Do we really need the efflorescence of options
that perfuses not simply the GNU versions of such components (a popular
sink for abuse), but Busybox and *BSD implementations as well?
Every developer of such a component should consider the cost/benefit
ratio of flags, and then RE-consider them at intervals. Even at the
cost of backward compatibility. (Deprecation cycles and
mitigation/migration plans are good.)
> Unix is one of those rare beasts that escaped the lab and made it out
> there in the wild. It became the workhorse that begat a whole two or
> three generations of commercial work; it's unsurprising that when the
> web explosion happened, Unix became the basis for it: it was there, it
> was familiar, and by then it wasn't a research project anymore, but a
> basis for serious commercial work.
Yes, and in a sense this success has cost all of us.[3][4][5]
> That it has retained the original system call interface is almost
> incidental;
In _structure_, sure; in detail, I'm not sure this claim withstands
scrutiny. Just _count_ the system calls we have today vs. V6 or V7.
> perhaps that fits with your broccoli-man analogy.
I'm unfamiliar with this metaphor. It makes me wonder how to place it
in company with the requirements documents that led to the Ada language:
Strawman, Woodenman, Ironman, and Steelman.
At least it's likely better eating than any of those. ;-)
Since no one else ever says it on this list, let me point out what a
terrific and unfairly maligned language Ada is. In reading the
minutes of the latest WG14 meeting[6] I marvel anew at how C has over
time slowly, slowly accreted type- and memory-safety features that Ada
had in 1983 (or even in 1980, before its formal standardization).
Regards,
Branden
[0] https://www.gnu.org/software/groff/manual/groff.html.node/Background.html
[1] https://gettys.wordpress.com/category/bufferbloat/
[2] https://microkerneldude.org/2019/03/07/how-to-and-how-not-to-use-sel4-ipc/
[3] https://tianyin.github.io/misc/irrelevant.pdf (guess who)
[4] https://www.youtube.com/watch?v=36myc8wQhLo (Timothy Roscoe)
[5] https://queue.acm.org/detail.cfm?id=3212479 (David Chisnall)
[6] https://www.open-std.org/JTC1/sc22/wg14/www/docs/n3227.htm
    Skip down to section 5. Note particularly `_Optional`.
[7] https://www.symas.com/post/the-sad-state-of-c-strings
[8] https://www.semanticscholar.org/paper/SFIO%3A-Safe-Fast-String-File-IO-Korn…