On Thu, Dec 14, 2023 at 7:07 PM Noel Chiappa <jnc(a)mercury.lcs.mit.edu> wrote:
>> Now I'd probably call them kernel threads as they don't have a separate
>> address space.
> Makes sense. One query about stacks, and blocking, there. Do kernel threads,
> in general, have per-thread stacks, so that they can block (and later resume
> exactly where they were when they blocked)?
>
> That was the thing that, I think, made kernel processes really attractive as
> a kernel structuring tool; you get code like this (from V6):
>
>     swap(rp->p_addr, a, rp->p_size, B_READ);
>     mfree(swapmap, (rp->p_size+7)/8, rp->p_addr);
>
> The call to swap() blocks until the I/O operation is complete, whereupon that
> call returns, and away one goes. Very clean and simple code.
Assuming we're talking about Unix, yes, each process has two stacks:
one for userspace, one in the kernel.
The way I've always thought about it, every process has two parts: the
userspace part, and a matching thread in the kernel. When Unix is
running, it is always running in the context of _some_ process (modulo
early boot, before any processes have been created, of course).
Furthermore, when the process is running in user mode, the kernel
stack is empty. When a process traps into the kernel, it's running on
the kernel stack for the corresponding kthread.
Processes may enter the kernel in one of two ways: directly, by
invoking a system call, or indirectly, by taking an interrupt. In the
latter case, the kernel simply runs the interrupt handler within the
context of whatever process happened to be running when the interrupt
occurred. In both cases, one usually says that the process is either
"running in userspace" (ie, normal execution of whatever program is
running in the process) or "running in the kernel" (that is, the
kernel is executing in the context of that process).
Note that this affects behavior around blocking operations.
Traditionally, Unix device drivers had a notion of an "upper half" and
a "lower half." The upper half is the code that is invoked on behalf
of a process requesting services from the kernel via some system call;
the lower half is the code that runs in response to an interrupt for
the corresponding device. Since it's impossible in general to know
what process is running when an interrupt fires, it was important not
to perform operations that would cause the current process to be
unscheduled in an interrupt handler; hence the old adage, "don't sleep
in the bottom half of a device driver" (where sleep here means sleep
as in "sleep and wakeup", a la a condition variable, not "sleep for
some amount of time"): you would block some random process, which may
never be woken up again!
An interesting aside here is signals. We think of them as an
asynchronous mechanism for interrupting a process, but their delivery
must be coordinated by the kernel; in particular, if I send a signal
to a process that is running in userspace, it (typically) won't be
delivered right away; rather, it will be delivered the next time the
process is scheduled to run, as the process must enter the kernel
before delivery can be effected. Signal delivery is a synthetic event,
unlike the delivery of a hardware interrupt, and the upcall happens in
userspace.
> Use of a kernel process probably makes the BSD pageout daemon code fairly
> straightforward, too (well, as straightforward as anything done by Berzerkly
> was :-).
>
> Interestingly, other early systems don't seem to have thought of this
> structuring technique. I assumed that Multics used a similar technique to
> write 'dirty' pages out, to maintain a free list. However, when I looked in
> the Multics Storage System Program Logic Manual:
>
> http://www.bitsavers.org/pdf/honeywell/large_systems/multics/AN61A_storageS…
>
> Multics just writes dirty pages as part of the page fault code: "This
> starting of writes is performed by the subroutine claim_mod_core in
> page_fault. This subroutine is invoked at the end of every page fault." (pg.
> 8-36, pg. 166 of the PDF.) (Which also increases the real-time delay to
> complete dealing with a page fault.)
Note that this says, "starting of writes." Presumably, the writes
themselves were asynchronous; this just initiates the operations. It
certainly adds latency to the page fault handler, but not as much as
waiting for the operations to complete!
> It makes sense to have a kernel process do this; having the page fault code
> do it just makes that code more complicated. (The code in V6 to swap
> processes in and out is beautifully simple.) But it's apparently only obvious
> in retrospect (like many brilliant ideas :-).
I can kinda sorta see a method in the madness of the Multics approach.
If you think that page faults are relatively rare, and initiating IO
is relatively cheap but still more expensive than executing "normal"
instructions, then it makes some sense that you might want to amortize
the cost of that by piggybacking one on the other. Of course, that's
just speculation and I don't really have a sense for how well that
worked out in Multics (which I have played around with and read about,
but still seems largely mysterious to me). In the Unix model, you've
got scheduling latency to deal with before the pageout daemon runs; of
course, that all happened as part of a context switch, and in early
Unix there was no demand paging (and so I suppose page faults were
considered fatal).
That said, using threads as an organizational metaphor for structured
concurrency in the kernel is wonderful compared to many of the
alternatives (hand-coded state machines, for example).
- Dan C.