To reiterate: you can avoid resetting the machine, and on the x million systems in data centers around the world, we do avoid resetting the machine.

But that comes with its own set of issues: kernel version x not being able to boot kernel version y (very common with Linux, no problem on Plan 9); hardware not behaving well, since few people write drivers that properly reset hardware; and hardware that can't be cleaned up absent a reset (the most common problem areas are NICs and graphics). Very few Linux drivers can properly shut down hardware for a kexec. The IOMMU and MSI-X added a whole new world of fun. Sometimes it feels like it works by accident.

Also recall that side channels across a kexec are an issue that has to be considered. At Google we've addressed them by, among other things, turning on "zero on free" and "zero on alloc" in the kernel that will kexec, and only using a small amount of memory in the first kernel (32GiB is small! Ha!). But since DRAM SPD and voltage regulator module flash provide places to hide things, it's getting messy.
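
A small sketch of checking for those settings, assuming that "zero on alloc" and "zero on free" correspond to mainline Linux's init_on_alloc=1 and init_on_free=1 boot parameters (that mapping is an assumption, not something stated here). The function name zeroing_flags is made up for illustration. It only looks at the command line, so a kernel built with the zeroing defaults turned on would not show up.

```python
def zeroing_flags(cmdline: str) -> dict:
    """Report which of init_on_alloc / init_on_free are set to 1.

    Assumption: these two mainline boot parameters are the
    "zero on alloc" / "zero on free" knobs meant in the text.
    Build-time defaults are not visible here, only the command line.
    """
    flags = {"init_on_alloc": False, "init_on_free": False}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in flags:
            # A later occurrence on the command line overrides an earlier one.
            flags[key] = (value == "1")
    return flags


# On a Linux box one could feed it the running kernel's command line:
#   with open("/proc/cmdline") as f:
#       print(zeroing_flags(f.read()))
```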

So it's not as simple as "reset bad, not reset good". Hardware is poorly designed, or the drivers are poorly written, and absent a reset, the kexec'ed kernel may fail to boot -- lockup is common, panic is common. That said, no system I know of implements kexec with a reset in the middle. 

Short history: in the Linux world, kernel boots kernel was done in 1999 by LOBOS (me), Erik Hendriks (Two Kernel Monte), and AlphaPower (DBLX). Werner Almesberger did his own thing ca. 2000 called (iirc) bootimg. Eric Biederman looked at LOBOS, did not like it, and wrote kexec, I believe around 2001. Plan 9 got kernel boots kernel around that time. As usual, the Plan 9 implementation was the most compact and cleanest. This paper https://ieeexplore.ieee.org/document/1392643 compares them.

The AlphaPower DBLX code was lost when the company went under. They made a heroic effort to get it to sourceforge but things happened too fast. DBLX means "direct boot linux" -- the acronym reads better.

The first kexec was a very general interface, with a Himalayan learning curve. At some point an Intel engineer found kexec confusing and wrote an entirely new type of kexec, with a different API, that many people found easier. 

So kexec has been around for 20 years, and we're still getting the hang of it, and there are still people who claim that it will never fully work.

Anyway, we're far afield of the original question, but it was very interesting to read how far back the idea goes! PDP-7, who knew?

P.S. As to an unrelated discussion: kernels have been self-modifying code since at least when module loaders became a thing -- that's almost 40 years. Today, especially for RISC-V, Linux is aggressively self-modifying; there's no other option for some RISC-V SoCs if you want them to work correctly. Linux rewrites the entire kernel text in early boot stages. You can think of the last stage of linker optimization as happening in Linux early boot code.

On Thu, Sep 19, 2024 at 12:13 AM Warner Losh <imp@bsdimp.com> wrote:


On Thu, Sep 19, 2024, 1:05 AM Bakul Shah via TUHS <tuhs@tuhs.org> wrote:
Can you not avoid resetting the machine? This can be treated almost as sleep in the old kernel, wakeup in the new one! You do have to reset devices individually (which may not always work if it requires assistance from some undocumented firmware).

Kexec does just this. The new kernel boots without going through the reset vector. The old kernel keeps a tiny bit of code around that tears down all the protections, etc., and hands off a mostly reset machine to the new kernel, but it doesn't go through the firmware to do it. That was the original reason for it in Linux: fast reboot times.
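
For anyone who hasn't driven it by hand, here is a rough sketch of the two-step load-then-jump sequence using the kexec-tools command line (kexec -l to stage the new kernel, kexec -e to jump to it). The helper names and the paths in the usage note are made up for illustration, and the default is a dry run that only prints the commands, since actually running kexec -e as root jumps straight into the new kernel without a clean shutdown.

```python
import subprocess


def kexec_commands(kernel, initrd=None, reuse_cmdline=True):
    """Build the argv lists for staging a kernel and then jumping to it.

    Hypothetical helper: it only assembles kexec-tools command lines.
    """
    load = ["kexec", "-l", kernel]
    if initrd:
        load.append(f"--initrd={initrd}")
    if reuse_cmdline:
        # Reuse the running kernel's command line for the new kernel.
        load.append("--reuse-cmdline")
    # Step 2: jump to the staged kernel, skipping firmware and the reset vector.
    return [load, ["kexec", "-e"]]


def run(commands, dry_run=True):
    """Print the commands, or execute them (needs root) when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
```

Calling run(kexec_commands("/boot/vmlinuz", "/boot/initrd.img")) just prints the two commands; the paths are placeholders.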

Warner

On Sep 18, 2024, at 4:58 PM, ron minnich <rminnich@gmail.com> wrote:

well, yes, on many systems, there's a lot that runs before the kernel. But if you have a risc-v system with oreboot, you own the system. The problem is that on most of these systems a reset will stop the dram clock for a little bit, or glitch clock enable, or dram power, or whatever. New systems are not designed to allow this.

Ideally, we could force a reset of everything save memory, but modern systems are not designed in this way. Most annoying.

On Wed, Sep 18, 2024 at 4:38 PM Bakul Shah <bakul@iitbombay.org> wrote:
I would prefer old kernel to new kernel handoff if it can be made to work reliably. Nowadays there are a lot of things that run before the kernel gets control. 

On Sep 18, 2024, at 3:38 PM, ron minnich <rminnich@gmail.com> wrote:

Interesting about the amiga. I'm assuming their firmware zeros memory on reset, so you have to do handoff from kernel to kernel, not via a reset and so on?

What was particularly nice about the V6/PDP-11 case: we were able to yank reset, which let us cleanly reset/disable devices, because everything was in memory when we got back. I miss the simplicity of the old machines.

On Wed, Sep 18, 2024 at 3:07 PM Christian Hopps <chopps@chopps.org> wrote:

We had/have this functionality in the Amiga port of NetBSD.

It is implemented as a `/dev/reload` device, and you copy a kernel image to it. In locore.s there's code that copies the kernel image over top of the running kernel and then restarts. I believe for it to work nothing below the copy code in locore.s can change :)
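
A toy model of that constraint, not the NetBSD code: it reads "below" as "at lower addresses than the end of the copy routine", which is an interpretation. The copy loop overwrites itself while it runs, so it only survives if the new image has identical bytes in that region. All names here (reload_is_safe, reload, copy_end) are invented for the sketch.

```python
def reload_is_safe(running: bytes, new: bytes, copy_end: int) -> bool:
    """True if the new image matches the running one up to copy_end.

    copy_end is the (hypothetical) offset just past the copy routine.
    If those bytes match, overwriting the routine mid-copy is harmless.
    """
    return len(new) >= copy_end and running[:copy_end] == new[:copy_end]


def reload(memory: bytearray, new: bytes, copy_end: int) -> None:
    """Copy the new image over the running one, refusing unsafe images."""
    if len(new) > len(memory):
        raise ValueError("new image does not fit in memory")
    if not reload_is_safe(bytes(memory), new, copy_end):
        raise ValueError("image differs at or below the copy routine")
    # Overwrite in place; a real kernel would now jump to the entry point.
    memory[:len(new)] = new
```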

Thanks,
Chris.

Phil Budne <phil@ultimate.com> writes:

> ron minnich wrote:
>> But I'm wondering: is Ed's work in 1977 the first "kernel boots kernel" or
>> was there something before?
>
> There was!  The PDP-7 UNIX listings contain a program trysys.s
> https://github.com/DoctorWkt/pdp7-unix/blob/master/src/sys/trysys.s
> that reboots the system by reading a.out into user memory (in the high
> 4K of core), then copying it to low memory and jumping to the entry
> point.  The name suggests its original intended use was to test a new
> system (kernel).
>
> P.S.
> Normal bootable system images seem to have been stored in reserved
> tracks of the (fixed head) disk (that are inaccessible via system calls):
>
> https://github.com/DoctorWkt/pdp7-unix/blob/master/src/sys/maksys.s
> reads a.out and uses I/O instructions to write it out.
>
> P.P.S.
> Accordingly, I put together a "paper tape" for booting the system:
> https://github.com/DoctorWkt/pdp7-unix/blob/master/src/other/pbboot.s
>
> P.P.P.S.
> The system (kernel) is 3K words, with the last 1K of low memory
> used for the character table for the vector graphics controller.
>
> The definitions for the table are compiled by
> https://github.com/DoctorWkt/pdp7-unix/blob/master/src/cmd/cas.s
> from definition file
> https://github.com/DoctorWkt/pdp7-unix/blob/master/src/sys/cas.in
> (after, ISTR, figuring out the ordering of the listing pages!)
>
> I don't think we ever figured out how the initial character table
> is loaded into core.  One thing that was missing from the table
> was the dispatch array, which I recreated:
> https://github.com/DoctorWkt/pdp7-unix/blob/master/src/other/chrtbl.s
>
> The system (kernel) could be built for a "cold start", reloading the
> disk (prone to head crashes?) from paper tape? But I don't think
> anyone ever reconstructed the procedure for rebuilding a disk that way.
>
> The disk was two sided, and the running system only used one side:
> https://github.com/DoctorWkt/pdp7-unix/blob/master/src/cmd/dsksav.s
> https://github.com/DoctorWkt/pdp7-unix/blob/master/src/cmd/dskres.s
> appear to be programs to save and restore the filesystem from the
> "other" side of the disk.