Ted -- thank you -- excellent write up. Love it and I could not agree
more!! Your 'worse is better' is the same idea as what is 'good
enough,' an argument I used to have at DEC, where being 'perfect' cost
years, and in the end we lost because of it.
FWIW: I put it slightly differently: make sure you pick a couple of things
that matter, and nail them... be the best on those >>few<< items but then
the rest needs to be 'good enough' and in time you can make those parts
better. But if you wait for all parts to be great, or worse, 'perfect' -
it doesn't matter -- so says an ex-Alpha guy now working on INTEL*64 --
sigh...
BTW: Not to quibble, but you might also remember, traditional UNIX took the
same path as you did. Ken's v6/v7 FS were not that great about write
ordering either. Remember, FFS replaced Ken's work 15 years down the
road. Kirk did not do all the careful ordered write stuff until after
George Goble taught us all how to, which was a few years later. When
Kirk implemented FFS, it was after the Purdue patches had been released to
make Unix's original FS more 'reliable' - and yes - Ted (Kowalski) and my
version of the original fsck was not nearly as careful as you were years
later. But again, we were a huge step forward from what had been there at
the time. That said -- your stuff, as Larry has pointed out, was rock solid
and in practice 'just worked.' Certainly post ext3, I do not have memory of
losing any real user data on any of my Linux boxes, and an error I made has
definitely been the cause of them crashing over the years ;-)
Also, WRT making the HW properly, and a lot of PC HW being trash -- yup
- which comes back to the what-is-good-enough issue. DEC did the same
thing for a long time, as did SGI, in making rock solid HW. Heck, DEC
somewhat cornered the SCSI disk business for the mid-range and upper end
rack world in the 90s. Like SGI's and Sun's OSes of the day, Tru64 had
very good DMA controllers under the covers that were hardened and lab
tested for corner cases. In fact, it is one of the reasons why, while
Tru64 could detect an Adaptec controller @ boot up and actually use it (I
had one on my workstation), it was not officially in the 'SPD' as a
supported device: the HW failed as you described, and TruClusters in
particular could not make a proper DLM due to the failure modes of the
Adaptec (now, as I used to point out to the marketing and HW weenies, no
sane person was going to put a $150 SCSI controller in their $.5M
TruCluster system - so we could have allowed it, just not allowed it in
some configs - and made it clear in the SPD -- Adaptec only in configs XXX).
As a result, the issue plays out this way... and then DEC's VPs used to say
you could not make & sell an Alpha for under $5K (one person in particular
who I will leave nameless - those who were there all know to whom I refer).
My last act before I left DEC/Compaq for Paceline was to make the $1K
Alpha using a $799 (end user) Compaq system with the K7 on the motherboard
swapped with an EV6 and some mechanical shims, Adaptec SCSI BT (I still
have the motherboard @ home, and the EV6 is on my desk at Intel). It was
built using PC parts - case, power supply et al. The key was the Alpha
@ $5K was a better physically built system than the $799 based PC -- but who
cared ... your closing para WRT WiFi and PCMCIA summed up the issue
pretty well.
Clem
On Sun, May 14, 2017 at 12:30 AM, Theodore Ts'o <tytso(a)mit.edu> wrote:
On Thu, May 11, 2017 at 03:25:47PM -0700, Larry McVoy wrote:
This is one place where I think Linux kicked Unix's ass. And I am not
really sure how they did it, I have an idea but am not positive. Unix
file systems up through UFS as shipped by Sun, were all vulnerable to
what I call the power out test. Untar some big tarball and power off
the machine in the middle of it. Reboot. Hilarity ensues (not).
You were dropped into some stand alone shell after fsck threw up its
hands and it was up to you to fix it. Dozens and dozens of errors.
It was almost always faster to go to backups because figuring that
stuff out, file by file (which I have done more than once), gets you
to the point that you run "fsck -y" and go poke at lost+found when
fsck is done, realize that there is no hope, and reach for backups.
Try the same thing with Linux. The file system will come back, starting
with, I believe, ext2.
My belief is that Linux orders writes such that while you may lose data
(as in, a process created a file, the OS said it was OK, but that file
will not be in the file system after a crash), the rest of the file
system will be consistent. I think it's as if you powered off the
machine a few seconds earlier than you actually did, some stuff is in
flight and until they can write stuff out in the proper order you may
lose data on a hard reset.
So the story is a bit complicated here, and may be an example of
"worse is better" --- which is ironically one of those things which is
used as an explanation for why BSD/Unix won even though Lisp was
technically superior[1] --- but in this case, it's Linux that did
something "dirty", and BSD that did something that was supposed to be
the "better" solution.
[1] https://www.jwz.org/doc/worse-is-better.html
So first let's talk about ext2 (which indeed, does not have file
system journalling; that came in ext3). The BSD Fast File System goes
to a huge amount of effort to make sure that writes are sent to the
disk in exactly the right order so that fsck can actually fix things.
This requires that the disk not reorder writes (e.g., write caching is
disabled or in write-through mode). Linux, in ext2, didn't bother
with trying to get the write order correct at all. None. Nada. Zip.
Writes would go out in whatever order dictated by the elevator
scheduler, and so on a power failure or a kernel crash, the order in
which metadata writes would be sent to the disk was completely
unconstrained.
Sounds horrible, right? In many ways, it was. And I lost count of
how often NetBSD and FreeBSD users would talk about how primitive and
horrible ext2 was in comparison to FFS, which had all of this
excellent engineering work to make sure writes happened in the correct
order such that fsck was guaranteed to always be able to fix things.
So why did Linux get away with it? When I wrote the fsck for ext2, I
knew that anything can and would happen, so it was implemented so that
it was extremely paranoid about not ever losing any data. And if
there was a chance that an expert could recover the data, e2fsck would
stop and ask the system administrator to take a look. In the case
that the user ran with fsck -y, the default was to drop files into
lost+found, whereas in some cases the FFS fsck "knew" from the order in
which writes were staged out that the right thing to do was to let the
unlink complete, so it would let the refcount go to zero, or stay at zero.
The other thing that we did in Linux is that I made sure we had a
highly functional "debugfs" tool. This tool served two purposes. The
first was it made it very easy for me to create a regression test suite
for fsck. As far as I know, none of the other major file systems at
the time had an fsck with a regression test suite --- and I was
religious about adding tests as I added functionality, and as I fixed
bugs. The debugfs tool made it easy for me to create test case file
systems that were corrupted in various interesting ways. The other use
of debugfs was that it made it easy for experts to do file system
recovery after a crash, if there was some really precious file that
they needed to try to recover.
So this is why this is a great example of "worse is better". In Linux,
ext2 was ***incredibly*** sloppy about how it handled write ordering
--- it didn't do anything at all. But as a consequence we developed
tools that were extremely good to compensate, and in practice, it was
extremely rare (although it did happen on occasion) that files would
get lost or the file system could end up in a state where fsck would
not be able to recover without manual intervention by a system
administrator using debugfs.
But the other thing to note here is that in the PC era, most disk
drives ran with write caching enabled, with writeback caching so that
the hard drive could do its own elevator scheduling. So having a file
system that very carefully scheduled writes to make sure they happened
in the right order didn't help you a *bit* unless you configured your
hard drive to disable writeback caching --- at which point you would
take a massive speed hit.
This is ultimately also one of the weaknesses of Soft Updates --- it requires
that you disable writeback caching, since it works by letting the OS
control the order in which writes hit stable storage. With
journalling you don't have to do that; but the tradeoff is that when
you do a journal commit, you typically need two cache flush
operations. (Or a cache flush followed by a FUA write of the commit
block, if the disk supports FUA.)
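A minimal sketch of that commit sequence, assuming the journal is just an
append-only file and using os.fsync() as a stand-in for the disk cache
flush (a real kernel sends the flush, or the FUA write, to the device
itself). The names commit_transaction and demo, and the b"COMMIT" marker,
are made up for illustration; this is not ext3's actual journal format.

```python
import os
import tempfile

BLOCK = 4096  # block size, matching the 4k blocks discussed in this thread


def commit_transaction(f, blocks):
    """Append journalled blocks, flush, then append a commit block and flush.

    Hypothetical layout: N data blocks followed by one commit block.
    """
    for blk in blocks:
        assert len(blk) == BLOCK
        f.write(blk)
    f.flush()
    os.fsync(f.fileno())  # flush 1: journalled blocks are stable first

    # Only now is it safe to write the record that declares them valid.
    f.write(b"COMMIT".ljust(BLOCK, b"\0"))
    f.flush()
    os.fsync(f.fileno())  # flush 2: the commit record itself is stable


def demo():
    """Commit two blocks and report (blocks in journal, commit marker)."""
    with tempfile.TemporaryFile() as f:
        commit_transaction(f, [b"A" * BLOCK, b"B" * BLOCK])
        f.seek(0)
        data = f.read()
    return len(data) // BLOCK, data[2 * BLOCK:2 * BLOCK + 6]
```

The first flush is the one that matters for correctness: without it the
drive is free to write the commit block before the blocks it describes,
and a crash in between leaves a commit record pointing at garbage.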
There is another example of how Linux embraced the "worse is better"
philosophy in ext3, and that has to do with how we do journalling.
The sophisticated way to do journalling is to do logical journalling.
This is where what you write in the journal is "set bit XXX in the
allocation bitmap", or "update the mtime to YYYY". And in this way,
you can batch multiple file system operations into a single block
written to the journal. Solaris/UFS and Irix use this much more
sophisticated form of journalling. (Actually, older versions of
Solaris did use volume-level journalling, which is basically what
ext3/ext4 uses, but they upgraded to the much more "right", more
advanced thing, which is logical journalling.)
Ext3 uses physical, or volume-level journalling. This journalling
works on the block level --- so if we flip a bit in an allocation
bitmap, we log the entire 4k block to the journal. By default, we
only do a journal commit every five seconds (unless an fsync happens
first), so there could be multiple changes to a single inode table
block that can be batched together, but it's still true that for a
given metadata-heavy workload, a file system which uses logical
journalling will tend to require many fewer blocks written to the journal
than a file system such as ext3/ext4 which uses physical block
journalling.
Why did Linux get away with it? Number one, most workloads don't
really modify metadata all _that_ intensively, and 12k of sequential
writes versus 32k of sequential writes doesn't actually take that much
more time. Secondly, Ted's law of PC-class hardware ("most PC-class
hardware is crap") comes into play, and turns physical journalling
into an advantage. PC class hardware tends not to have power fail
interrupts, and when power drops, and the voltage levels on the power
rails start drooping, DRAM tends to go insane and starts returning
garbage long before the DMA engine and the hard drive stops
functioning.
So if your system is doing logical journalling, after the file system
commits a transaction, it will start writing the inode table block to
the permanent location on disk. If at that point you get a power
drop, garbage can get written to the inode table block, and if the
file system is using logical journalling, on reboot the mtime field
can get updated from the logical journal --- but the rest of the inode
table block is still garbage.
In contrast, since ext3 was using physical block journalling, even if
various metadata blocks get corrupted due to writes from failing DRAM
during a power drop, when we replay the journal, this will restore the
entire metadata block, and Things Just Work.
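The difference between the two replay styles can be shown with a toy
model. This is only a sketch under assumptions: a metadata block is a
4096-byte string, MTIME_OFF is a hypothetical field offset, and the
function names are invented for the example. It is not ext3 or XFS code.

```python
BLOCK = 4096
MTIME_OFF = 16  # hypothetical offset of a 4-byte mtime field in the block


def replay_logical(disk_block, record):
    """Logical record is (offset, bytes): only the named field is rewritten."""
    off, value = record
    out = bytearray(disk_block)
    out[off:off + len(value)] = value
    return bytes(out)


def replay_physical(disk_block, record):
    """Physical record is a full block image: it replaces whatever is on disk."""
    return bytes(record)


def demo():
    """Return (logical replay recovered block?, physical replay recovered block?)."""
    good = bytearray(b"\x11" * BLOCK)  # the block the file system intended
    good[MTIME_OFF:MTIME_OFF + 4] = b"\x00\x00\x00\x2a"
    good = bytes(good)

    garbage = b"\xff" * BLOCK  # what failing DRAM left at the final location

    logical = replay_logical(garbage, (MTIME_OFF, good[MTIME_OFF:MTIME_OFF + 4]))
    physical = replay_physical(garbage, good)
    return logical == good, physical == good
```

Here demo() returns (False, True): the logical replay fixes the mtime
bytes but leaves the rest of the block as garbage, while the physical
replay overwrites the whole block with the journalled image.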
I have talked to an XFS engineer from SGI, and this was definitely a
thing which SGI discovered the hard way. After they discovered this
problem, they added extra capacitors to the power supply, added a
power fail interrupt, and taught Irix so that when the power fail
interrupt was triggered, it would frantically cancel DMA transfers in
order to avoid this problem. I do not know how many of the other
Legacy Unix systems figured out this failure mode --- and I can't
claim that we were brilliant enough to design a system to avoid this
problem. It just so happened that the brute-force design that we
chose was very well suited for crappy (but way cheaper than a Sun Fire
E10k :-) PC-class hardware.
I copied Ted, who had his fingers deep in that code, maybe he can correct
me where I got it wrong. Details aside, I think this is a place where
Linux moved the state of the art significantly forward. There are other
places but this one is a big deal IMHO, maybe the biggest deal.
So I'm not really sure we can claim to have "moved the state of the
art". There certainly weren't any brilliant computer science
innovations here. That sort of thing is more like Soft Updates, of
which Valerie Aurora (formerly Henson) once wrote,
"I've read this paper at least 15 times, and each time I when get
to page 7, I'm feeling pretty good and thinking, "Yeah, okay, I
must be smarter now than the last time I read this because I'm
getting it this time," - and then I turn to page 8 and my head
explodes." --- https://lwn.net/Articles/339337/
I will be the first to admit that with ext2/ext3/ext4, especially in
the early days, it was much more about brute force engineering, and
regression testing, and much less about "moving the state of the art".
Certainly those of us who were working on Linux weren't trying to get
papers published in peer reviewed journals or conferences! (And I've
always thought that Greg Ganger was _way_ smarter than I. :-)
And if the Lisp Machine hackers looked down on BSD, and complained
that BSD adopted the "Worse is Better" philosophy, while Lisp strived
for the true, elegant, Correct technical solution, it's perhaps
especially interesting to consider that if anything, Linux was an even
more radical example of the "Worse is Better" philosophy.
Cheers,
- Ted
P.S. There is yet another example of "Worse is Better" in how Linux
had PCMCIA support several years before FreeBSD/NetBSD. However, if
you ejected a PCMCIA card in a Linux system, there was a chance (in
practice it worked out to be about 1 in 5 times for a WiFi card, in
my experience) that the system would crash. The *BSD's took a good
2-3 years longer to get PCMCIA support, but when they did, it was rock
solid. Of course, if you are a laptop user, and are happy to keep
your 802.11 PCMCIA card permanently installed, guess which OS you were
likely to prefer --- "sloppy but works, mostly", or "it'll get there
eventually, and will be rock solid when it does, but zip, nada, right now"?
--lm
On Thu, May 11, 2017 at 04:37:29PM -0400, Ron Natalie wrote:
> I remember the pre-fsck days. It was part of my test to become an
> operator at the UNIX site at JHU that I could run the various manual
> checks.
>
> The V6 file system wasn't exactly stable during crashes (lousy database
> behavior), so there was almost certainly something to clean up.
>
> The first thing we'd run was icheck. This runs down the superblock
> freelist and all the allocated blocks in the inodes. If there were
> missing blocks (not in a file or the free list), you could use icheck -s
> to rebuild it. Similarly, if you had duplicated allocations in the
> freelist or between the freelist and a single file. Anything more
> complicated required some clever patching (typically, we'd just mount
> readonly, copy the files, and then blow them away with clri).
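The check icheck performs can be sketched as a tally of who claims each
block. This is a toy model, not the V6 source: the free list and the
per-inode block lists are plain Python containers, and the function name
and argument shapes are assumptions made for the example.

```python
from collections import Counter


def icheck(all_blocks, free_list, inode_blocks):
    """Return (missing, dups) for a toy file system.

    all_blocks:   iterable of every data block number on the disk
    free_list:    block numbers found on the superblock free list
    inode_blocks: dict mapping i-number -> list of blocks that inode claims
    """
    claims = Counter(free_list)
    for blocks in inode_blocks.values():
        claims.update(blocks)

    # Missing: claimed by nobody (neither a file nor the free list).
    missing = sorted(b for b in all_blocks if claims[b] == 0)
    # Dups: claimed more than once, in any combination of files and free list.
    dups = sorted(b for b, n in claims.items() if n > 1)
    return missing, dups
```

For example, icheck(range(10, 16), [10, 11], {1: [12, 13], 2: [13, 11]})
reports blocks 14 and 15 as missing and blocks 11 and 13 as dups. Missing
blocks are the easy case, since rebuilding the free list picks them up
again; dups are the ones that needed the clever patching.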
>
> Then you'd run dcheck. As mentioned, dcheck walks the directory path
> from the top of the disk counting inode references that it reconciles
> with the link count in the inode. Occasionally we'd end up with a 0-0
> inode (no directory entries, but allocated -- typically this is caused
> by people removing a file while it is still open, a regular practice of
> some programs for their /tmp files). clri again blew these away.
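A rough sketch of that reconciliation, following the description above
rather than the actual dcheck source. Directories are modelled as a dict
of (name, i-number) entries and the stored link counts as another dict;
ROOT and all the names here are assumptions for the example.

```python
from collections import Counter

ROOT = 1  # hypothetical root i-number for this toy example


def dcheck(directories, link_counts, root=ROOT):
    """Walk directories from the root and compare references to link counts.

    directories: dict mapping directory i-number -> list of (name, i-number)
    link_counts: dict mapping every allocated i-number -> stored link count
    Returns (mismatched, zero_zero), where mismatched maps an i-number to
    (entries found, stored link count) and zero_zero lists allocated inodes
    with no entries and a zero link count.
    """
    refs = Counter()
    stack, seen = [root], {root}
    while stack:
        d = stack.pop()
        for _name, ino in directories[d]:
            refs[ino] += 1
            if ino in directories and ino not in seen:
                seen.add(ino)
                stack.append(ino)

    mismatched = {ino: (refs[ino], n)
                  for ino, n in link_counts.items() if refs[ino] != n}
    zero_zero = sorted(ino for ino, n in link_counts.items()
                       if n == 0 and refs[ino] == 0)
    return mismatched, zero_zero
```

With a root directory holding ".", "..", "tmp" (inode 2) and "a" (inode
3), a tmp directory holding ".", ".." and "b" (also inode 3), and stored
link counts {1: 3, 2: 2, 3: 1, 4: 0}, this reports inode 3 as found twice
but counted once, and inode 4 as a 0-0 inode.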
>
> Clri wrote zeros all over the inode. This had the effect of wiping out
> the file, but it was dangerous if you got the i-number wrong. We
> replaced it with "clrm" which just cleared the allocated bit, a lot
> easier to reverse.
>
> If you really had a mess of a file system, you might get a piece of the
> directory tree broken off from a path to the root. Or you'd have an
> inode that icheck reported dups. ncheck would try to reconcile an
> inumber into an absolute path.
>
> After a while a program called fsdb came around that allowed you to poke
> at the various file system structures. We didn't use it much because by
> the time we had it, fsck was fast on its heels.
--
---
Larry McVoy lm at mcvoy.com http://www.mcvoy.com/lm