DEC's diags were far from perfect, but they were a hell of a
lot better than the largely-nonexistent diags available for
modern Intel-architecture systems. I am right now dealing
with a system that has an intermittent fault that causes
the OS to crash in the middle of some device driver every
so often. Other, identical systems don't crash, so I don't
think it's software. It could be the device itself, corrupting
things.
Some problems are just hard to track down. If you remember
the infamous UDA50/RA81 setup, the original Unix driver was
flaky somehow in that it would "lose" MSCP packets and hang.
I rewrote the thing from scratch to fix that problem. But
then we got an Emulex board that had a ... different problem.
I hacked on our driver to find it. The problem turned out
to be that the Emulex hardware (or firmware) would *drop*
16 of the 32 bits of a 32-bit field. In each MSCP packet,
there was a single 32-bit field where you could store arbitrary
data to be reflected back to you in a reply. The Unix driver
stored `struct buf *bp` there, if I remember right, and I
originally did as well.
Once I figured out this was being clobbered, I replaced it
with a small integer (index into "outstanding I/O table") with
check bytes. I'd log the occurrence of corruption, recover the
useful data from the 16 bits that had the right data, and we
would be on our merry way. There was no obvious pattern here
though.
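
A minimal sketch of that kind of tag, to make the idea concrete. This
is not the original driver code: the layout (an 8-bit table index plus
its complement as a check byte, duplicated in both 16-bit halves), the
table size, and all the names here are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NOUTSTANDING 64   /* hypothetical size of the outstanding I/O table */

/* One 16-bit half: table index in the high byte, its complement as check byte. */
static uint16_t half_encode(uint8_t idx)
{
    return (uint16_t)((idx << 8) | (uint8_t)~idx);
}

/* Returns the index if the check byte matches, or -1 if this half is damaged. */
static int half_decode(uint16_t half)
{
    uint8_t idx = (uint8_t)(half >> 8);
    uint8_t chk = (uint8_t)(half & 0xff);
    return (chk == (uint8_t)~idx) ? idx : -1;
}

/* Build the 32-bit reflected tag: the same checked index in both halves. */
uint32_t tag_encode(uint8_t idx)
{
    assert(idx < NOUTSTANDING);
    uint16_t half = half_encode(idx);
    return ((uint32_t)half << 16) | half;
}

/*
 * Recover the index from a reflected tag.  *corrupted is set nonzero if
 * either half failed its check or the halves disagree, so the caller can
 * log it.  Returns -1 if no trustworthy index can be recovered.
 */
int tag_decode(uint32_t tag, int *corrupted)
{
    int hi = half_decode((uint16_t)(tag >> 16));
    int lo = half_decode((uint16_t)(tag & 0xffff));

    *corrupted = (hi < 0 || lo < 0 || hi != lo);
    if (hi >= 0 && lo >= 0)
        return (hi == lo) ? hi : -1;
    return (hi >= 0) ? hi : lo;
}
```

With this layout, a half that comes back as all zeros or all ones fails
its check, so `tag_decode(tag_encode(5) & 0xffff, &bad)` still returns 5
and sets `bad`, whichever half the controller dropped.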
Two other sort of related war stories...
* We had the carry-chain timing bug on our VAX 780 at one point.
It most-consistently hit on the `extzv` instruction in the
kernel exit() handler, but only about 1 out of every 10 to 100
thousand occurrences. So I wrote a user-land program that
would spin doing that `extzv`. If the user program crashed,
the board-set installed in the backplane had the problem, and
we'd have the DEC service guy cycle through them (in the usual
"how many flat tires do we have today" dance).
* The Ultrasparc II CPU had a similar timing bug, I think in the
register forwarding logic. The BSD/OS SPARC port had a three
instruction sequence for setting up the right stack on a
trap (interrupt, system call, etc.), and it would randomly
crash with a bizarre value that I eventually figured out
was from putting the result that should have gone into one
of the %l registers, into the %sp register instead. It only
happened after a pipeline flush for other purposes and I
forget what I did to make it happen frequently enough to
diagnose.
(Re-ordering the three instructions fixed the problem.)
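
For the first of those, the spin test was conceptually something like
the sketch below. This is an illustration only: the original program is
not shown here, a C compiler is not guaranteed to emit `extzv` for this
code (on the VAX one would check the generated assembly or use inline
assembly), and the original detected the fault by crashing rather than
by counting mismatches. The function names and the `pos < 32`
restriction are assumptions.

```c
#include <stdint.h>

/*
 * Zero-extended bit-field extract: `size` bits starting at bit `pos`.
 * This is the operation the VAX extzv instruction performs.
 * Assumes pos < 32.
 */
static uint32_t extract_field(uint32_t word, unsigned pos, unsigned size)
{
    uint32_t mask = (size >= 32) ? 0xffffffffu : ((1u << size) - 1u);
    return (word >> pos) & mask;
}

/*
 * Repeat the same extract `iterations` times and count how often the
 * result differs from the first one.  On healthy hardware this is 0.
 */
unsigned long spin_extract(uint32_t word, unsigned pos, unsigned size,
                           unsigned long iterations)
{
    volatile uint32_t src = word;   /* force a fresh read every pass */
    uint32_t expected = extract_field(word, pos, size);
    unsigned long bad = 0;

    for (unsigned long i = 0; i < iterations; i++) {
        if (extract_field(src, pos, size) != expected)
            bad++;
    }
    return bad;
}
```

Given a 1-in-10,000-to-100,000 failure rate, the iteration count needs
to be in the millions before a clean run means much.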
Tying back into ZFS etc., if that was on this mailing list: :-)
I had a bad DIMM in an Intel box a while back, that corrupted
data in the kernel buffer pool. That one was scary, because,
while the memtest86 tests found it, who knows what data it had
already corrupted?
(This is why I want ECC, even in my home systems.)
Chris