I recently upgraded my machines to fc34. I just did a stock
uncomplicated installation using the defaults and it failed miserably.
Fc34 uses btrfs as the default filesystem so I thought that I'd give it
a try. I was especially interested in the automatic checksumming because
the majority of my storage is large media files and I worry about bit
rot in seldom used files. I have been keeping a separate database of
file hashes and in theory btrfs would make that automatic and transparent.
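For readers who have not kept such a database, here is a minimal sketch of the idea. It is not the tool described above; the function names (file_digest, build_manifest, changed_files) and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def changed_files(old, new):
    """List files present in both manifests whose digests differ."""
    return sorted(p for p in old if p in new and old[p] != new[p])
```

A digest mismatch only says the bytes changed. Telling bit rot from a legitimate edit would need extra information, such as the modification time recorded alongside each digest.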
I have 32T of disk on my system, so it took a long time to convert
everything over. A few weeks after I did this I went to unload my
camera and couldn't because the filesystem that holds my photos was
mounted read-only. WTF? I didn't do that.
After a bit of poking around I discovered that btrfs SILENTLY remounted the
filesystem because it had errors. Sure, it put something in a log file,
but I don't spend all day surfing logs for things that shouldn't be going
wrong. Maybe my expectation that filesystems just work is antiquated.
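One way to avoid surfing logs is to poll the mount table. The sketch below assumes the Linux /proc/mounts layout (device, mount point, type, comma-separated options); the function name read_only_mounts and the default list of filesystem types are made up for this example.

```python
def read_only_mounts(mounts_text, fstypes=("btrfs", "ext4")):
    """Return mount points whose options include 'ro'.

    mounts_text is the content of /proc/mounts: one mount per line,
    with whitespace-separated fields (device, mount point, type,
    options, ...). Spaces inside paths are escaped there, so a plain
    split() is enough.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        if fstype in fstypes and "ro" in options.split(","):
            found.append(mount_point)
    return found
```

Run periodically on the output of open("/proc/mounts").read(), this could send mail whenever the list is non-empty. It cannot tell a deliberate read-only mount from one forced by an error, so anything intentionally mounted read-only would have to be excluded by hand.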
This was on a brand new 16T drive, so I didn't think that it was worth
the month that it would take to run the badblocks program which doesn't
really scale to modern disk sizes. Besides, SMART said that it was fine.
Although it's been discredited by some, I'm still a believer in "stop and
fsck" policing of disk drives. Unmounted the filesystem and ran fsck to
discover that btrfs had to do its own thing. No idea why; I guess some
think that incompatibility is a good thing.
Ran "btrfs check" which reported errors in the filesystem but was otherwise
useless BECAUSE IT DIDN'T FIX ANYTHING. What good is knowing that the
filesystem has errors if you can't fix them?
Near the top of the manual page it says:
Warning
Do not use --repair unless you are advised to do so by a developer
or an experienced user, and then only after having accepted that
no fsck can successfully repair all types of filesystem corruption. Eg.
some other software or hardware bugs can fatally damage a volume.
Whoa! I'm sure that operators are standing by, call 1-800-FIX-BTRFS.
Really? Is this a ploy by the developers to form a support business?
Later on, the manual page says:
DANGEROUS OPTIONS
--repair
enable the repair mode and attempt to fix problems where possible
Note there’s a warning and 10 second delay when this option
is run without --force to give users a chance to think twice
before running repair, the warnings in documentation have
shown to be insufficient
Since when is it dangerous to repair a filesystem? That's a new one to me.
Having no option other than not being able to use the disk, I ran btrfs
check with the --repair option. It crashed. Lesson so far is that
trusting my data to an unreliable unrepairable filesystem is not a good
idea. Since this was one of my media disks I just rebuilt it using ext4.
Last week I was working away and tried to write out a file to discover
that /home and /root had become read-only. Charming. Tried rebooting,
but couldn't since btrfs filesystems aren't checked and repaired. Plugged
in a flash drive with a live version, managed to successfully run --repair,
and rebooted. Lasted about 15 minutes before flipping back to read only
with the same error.
Time to suck it up and revert. Started a clean reinstall. Got stuck
because it crashed during disk setup with anaconda giving me a completely
useless big python stack trace. Eventually figured out that it was
unable to delete the btrfs filesystem that had errors so it just crashed
instead. Wiped it using dd; nice that some reliable tools still survive.
Finished the installation and am back up and running.
Any of the rest of you have any experiences with btrfs? I'm sure that it
works fine at large companies that can afford a team of disk babysitters.
What benefits does btrfs provide that other filesystem formats such as
ext4 and ZFS don't? Is it just a continuation of the "we have to do
everything ourselves and under no circumstances use anything that came
from the BSD world" mentality?
So what's the future for filesystem repair? Does it look like the past?
Is Ken's original need for dsw going to rise from the dead?
In my limited experience btrfs is a BiTteR FileSystem to swallow.
Or, as Saturday Night Live might put it: And now, linux, starring the
not ready for prime time filesystem. Seems like something that's been
under development for around 15 years should be in better shape.
Jon
...
DEC Diagnostics would run on a beached whale
?
Anyone remember and/or know?
(It seems to apply to other manufacturers' diagnostics as well, even today.)
Thanks,
Arnold
I hope that this does not start any kind of language flaming and that if
something starts the moderator will shut it down quickly.
Where did the name for abort(3) and SIGABRT come from? I believe it was
derived from the IBM term ABEND, but would like to know one way or the
other.
Clem Cole:
I believe the line was: "running DEC Diagnostics is like kicking a dead
whale down the beach."
As for who said it, I'm not sure, but I think it was someone like Rob
Kolstad or Henry Spencer.
=====
The nearest I can remember encountering before was a somewhat
different quote, attributed to Steve Johnson:
Running TSO is like kicking a dead whale down the beach.
Since scj is on this list, maybe he can confirm that part.
I don't remember hearing it applied to diagnostics. I can
imagine someone saying it, because DEC's hardware diags were
written by hardware people, not software people; they required
a somewhat arcane configuration language, one that made more
sense if you understood how the different pieces of hardware
connected together.
I learned to work with it and found it no less usable than,
say, the clunky verbose command languages of DEC's operating
systems; but I have always preferred to think in low levels.
DEC's diags were far from perfect, but they were a hell of a
lot better than the largely-nonexistent diags available for
modern Intel-architecture systems. I am right now dealing
with a system that has an intermittent fault, that causes
the OS to crash in the middle of some device driver every
so often. Other identical systems don't, so I don't think
it's software. Were it a PDP-11 or a VAX I'd fire up the
diagnostics for a while, and have at least a chance of spotting
the problem; today, memtest is about the only such option,
and a solid week of running memtest didn't shake out anything
(reasonably enough, who says it's a memory problem?).
Give me XXDP, not just the Blue Screen of Death.
Norman Wilson
Toronto ON
Not to get into what is something of a religious war,
but this was the paper that convinced me that silent
data corruption in storage is worth thinking about:
http://www.cs.toronto.edu/~bianca/papers/fast08.pdf
A key point is that the character of the errors they
found suggests it's not just the disks one ought to worry
about, but all the hardware and software (much of the latter
inside disks and storage controllers and the like) in the
storage stack.
I had heard anecdotes long before (e.g. from Andrew Hume)
suggesting silent data corruption had become prominent
enough to matter, but this paper was the first real study
I came across.
I have used ZFS for my home file server for more than a
decade; presently on an antique version of Solaris, but
I hope to migrate to OpenZFS on a newer OS and hardware.
So far as I can tell ZFS in old Solaris is quite stable
and reliable. As Ted has said, there are philosophical
reasons why some prefer to avoid it, but if you don't
subscribe to those it's a fine answer.
I've been hearing anecdotes since forever about sharp
edges lurking here and there in BtrFS. It does seem
to be eternally unready for production use if you really
care about your data. It's all anecdotes so I don't know
how seriously to take it, but since I'm comfortable with
ZFS I don't worry about it.
Norman Wilson
Toronto ON
PS: Disclosure: I work in the same (large) CS department
as Bianca Schroeder, and admire her work in general,
though the paper cited above was my first taste of it.
This may be due to logic similar to that of a classic feature that I
always deemed a bug: troff begins a new page when the current page is
exactly filled, rather than waiting until forced by content that
doesn't fit. If this condition happens at the end of a document, a
spurious blank page results. Worse, if the page header happens to
change just after the exactly filled page, the old heading will be
produced before the new heading is read.
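The blank-page effect can be seen in a toy model. This is only a sketch of the two policies, not troff's actual trap machinery; the function paginate and its eager flag are invented for the illustration, and page_len is assumed to be at least 1.

```python
def paginate(lines, page_len, eager):
    """Split lines into pages of at most page_len lines.

    eager=True  : open a new page the moment the current one is full
                  (the behavior described above).
    eager=False : open a new page only when a line arrives that
                  does not fit.
    """
    pages = [[]]
    for line in lines:
        if not eager and len(pages[-1]) == page_len:
            pages.append([])  # forced: this line does not fit
        pages[-1].append(line)
        if eager and len(pages[-1]) == page_len:
            pages.append([])  # page exactly filled: break right away
    return pages
```

With four lines and a two-line page, the eager policy ends with an empty third page while the lazy one stops at two. With three lines the two policies agree, which is why the problem only shows up when the text exactly fills the last page.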
Doug
> fork() is a great model for a single-threaded text processing pipeline to do
> automated typesetting. (More generally, anything that is a straightforward
> composition of filter/transform stages.) Which is, y'know, what Unix is *for*.
> It's not so great for a responsive GUI in front of a multi-function interactive program.
"Single-threaded" is not a term I would apply to multiple processes in
a pipeline. If you mean a single track of data flow, fine, but the
fact that that's a prevalent configuration of cooperating processes in
Unix is an artifact of shell syntax, not an inherent property of
pipe-style IPC. The cooperating processes in Rob Pike's 20th century
window systems and screen editors, for example, worked smoothly
without interrupts or events - only stream connections. I see no
abstract distinction between these programs and "stuff people play
with on their phones."
It bears repeating, too, that stream connections are much easier to
reason about than asynchronous communication. Thus code built on
streams is far less vulnerable to timing bugs.
At last a prince has come to awaken the sleeping beauty of stream
connections. In Go (Pike again) we have a widely accepted programming
language that can fully exploit them, "[w]hich is, y'know, what Unix
is 'for'."
(If you wish, you may read "process" above to include threads, but
I'll stay out of that.)
Doug
Steve Simon:
once again i am taken aback at the good taste of the residents of the unix room.
As a whilom denizen of that esteemed playroom, I question
both the accuracy and the relevance of that metric.
Besides, what happened to the sheep shelf? Was it scrubbed
away after I left? And, Ken, whatever happened to Dolly the
Sheep (after she was hidden to avoid upsetting visitors)?
Norman Wilson
Toronto ON
No longer a subscriber to sheep! magazine
> I don't think anyone knows. Nobody relevant, I believe.
>
> -rob
I understand that Dave Presotto bought that photo at a garage sale for $1. The photo hung in
the Unix Room for years, at one point labeled “Peter Weinberger.”
One day I removed it from its careful mounting and scanned in the photo. It bore the label
“what, no steak?”
The photo was stolen from a wall sometime after I left. The scanned image is at
https://cheswick.com/ches/tmp/whatnosteak.jpeg
ches