On Fri, Feb 5, 2021 at 12:51 PM Dave Horsfall <dave(a)horsfall.org> wrote:
> [...]
> Thanks; I'd heard that ZFS was a compressed file system, so I stopped
> right there (I had lots of experience in recovering from corrupted RK05s,
> and didn't need any more trouble).
That's funny; for me, that's the main reason to use ZFS... What really sets
ZFS apart from everything else is the lack of trouble and its resilience to
failures. We used to have lots and lots of ZFS filesystems at work, and
I've been using ZFS exclusively at home ever since. I have run into a
non-importable ZFS pool (all the drives are there, but it's corrupt and
won't import) only once, and I was able to fix it with zdb.
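For anyone who ends up in that situation, the usual escalation path looks
roughly like this (pool name and device are made up, and this is from
memory, so double-check the man pages):

    # see what pools the system can find, without importing anything
    zpool import
    # force the import if the pool still looks in use by another host
    zpool import -f tank
    # read-only import and txg rewind are the usual last resorts
    zpool import -o readonly=on -F tank
    # zdb lets you poke at the on-disk state of an exported pool
    zdb -e tank
    zdb -l /dev/da0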
ZFS compression is completely optional, and not even on by default. I've
only tried it once and found it cost too much performance on hardware
that wasn't very fast to begin with, but I don't think it affects data
recovery much (the way ZFS stripes data makes traditional data recovery
tools pretty useless anyway).
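If anyone wants to try it: compression is a per-dataset property that can
be flipped at any time and only affects newly written blocks. The dataset
name below is just an example, and lz4 came later - back then the choices
were basically lzjb and gzip:

    zfs get compression tank/data
    zfs set compression=lz4 tank/data
    # after writing some data, see what it actually bought you
    zfs get compressratio tank/data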
I personally don't care about purity of implementation, because everything
is a trade-off. The argument really reminds me of Tanenbaum's criticism of
Linux's monolithic kernel (was Tanenbaum right? Maybe, but who cares:
Linux took over the world and Minix didn't, so from a practical point of
view, Linus was right). The other one it reminds me of is the criticism of
TCP's "blatant layering violation" (vs. OSI). But IMHO the critics were
just jealous of the cool things they couldn't do because they had to
respect the division of labor along those pesky layers.
I remember reading on one of the Sun engineers' blogs (remember when Sun
allowed their engineers to keep blogs about Solaris development? Good
times!) about the heated discussions they had over the ARC and bypassing
the page cache. I don't remember the actual arguments for it, but it was
certainly not a decision that was made out of laziness.
Performance-wise, ZFS is not the best, and if that's all you care about,
there are better options. It needs a lot of tuning just to reach
"acceptable", and it definitely does not play well with other work running
on the same machine (it pretty much assumes it has a dedicated storage
appliance to itself). It has particularly abysmal performance when you do
lots of small random writes and then try to read them back in order, but
if you care about not losing your data, it's second to none.
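To give an idea of the kind of tuning I mean (dataset names are
illustrative, and the right values depend entirely on the workload):

    # match recordsize to the application's I/O pattern, e.g. a database
    zfs set recordsize=16k tank/db
    # skip atime updates, and only cache metadata for this dataset
    zfs set atime=off tank/db
    zfs set primarycache=metadata tank/db

On top of that, the ARC size itself usually has to be capped with a
platform tunable (zfs_arc_max on Linux and illumos) if the machine is
supposed to do anything else with its RAM.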
At $JOB-1 (almost 15 years ago), we spent a few weeks stress testing ZFS.
The setup was 24x4TB SATA drives, divided into two 12-drive raidz2 vdevs
or something like that. All tests were done while the pool was busy
reading and writing checksummed test files at full speed, 1 GB/s or so
(see? Performance was not impressive; we definitely got a lot more out of
that hardware with UFS). What was absolutely stunning was that in all our
tests it never served a single bit of corrupted data: it either had the
data, or it returned an error.
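For the record, the layout was roughly something like this - device names
obviously made up, and I don't remember the exact geometry:

    zpool create tank \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0  c1t5d0 \
               c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 \
        raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0  c2t5d0 \
               c2t6d0 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0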
We tortured the storage in every way we could imagine. Wiggled the cables,
yanked out drives, used dd to overwrite random parts of drives or entire
drives, smashed a drive with a hammer and put it back in, put in drives of
the wrong size, put in known-bad drives, yanked out drives while it was
resilvering, put drives back into different slots, overwrote stuff while
it was resilvering. Unplugged the entire storage, plugged it into another
machine and imported it there, plugged the drives back into the first
machine in a different order. We even did things like "copy a drive onto a
spare with dd, remove three drives, and then substitute the spare drive
for one of the removed ones" (this led to some data loss, because making
the copy was not atomic, but most of the data was recoverable).

And no matter what we did, it just kept going unless the data was simply
not there, and even then it kept serving the files (or parts of files)
that were available and indicated exactly which files were affected by the
loss. And when you put the drives back (or restored the overwritten
parts), it would continue as if nothing had ever happened.
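None of that needed any special tooling, by the way; it's all visible
through the normal commands (pool name illustrative again):

    # per-device error counters, resilver progress, and with -v the
    # names of any files with unrecoverable errors
    zpool status -v tank
    # walk the whole pool and verify every checksum
    zpool scrub tank
    # swap in a replacement for the drive that met the hammer
    zpool replace tank c1t4d0 c3t0d0
    # forget the old error counters once the hardware is fixed
    zpool clear tank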
If you've ever wrestled with a hardware RAID controller, or VxFS/JFS/HPFS,
or mdadm, you know that none of that can be taken for granted, and that
doing any of the stupid things mentioned above would most likely lead to
complete data loss, or to lots of random corrupted data being served with
no way to tell what had been corrupted.
I remember some performance issues with mmap, but I don't remember how we
fixed them. Probably just sucked it up; using ZFS was never about maximum
performance.