On 8/30/2021 9:06 AM, Norman Wilson wrote:
A key point is that the character of the errors they
found suggests it's not just the disks one ought to worry
about, but all the hardware and software (much of the latter
inside disks and storage controllers and the like) in the
storage stack.
I had a pair of Dell MD1000s, full of SATA drives (28 total), with the
SATA/SAS interposers on the back of each drive. Was getting checksum
errors in ZFS on a handful of the drives. Took the time to build a new
array, on a Supermicro backplane, and no more errors with the exact same
drives.
I'm theorizing it was either the interposers, or the SAS
backplane/controllers in the MD1000. Without ZFS, who knows how
swiss-cheesy my data would be.
Not to mention the time I set up a Solaris x86 cluster zoned to a
Compellent and periodically would get one or two checksum errors in ZFS.
This was the only cluster out of a handful that had issues, and only on
that one filesystem. Of course, it was a production PeopleSoft Oracle
database. I guess moving to a VMware Linux guest and XFS just swept the
problem under the rug, but the hardware is not being reused so there's that.
I had heard anecdotes long before (e.g. from Andrew Hume)
suggesting silent data corruption had become prominent
enough to matter, but this paper was the first real study
I came across.
I have used ZFS for my home file server for more than a
decade; presently on an antique version of Solaris, but
I hope to migrate to OpenZFS on a newer OS and hardware.
So far as I can tell ZFS in old Solaris is quite stable
and reliable. As Ted has said, there are philosophical
reasons why some prefer to avoid it, but if you don't
subscribe to those it's a fine answer.
Been running Solaris 11.3 and ZFS for quite a few years now, at home.
Before that, Solaris 10. I set up a home Red Hat 8 server, w/ZoL
(0.8), earlier this year - so far, no issues, with 40+TB online. I have
various test servers with ZoL 2.0 on them, too.
I have so much online data that I use as the "live copy" - going back to
early-'80s copies of my TOPS-10 stuff. Even though I have copious
amounts of LTO tape copies of this data, I won't go back to the "out of
sight, out of mind" mentality.
Trying to get customers to buy into that idea is another story.
art k.
PS: I refuse to use a workstation that doesn't use ECC RAM, either. I
like Swiss cheese on a sandwich. I don't like my (or my customers') data
emulating it.