[This is another essay-length post about O/Ses and ZFS...]
Arthur Krewat <krewat(a)kilonet.net> asks on Thu, 30 May 2019 20:42:33
-0400, in response to my TUHS list posting about running a large
computing facility without user disk quotas and our experiences with
ZFS:

> I have yet to play with Linux and ZFS but would appreciate to
> hear your experiences with it.
First, ZFS on the Solaris family (including DilOS, Dyson, Hipster,
Illumian, OmniOS, Omnitribblix, OpenIndiana, Tribblix, Unleashed, and
XStreamOS), on the FreeBSD family (including ClonOS, FreeNAS, GhostBSD,
HardenedBSD, MidnightBSD, PCBSD, Trident, and TrueOS), and on
GNU/Linux (1000+ distributions, due to theological differences) offers
important data-safety features and ease of management.
There are lots of details about ZFS that you can find in the slides of
a talk that we have given several times:
http://www.math.utah.edu/~beebe/talks/2017/zfs/zfs.pdf
The slides at the end of that file contain pointers to ZFS resources,
including recent books.
Some of the key ZFS features are (a brief command sketch follows the list):
* all disks form a dynamic shared pool from which space can be
drawn for datasets, on top of which filesystems can be
created;
* the pool can exploit data redundancy via various RAID Zn
choices to survive loss of individual disks, and optionally,
provide hot spares shared across the pool, and available to
all datasets;
* hardware RAID controllers are unneeded, and discouraged ---
a JBOD (just a bunch of disks) array is quite satisfactory;
* all metadata, and all file data blocks, have checksums that
are replicated elsewhere in the pool, and checked on EVERY
read and write, allowing automatic silent recovery (via data
redundancy) from transient or permanent errors in disk
blocks --- ZFS is self-healing;
* ZFS filesystems can have unlimited numbers of snapshots;
* snapshots are extremely fast, typically less than one
second, even in multi-terabyte filesystems;
* snapshots are read-only, and thus immune to ransomware
attacks;
* ZFS send and receive operations allow propagation of copies
of filesystems by transferring only data blocks that have
changed since the last send operation;
* the ZFS copy-on-write policy means that in-use blocks are
never changed, and that block updates are guaranteed to be
atomic;
* quotas can optionally be enabled on datasets, and grown as
needed (quota shrink is not yet possible, but is in ZFS
development plans);
* ZFS optionally supports encryption, data compression, block
deduplication, and n-way disk replication;
* Unlike traditional fsck, which requires disks to be offline
during the checks, ZFS scrub operations can be run (usually
by cron jobs, and at lower priority) to go through datasets
to verify data integrity and filesystem sanity while normal
services continue.
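As a rough illustration of several of those features, here is a
generic command sketch; the pool, disk, dataset, and host names are
invented for this example, and exact options vary by platform and ZFS
version:

  # names below are invented for illustration only
  # create a raidz2 pool from whole disks, with a shared hot spare
  zpool create tank raidz2 da1 da2 da3 da4 da5 da6 spare da7

  # create a dataset with an initial quota and compression enabled
  zfs create -o quota=500G -o compression=lz4 tank/home

  # take an (essentially instantaneous) read-only snapshot
  zfs snapshot tank/home@auto-2019-05-31

  # send only the blocks changed since the previous snapshot to a
  # copy of the filesystem on another host
  zfs send -i tank/home@auto-2019-05-30 tank/home@auto-2019-05-31 |
      ssh backuphost zfs receive -F tank/home

  # verify checksums of everything in the pool while it stays in service
  zpool scrub tank
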
ZFS likes to cache metadata, and active data blocks, in memory. Most
of our VMs that have other filesystems, like EXT{2,3,4}, FFS, JFS,
MFS, ReiserFS, UFS, and XFS, run quite happily with 1GB of DRAM. The
ZFS, DragonFly BSD Hammer, and BTRFS ones are happier with 2GB to 4GB
of DRAM. Our central fileservers have 256GB to 768GB of DRAM.
The major drawback of copy-on-write and snapshots is that once a
snapshot has been taken, a filesystem-full condition cannot be
ameliorated by removing a few large files. Instead, you have to
either increase the dataset quota (our normal practice) or free older
snapshots.
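For example, the usual response to a full dataset is roughly the
following; the dataset and snapshot names here are invented for
illustration:

  # grow the dataset quota (quotas can be raised, but not yet shrunk)
  zfs set quota=600G tank/home/2001

  # or instead, release the space held by an old snapshot
  zfs destroy tank/home/2001@auto-2019-04-01
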
Our view is that the benefits of snapshots for recovery of earlier
file versions far outweigh that one drawback: I myself did such a
recovery yesterday when I accidentally clobbered a critical file full
of digital signature keys.
On the Solaris and FreeBSD families, snapshots are visible to users as
read-only filesystems, like this (for ftp://ftp.math.utah.edu/pub/texlive
and http://www.math.utah.edu/pub/texlive):

  % df /u/ftp/pub/texlive
  Filesystem             1K-blocks      Used Available Use% Mounted on
  tank:/export/home/2001 518120448 410762240 107358208  80% /home/2001

  % ls /home/2001/.zfs/snapshot
  AMANDA           auto-2019-05-21  auto-2019-05-25  auto-2019-05-29
  auto-2019-05-18  auto-2019-05-22  auto-2019-05-26  auto-2019-05-30
  auto-2019-05-19  auto-2019-05-23  auto-2019-05-27  auto-2019-05-31
  auto-2019-05-20  auto-2019-05-24  auto-2019-05-28

  % ls /home/2001/.zfs/snapshot/auto-2019-05-21/ftp/pub/texlive
  Contents  Images  Source  historic  protext  tlcritical  tldump  tlnet  tlpretest
That is, you first use the df command to find the source of the
current mount point, then use ls to examine the contents of
.zfs/snapshot under that source, and finally follow your pathname
downward to locate a file that you want to recover, or to compare
with the current copy or with another snapshot copy.
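In the simplest case, the recovery itself is just a copy out of the
snapshot tree; the file name below is invented for illustration:

  # compare the damaged file with the snapshot copy, then restore it
  # (somefile is a placeholder name)
  diff /u/ftp/pub/texlive/somefile \
       /home/2001/.zfs/snapshot/auto-2019-05-21/ftp/pub/texlive/somefile
  cp -p /home/2001/.zfs/snapshot/auto-2019-05-21/ftp/pub/texlive/somefile \
        /u/ftp/pub/texlive/somefile
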
On Network Appliance systems with the WAFL filesystem design (see
https://en.wikipedia.org/wiki/Write_Anywhere_File_Layout
), snapshots are instead mapped to hidden directories inside each
directory, which is more convenient for human users, and is a feature
that we would really like to see on ZFS.
A nuisance for us is that the current ZFS implementation on CentOS 7
(a subset of the pay-for-service Red Hat Enterprise Linux 7) does not
show any files under the .zfs/snapshot/auto-YYYY-MM-DD directories,
except on the fileserver itself.
During the 15+ years that we ran Solaris ZFS, our users could
themselves recover previous file versions by following the
instructions at
http://www.math.utah.edu/faq/files/files.html#FAQ-8
Since our move to a GNU/Linux fileserver, they no longer can; instead,
they have to contact systems management to access such files.
We sincerely hope that CentOS 8 will resolve that serious deficiency:
see
http://www.math.utah.edu/pub/texlive-utah/README.html#rhel-8
for comments on the production of that O/S release from the recent
major new Red Hat EL8 release.
We have a large machine-room UPS and an outside diesel generator, so
our physical servers are immune to power outages and power surges, the
latter being a common problem in Utah during summer lightning storms.
Thus, unplanned fileserver outages should never happen.
A second issue for us is that on Solaris and FreeBSD, we have never
seen a fileserver crash due to ZFS issues, and on Solaris, our servers
have sometimes been up for one to three years before we took them down
for software updates. However, with ZFS on CentOS 7, we have seen 13
unexplained reboots in the last year. Each has happened late at
night or in the early morning, while backups to our tape robot and
ZFS send/receive operations to a remote datacenter were in progress.
The crash times suggest to us that heavy ZFS activity is exposing a
kernel or Linux ZFS bug. We hope that CentOS 8 will resolve that
issue.
We have ZFS on about 70 physical and virtual machines, and GNU/Linux
BTRFS on about 30 systems. With ZFS, freeing a snapshot moves its
blocks to the free list within seconds. With BTRFS, freeing snapshots
often takes tens of minutes, and sometimes, hours, before space
recovery is complete. That can be aggravating when it stops your work
on that system.
By contrast, creating snapshots on both BTRFS and ZFS is fast.
However, snapshots appear to consume far less space on ZFS than on
BTRFS. We have VMs and
physical machines with ZFS that have 300 to 1000 daily snapshots with
little noticeable reduction in free space, whereas those with BTRFS
seem to lose about a gigabyte a day. My home TrueOS system has
sufficient space for about 25 years of ZFS dailies. Consequently, I
run nightly reports of free space on all of our systems, and manually
intervene on the BTRFS ones when space hits a critical level (I try to
keep 10GB free).
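The ZFS part of those nightly reports needs nothing more than the
standard reporting commands, roughly like this (the pool name is
invented for illustration):

  # per-pool capacity summary (pool name is a placeholder)
  zpool list -o name,size,allocated,free,capacity tank

  # per-dataset view, showing how much space the snapshots hold
  zfs list -r -o name,used,usedbysnapshots,avail tank
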
On both ZFS and BTRFS, packages are available to trim old snapshots,
and we run the ZFS trimmer via cron jobs on our main fileservers.
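The trimming itself is simple; a cron-driven sketch, assuming a
hypothetical local script that keeps only recent automatic snapshots,
might look like this:

  # crontab entry: nightly at 03:15, run a (hypothetical) local trim script
  15 3 * * * /usr/local/sbin/zfs-trim-snapshots tank/home 180

  # the core operations inside such a script are simply
  zfs list -H -r -t snapshot -o name -s creation tank/home
  zfs destroy tank/home@auto-2018-12-01
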
In the GNU/Linux world, however, only openSUSE comes by default with a
cron-enabled BTRFS snapshot trimmer, so intervention is unnecessary on
that O/S flavor. I have never installed snapshot trimmer packages on
any of our other VMs, because it just means more management work to
deal with variants in trimmer packages, configuration files, and cron
jobs.
Teams of ZFS developers from FreeBSD and GNU/Linux are working on
merging divergent features back into a common OpenZFS code base that
all O/Ses that support ZFS can use; that merger is expected to happen
within the next few months. ZFS has been ported by third parties to
Apple macOS and Microsoft Windows, so it has the potential of becoming
a universal filesystem available on all common desktop environments.
Then we could use ZFS send/receive instead of .iso, .dmg, and .img
files to copy entire filesystems between different O/Ses.
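If that comes to pass, moving a filesystem to another machine could be
as simple as writing a snapshot stream to a portable file and
replaying it on the destination; the names below are invented for
illustration:

  # write a replication stream, including snapshots, to a file
  zfs send -R tank/projects@2019-05-31 > projects.zfs

  # on the destination machine, recreate the filesystem from that file
  zfs receive pool/projects < projects.zfs
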
-------------------------------------------------------------------------------
- Nelson H. F. Beebe Tel: +1 801 581 5254 -
- University of Utah FAX: +1 801 581 4148 -
- Department of Mathematics, 110 LCB Internet e-mail: beebe(a)math.utah.edu -
- 155 S 1400 E RM 233 beebe(a)acm.org beebe(a)computer.org -
- Salt Lake City, UT 84112-0090, USA URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------