UsingZFS

Using ZFS instead of UFS on your fs server node

Important Note:

   Prior to FreeBSD 9.x, there is a huge race condition in mountd: between
   the time old export info is removed from the kernel and the time new info
   is reloaded, client operations will fail with "permission denied". The
   window opens wider as you export more filesystems.

   So, if you have lots of filesystems (i.e., users and projects), do NOT
   attempt to use ZFS as described here with pre-9.x FreeBSD, as the
   vulnerable window is big enough to drive a truck through. Starting with
   9.x, mountd at least suspends the NFS server while it is doing its thing,
   so clients just block instead of fail.


A. Setting up ZFS users/proj/groups/share.

1. Make a zpool. I am just using the second disk of a d710 right now.
   We would want some sort of redundant config on a real machine, though
   probably not on a VM where the underlying storage is already RAID.
   If it is a VM, you might want to export raw disks to the VM for ZFS
   to work with. Otherwise you risk offending the gods of ZFS who say
   that ONLY ZFS knows how to properly make a redundant volume. We don't.
   We just exported a logical slice of a HW RAID6 for ZFS to use.
   Awaiting lightning bolts...

   I did not use any special features here:

   zpool create z /dev/da1
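
   On a real machine with dedicated disks, a redundant config would look
   something like the following (device names are illustrative):

   zpool create z mirror /dev/da1 /dev/da2
   # or, with three or more disks, single-parity raidz:
   zpool create z raidz /dev/da1 /dev/da2 /dev/da3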

2. Create "parent" ZFSes (datasets) for /users, /proj, /groups, and /share.
   I did this so that the children can inherit properties from them, like the
   mountpoint. We can also use them to impose a quota on the total space used
   by /users or /proj, and to perform some operations recursively on all the
   children, like snapshot.

   zfs create -o setuid=off -o mountpoint=/users -o quota=50G z/users
   zfs create -o setuid=off -o mountpoint=/proj -o quota=200G z/proj
   zfs create -o setuid=off -o mountpoint=/groups -o quota=50G z/groups
   zfs create -o setuid=off -o mountpoint=/share -o quota=50G z/share

   If we are paranoid, we could instead set setuid=off on each child
   filesystem (see the sketch below), so that an accidental change to the
   property on the root dataset would not silently propagate everywhere.
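
   A minimal sketch of that, assuming the per-user and per-project
   filesystems created in steps 3 and 4 below:

   zfs set setuid=off z/users/mike
   zfs set setuid=off z/proj/testbed
   ...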

   As mentioned, the quota here is not necessary; it only serves to prevent
   one of the sub-pieces from eating up the whole zpool. For that purpose it
   might actually be better to use reservation=NNN, which would effectively
   statically partition the space of the zpool among the four datasets (see
   the sketch below).
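
   For example, reusing the sizes above (purely illustrative):

   zfs set reservation=50G z/users
   zfs set reservation=200G z/proj
   zfs set reservation=50G z/groups
   zfs set reservation=50G z/share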

   We can also use these root datasets to enable default exporting of them
   and their descendant filesystems to boss, a la:

   zfs set sharenfs='boss -maproot=root' z/users
   ...

   however, as of early 2015 we no longer do that. This was prompted
   by the fact that we have over 7500 ZFS filesystems and exporting
   each was causing mountd to take minutes to reload. Instead we have
   now modified exports_setup to export only the "active" users and
   projects to boss. This is a stopgap until we eliminate the need for
   boss to have ops FSes NFS mounted.

   Anyway, you instead need to disable use of ZFS's export table with:

   zfs set sharenfs=off z/users
   zfs set sharenfs=off z/proj
   zfs set sharenfs=off z/groups
   zfs set sharenfs=off z/share

   Note that ZFS has a sharesmb attribute for samba exports, but we
   do NOT use this (i.e., leave it set to off). Samba exports to nodes
   are still handled via exports_setup.

3. Filesystems for users. I created a zfs per-user with quota,
   along the lines of:

   zfs create -o quota=1G z/users/mike

   We need to decide on a mechanism for determining the quota value to use.
   Note that the mountpoint is automatically derived from the parent.
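
   You can sanity-check the inherited mountpoint and the quota with:

   zfs list -o name,quota,mountpoint z/users/mike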

4. Filesystems for projects. Ditto:

   zfs create -o quota=100G z/proj/testbed

5. Filesystems for groups. I went with just a filesystem per *project*,
   not per subgroup of the project:

   zfs create -o quota=10G z/groups/testbed

   As of early 2015, this is the officially supported model.

Notes:

 * We could set a userquota for root on every FS, a la userquota@root=0,
   if we really wanted to stop root writes. We currently do not do this.
 * Use snapdir=visible to make snapshots visible in .zfs? This would allow
   users to look around and get things from previous snapshots, e.g., if
   we used snapshots for backup. We currently do not do this.
 * We may want to "zfs allow" the ability to snapshot (and rollback?) on
   the per-user and per-project filesystems so users could perform their
   own backups. We currently do not do this. Example commands for all
   three of these appear below.
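
   These would look something like the following; the dataset and user names
   are illustrative, and again we do not currently do any of these:

   zfs set userquota@root=0 z/users/mike
   zfs set snapdir=visible z/users
   zfs allow -u mike snapshot,rollback,mount z/users/mike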

B1. Combining with NFSv4.

[ 09/2014 update: after using the setup below for a while, I noticed that
  NFS v4 on FreeBSD will periodically hang for seconds at a time (when
  accessing ops from boss). The hangs are frequent enough to be disruptive
  and we are no longer using V4. Instead we are using either AMD (FreeBSD 10.0)
  or autofs (10.1) on boss to more-or-less transparently handle the mounts. ]

Our immediate interest in NFS v4 is just for its "inherited mounts"
mechanism so that clients do not have to explicitly mount every user/proj
filesystem. This is really all about boss mounts, since regular nodes
always explicitly mount everything already and don't necessarily have
NFS v4 anyway.

1. On ops.
   Need to add to /etc/exports.head:

   V4: / -sec=sys boss

   and get rid of exports of /{users,proj,groups,share} to boss. Leave
   the lines that export /usr/testbed and /var and the /share RO export
   to nodes. Then add to /etc/rc.conf:

   # NFS v4
   nfsv4_server_enable="YES"
   nfsuserd_enable="YES"
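
   For reference, the resulting fragment of /etc/exports.head might look
   roughly like this; the non-V4 lines and their options here are
   illustrative, not our exact configuration:

   V4: / -sec=sys boss
   /var -maproot=root boss
   /usr/testbed -maproot=root boss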

2. On boss.

   Need to add to /etc/rc.conf:

   # NFS v4
   nfsuserd_enable="YES"
   nfscbd_enable="YES"

   and fixup /etc/fstab to have:

   ops:/users    /users    nfs    rw,nfsv4,nosuid,late    0    0
   ops:/proj     /proj     nfs    rw,nfsv4,nosuid,late    0    0
   ops:/groups   /groups   nfs    rw,nfsv4,nosuid,late    0    0
   ops:/share    /share    nfs    rw,nfsv4,nosuid,late    0    0

   in place of the usual mounts for those.
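
   To test one of these mounts by hand before committing to fstab, something
   like this on boss should work:

   mount -t nfs -o nfsv4,nosuid ops:/users /users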

B2. Combining with AMD.

   On boss, create /etc/amd.<FS> files for FS=users,proj,groups containing:

   /defaults opts:=rw,nosuid,vers=3,proto=tcp
   * type:=nfs;rhost:=ops;rfs:=/<FS>/${key}

   replacing <FS> as appropriate (see the concrete example below). Note that
   we are using TCP-based mounts here; you don't have to. It may be possible
   to do this all in one amd.* file, but we didn't try. In /etc/rc.conf add:

   amd_enable="YES"
   amd_flags="-k amd64 -x all -d <your-domain> -l syslog /users /etc/amd.users \
/proj /etc/amd.proj /groups /etc/amd.groups"

   and (first time only) do:

   /etc/rc.d/amd start
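
   As a concrete example of the per-FS files above, /etc/amd.users would
   contain:

   /defaults opts:=rw,nosuid,vers=3,proto=tcp
   * type:=nfs;rhost:=ops;rfs:=/users/${key}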

B3. Combining with autofs.

   An autofs implementation is new in FreeBSD 10.1 (and maybe 9.3?).
   I gather it is like other autofs implementations. Here, on boss you
   need to create /etc/auto_master with:

   /users   auto_users   -nobrowse,nosuid,vers=3,tcp
   /proj    auto_proj    -nobrowse,nosuid,vers=3,tcp
   /groups  auto_groups  -nobrowse,nosuid,vers=3,tcp

   Again, we are using TCP mounts. And then create /etc/auto_<FS> files with:

   *  fs:/<FS>/&

   replacing <FS> as appropriate (concrete example below). Then enable it in
   /etc/rc.conf:

   autofs_enable="YES"

   and do the one-time start:

   /etc/rc.d/autofs start
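
   As a concrete example, /etc/auto_users would contain the single line:

   *  fs:/users/&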


C. Growing the (non-root) zpool.

   [ WARNING: I only did this once and on an unimportant elabinelab.
     Think twice (or three or four times) before doing this to your
     production system. Don't try it without a backup. YOU HAVE BEEN
     WARNED! ]

   If your boss and ops are Xen VMs using logical disks exported from
   the host, then you can grow the zpool and logical disk. At least I
   have done so once!

1. Shut down the zpool hosting users/proj/groups.

   On ops:

   zpool export z

   You will of course need to drop to single-user mode first so that nothing
   is using the filesystems.

2. Grow the logical volume containing the zpool.

   On the vhost you will need to detach the logical disk from the VM:

   sudo xl block-list   # to determine which device you need to detach
   sudo xl block-detach ops <vdev>

   Now resize the LV:

   sudo lvresize -L <newsize> <VG>/<LV>

   Note: first I used "lvextend" to do this, but that resulted in a
   corrupted zpool when I tried to re-import. I do not know what the
   difference between lvresize and lvextend is, but use lvresize.

   Now reattach it to the VM:

   sudo xl block-attach ops <diskspec>

   where <diskspec> is the disk specification from your xm.conf, e.g.:

   sudo xl block-attach ops 'phy:/dev/xen-vg/ops.disk4,sde,w'

   On ops, reimport the pool and make sure it resizes:

   sudo zpool import z
   sudo zpool set autoexpand=on z
   ## I did this "offline" first but it seemed to fail because there
   ## was no redundancy in the pool, so I don't think that command
   ## is necessary, but if this "online" fails, try that.
   #sudo zpool offline z da4
   sudo zpool online -e z da4

   where "da4" is the local disk name. At least this worked for me and
   yielded a larger, uncorrupted disk!
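
   To confirm that the pool actually picked up the new space:

   zpool list z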

D. Hacking mountd.

   As mentioned, mountd becomes a serious bottleneck when you start
   exporting thousands of filesystems. Part of the overhead is unavoidable
   (you just need to push a lot of info into the kernel), but mountd is
   particularly atrocious at doing it. On the mothership, I have hacked
   mountd in two ways. The first is to split the parsing of the exports file
   from the step that loads the new info into the kernel. Parsing takes a
   surprisingly long time due to resolution of host names and net groups
   and ???. By splitting parsing off, the time during which NFS service has
   to be suspended is minimized. The second is to perform incremental updates
   of the kernel info rather than just throwing out the old info and
   inserting the new info. Again, this reduces the total time with
   service suspended.

   For us, the difference is significant. Even with changes to exports_setup
   to minimize the FSes we export to boss, a full export still takes over 15
   seconds, involving over 13,000 mount calls to remove and add export info.
   A typical incremental update caused by an experiment swapin or swapout
   still takes over 9 seconds, but service is only suspended for about 1.4
   seconds of that due to about 160 mount calls. There is still more that
   could be done, in particular tracking down the excessive parsing time
   (over 8 seconds in the above case). But I spent a week of my life that
   I will never get back hacking mountd, and there are more important things
   that need to be done.

   Note that I have only tested mountd against the sort of exports file syntax
   we use, so I may well have broken other things (in particular, I made little
   effort to maintain NFSv4 support--which was already a horrible kludge to
   the code). For this reason, I won't be submitting these changes back to
   FreeBSD. If you really want to use this code, a patch is in the Emulab
   source directory as patches/FreeBSD-10.1-mountd.patch. It would be easy
   enough to adapt this patch to other versions of FreeBSD as mountd has
   changed little in the last 20 years.
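
   As a rough sketch (paths here are illustrative), applying it to a stock
   10.1 source tree would look something like:

   # the patch may need to be applied from /usr/src instead, depending on
   # how its paths are rooted
   cd /usr/src/usr.sbin/mountd
   patch < /path/to/emulab-source/patches/FreeBSD-10.1-mountd.patch
   make && make install
   service mountd restart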