UsingZFS
Important Note:
Before FreeBSD 9.x, there is a huge race condition in mountd between the
time old export info is removed from the kernel and the time new info is reloaded.
During this time, client operations will fail with "permission denied".
The window opens wider as you export more filesystems.
So, if you have lots of filesystems (i.e., users and projects), do NOT
attempt to use ZFS as described here with pre-9.x FreeBSD as the vulnerable
window is big enough to drive a truck through. Starting with 9.x, mountd
will at least suspend the NFS server while it is doing its thing, which
causes clients to simply block instead of fail.
A. Setting up ZFS users/proj/groups/share.
1. Make a zpool. I am just using the second disk of a d710 right now.
We would want some sort of redundant config on a real machine, though
probably not on a VM where the underlying storage is already RAID.
If it is a VM, you might want to export raw disks to the VM for ZFS
to work with. Otherwise you risk offending the gods of ZFS who say
that ONLY ZFS knows how to properly make a redundant volume. We don't.
We just exported a logical slice of a HW RAID6 for ZFS to use.
Awaiting lightning bolts...
I did not use any special features here:
zpool create z /dev/da1
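For a real machine with raw disks, the redundant config mentioned above
might look like one of these (da1/da2/da3 are just example device names):

```shell
# Two-way mirror: survives the loss of either disk.
zpool create z mirror /dev/da1 /dev/da2

# Or single-parity raidz across three disks: more usable space,
# survives the loss of any one disk.
zpool create z raidz /dev/da1 /dev/da2 /dev/da3
```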
2. Create "parent" ZFSes (datasets) for /users, /proj, /groups, and /share.
I did this so that we can inherit properties from it, like the mountpoint.
We can also use it to impose a quota for the total space used by /users
or /proj. And it also allows us to perform some operations recursively on
all the children, like snapshot.
zfs create -o setuid=off -o mountpoint=/users -o quota=50G z/users
zfs create -o setuid=off -o mountpoint=/proj -o quota=200G z/proj
zfs create -o setuid=off -o mountpoint=/groups -o quota=50G z/groups
zfs create -o setuid=off -o mountpoint=/share -o quota=50G z/share
If we are paranoid, we should set setuid=off per-filesystem instead, just
to make sure that if the setting accidentally got flipped back on at the
root, it would not propagate everywhere via inheritance.
As mentioned, the quota here is not necessary; it only serves to prevent
one of the sub pieces from eating up the whole zpool. Actually, it might
be better in that case to use reservation=NNN, which would effectively
statically partition the space of the zpool among the four datasets.
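If we wanted that static-partition behavior, it would look roughly like
this (sizes are just illustrative, mirroring the quotas above):

```shell
# Unlike quota, reservation guarantees each dataset its space up
# front, so no sibling can consume it first.
zfs set reservation=50G z/users
zfs set reservation=200G z/proj
zfs set reservation=50G z/groups
zfs set reservation=50G z/share
```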
We can also use these root datasets to enable the default exporting
of them and the descendant filesystems to boss, ala:
zfs set sharenfs='boss -maproot=root' z/users
...
however, as of early 2015 we no longer do that. This was prompted
by the fact that we have over 7500 ZFS filesystems and exporting
each was causing mountd to take minutes to reload. Instead we have
now modified exports_setup to export only the "active" users and
projects to boss. This is a stopgap until we eliminate the need for
boss to have ops FSes NFS mounted.
Anyway, you instead need to disable use of ZFS's export table with:
zfs set sharenfs=off z/users
zfs set sharenfs=off z/proj
zfs set sharenfs=off z/groups
zfs set sharenfs=off z/share
Note that ZFS has a sharesmb attribute for samba exports, but we
do NOT use this (i.e., leave it set to off). Samba exports to nodes
are still handled via exports_setup.
3. Filesystems for users. I created a zfs per-user with quota,
along the lines of:
zfs create -o quota=1G z/users/mike
We need to decide on a mechanism for determining the quota value to use.
Note that the mountpoint is automatically derived from the parent.
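One plausible way to script the per-user creation (the user list and
default quota here are made-up assumptions, not our actual mechanism):

```shell
# Create a dataset per user with a default quota; the mountpoint
# /users/<user> is inherited from z/users automatically.
DEFQUOTA=1G
for u in mike alice bob; do          # example users only
    zfs create -o quota=$DEFQUOTA z/users/$u
    chown $u /users/$u               # assumes the account already exists
done
```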
4. Filesystems for projects. Ditto:
zfs create -o quota=100G z/proj/testbed
5. Filesystems for groups. I went with just a filesystem per *project*,
not per subgroup of the project:
zfs create -o quota=10G z/groups/testbed
As of early 2015, this is the officially supported model.
Notes:
* We could set a userquota for root on every FS, ala: userquota@root=0
if we really wanted to stop root writes. We currently do not do this.
* Use snapdir=visible to make snapshots visible in .zfs? This would allow
users to look around and get things from previous snapshots, e.g., if
we used snapshots for backup. We currently do not do this.
* We may want to "zfs allow" the ability to snapshot (and rollback?) on
the per-user and per-project filesystems so they could perform their
own backups. We currently do not do this.
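If we ever did want any of the above, the commands would look roughly
like this (untested here; z/users/mike is just an example dataset):

```shell
# Stop root writes by giving root a zero user quota:
zfs set userquota@root=0 z/users/mike

# Expose snapshots under /users/mike/.zfs/snapshot:
zfs set snapdir=visible z/users/mike

# Delegate snapshot/rollback to the owner; mount is included
# since some operations need it to take effect:
zfs allow -u mike snapshot,rollback,mount z/users/mike
```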
B1. Combining with NFSv4.
[ 09/2014 update: after using the setup below for a while, I noticed that
NFS v4 on FreeBSD will hang for seconds at a time periodically (when
accessing ops from boss). These are frequent enough that it is disruptive
and we are no longer using V4. Instead we are using either AMD (FreeBSD 10.0)
or autofs (10.1) on boss to more-or-less transparently handle the mounts. ]
Our immediate interest in NFS v4 is just for its "inherited mounts"
mechanism so that clients do not have to explicitly mount every user/proj
filesystem. This is really all about boss mounts, since regular nodes
always explicitly mount everything already and don't necessarily have
NFS v4 anyway.
1. On ops.
Need to add to /etc/exports.head:
V4: / -sec=sys boss
and get rid of exports of /{users,proj,groups,share} to boss. Leave
the lines that export /usr/testbed and /var and the /share RO export
to nodes. Then add to /etc/rc.conf:
# NFS v4
nfsv4_server_enable="YES"
nfsuserd_enable="YES"
2. On boss.
Need to add to /etc/rc.conf:
# NFS v4
nfsuserd_enable="YES"
nfscbd_enable="YES"
and fixup /etc/fstab to have:
ops:/users /users nfs rw,nfsv4,nosuid,late 0 0
ops:/proj /proj nfs rw,nfsv4,nosuid,late 0 0
ops:/groups /groups nfs rw,nfsv4,nosuid,late 0 0
ops:/share /share nfs rw,nfsv4,nosuid,late 0 0
in place of the usual mounts for those.
B2. Combining with AMD.
On boss, create /etc/amd.<FS> files for FS=users,proj,groups containing:
/defaults opts:=rw,nosuid,vers=3,proto=tcp
* type:=nfs;rhost:=ops;rfs:=/<FS>/${key}
replacing <FS> as appropriate. Note that we are using TCP-based
mounts here; you don't have to. Maybe you can do this all in one
amd.* file; we didn't try. In /etc/rc.conf add:
amd_enable="YES"
amd_flags="-k amd64 -x all -d <your-domain> -l syslog /users /etc/amd.users \
/proj /etc/amd.proj /groups /etc/amd.groups"
and (first time only) do:
/etc/rc.d/amd start
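You can sanity-check that amd is serving the maps with amq(8),
e.g. (the "mike" key is just an example):

```shell
# Show all automount points amd is managing and their mount state.
amq

# Query one key; referencing the path also triggers the mount.
amq /users/mike
```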
B3. Combining with autofs.
An autofs implementation is new in FreeBSD 10.1 (and maybe 9.3?).
I gather it is like other autofs implementations. Here, on boss you
need to create /etc/auto_master with:
/users auto_users -nobrowse,nosuid,vers=3,tcp
/proj auto_proj -nobrowse,nosuid,vers=3,tcp
/groups auto_groups -nobrowse,nosuid,vers=3,tcp
Again, we are using TCP mounts. And then create /etc/auto_<FS> files with:
* fs:/<FS>/&
replacing <FS> as appropriate. Then enable it in /etc/rc.conf:
autofs_enable="YES"
and do the one-time start:
/etc/rc.d/autofs start
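To sanity-check the maps without waiting for a mount to trigger,
FreeBSD's automount(8) can dump what it parsed:

```shell
# Print the fully parsed maps; a bad auto_master or auto_<FS>
# line shows up here before it bites you at mount time.
automount -L

# Then force an actual mount by touching a path ("mike" is just
# an example key):
ls /users/mike
```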
C. Growing the (non-root) zpool.
[ WARNING: I only did this once and on an unimportant elabinelab.
Think twice (or three or four times) before doing this to your
production system. Don't try it without a backup. YOU HAVE BEEN
WARNED! ]
If your boss and ops are Xen VMs using logical disks exported from
the host, then you can grow the zpool and logical disk. At least I
have done so once!
1. Shut down the zpool hosting users/proj/groups.
On ops:
zpool export z
You will of course need to drop to single-user mode first.
2. Grow the logical volume containing the zpool.
On the vhost you will need to detach the logical disk from the VM:
sudo xl block-list # to determine which device you need to detach
sudo xl block-detach ops <vdev>
Now resize the LV:
sudo lvresize -L <newsize> <VG>/<LV>
Note: first I used "lvextend" to do this, but that resulted in a
corrupted zpool when I tried to re-import. I do not know what the
difference between lvresize and lvextend is, but use lvresize.
Now reattach it to the VM:
sudo xl block-attach ops <diskspec>
where <diskspec> is the disk specification from your xm.conf, e.g.:
sudo xl block-attach ops 'phy:/dev/xen-vg/ops.disk4,sde,w'
On ops, reimport the pool and make sure it resizes:
sudo zpool import z
sudo zpool set autoexpand=on z
## I did this "offline" first but it seemed to fail because there
## was no redundancy in the pool, so I don't think that command
## is necessary, but if this "online" fails, try that.
#sudo zpool offline z da4
sudo zpool online -e z da4
where "da4" is the local disk name. At least this worked for me and
yielded a larger, uncorrupted disk!
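To confirm the pool actually grew and is still healthy, something like:

```shell
# SIZE should reflect the new capacity after "online -e";
# EXPANDSZ shows any space the pool could still claim.
zpool list z

# Make sure the pool is ONLINE with no errors after the reimport.
zpool status z

# Per-dataset view of the newly available space.
zfs list -r z
```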
D. Hacking mountd.
As mentioned, mountd becomes a serious bottleneck when you start
exporting thousands of filesystems. Part of the overhead is unavoidable,
you just need to push a lot of info into the kernel, but mountd is
particularly atrocious at doing it. On the mothership, I have hacked
mountd in two ways. The first is to split the parsing of the exports file
from the step that loads the new info into the kernel. Parsing takes a
surprisingly long time due to resolution of host names and netgroups
and ???. By splitting parsing off, the time during which NFS service has
to be suspended is minimized. The second is to perform incremental updates
of the kernel info rather than just throwing out the old info and
inserting the new info. Again, this reduces the total time with
service suspended.
For us, the difference is significant. Even with changes to exports_setup
to minimize the FSes we export to boss, a full export still takes over 15
seconds, involving over 13,000 mount calls to remove and add export info.
A typical incremental update caused by an experiment swapin or swapout
still takes over 9 seconds, but service is only suspended for about 1.4
seconds of that due to about 160 mount calls. There is still more that
could be done, in particular tracking down the excessive parsing time
(over 8 seconds in the above case). But I spent a week of my life that
I will never get back hacking mountd, and there are more important things
that need to be done.
Note that I have only tested mountd against the sort of exports file syntax
we use, I may well have broken other things (in particular, I made little
effort to maintain NFSv4 support--which was already a horrible kludge to
the code). For this reason, I won't be submitting these changes back to
FreeBSD. If you really want to use this code, a patch is in the Emulab
source directory as patches/FreeBSD-10.1-mountd.patch. It would be easy
enough to adapt this patch to other versions of FreeBSD as mountd has
changed little in the last 20 years.