Vnodesetup

Vnode Setup

How FreeBSD Jail (and Linux Vserver) based vnodes are set up (and torn down).

1. bootvnodes [-b] [-h] [-k]

Actions are: -b to boot all vnodes, -h to halt them but save their disk setup, -k to kill them, removing all the virtual disks. Halt is typically used when the physical host is rebooting. In fact, the kill option is only used for debugging. Normally when an experiment is being torn down, we don't bother to kill the vnodes as they will go away when the physical node does.

If no action is specified, it is a "cold" boot. In this case we query boss to get the list of vnodes. Then we do any physnode actions (e.g., make a big FS, create vn devices) and then call vnodesetup for each vnode.

If an action is specified, we just look in /var/emulab/jails for all pcvm* subdirectories and call vnodesetup for each of them.
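The directory scan described above can be sketched as follows. This is only an illustration in Python, not the actual bootvnodes code; the function name find_vnode_ids is made up here, and the only detail taken from the text is that vnodes live in pcvm* subdirectories of /var/emulab/jails.

```python
from pathlib import Path

def find_vnode_ids(jail_root="/var/emulab/jails"):
    """Return the sorted names of all pcvm* subdirectories under jail_root.

    Hypothetical helper: plain files that happen to match pcvm* are skipped,
    and a missing jail_root simply yields an empty list.
    """
    root = Path(jail_root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.glob("pcvm*") if p.is_dir())
```

Each returned name would then be handed to vnodesetup along with the requested action.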

bootvnodes performs all actions in the background (i.e., it returns immediately) unless -f is given, in which case it runs in the foreground and exits only after all calls to vnodesetup have returned.

2. vnodesetup -j [-b] [-h] [-k] <vnodeid>

The -j says this is a jailed vnode. -[bhk] are for boot, halt, kill as in bootvnodes. Here an action must always be given.

Booting. In theory, the first thing we do is fork a child process to continue the boot process in the background and the parent exits immediately. However, this causes bootvnodes to fire off all the vnodesetups concurrently, which proved to be a bad thing for reliability. So now, the parent process waits for 1 minute or until it sees that the vnode has gotten as far as firing up its watchdog process (whichever comes first) before exiting. This throttles the concurrency somewhat.
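The "wait for 1 minute or until the watchdog is up" throttle is a poll with a deadline. A minimal Python sketch of that pattern follows. How the parent actually detects the watchdog is not stated here, so the check is passed in as a hypothetical predicate; the name wait_for and the 1 second poll interval are likewise assumptions.

```python
import time

def wait_for(predicate, timeout=60.0, interval=1.0):
    """Poll predicate() until it returns true or timeout seconds elapse.

    Returns True if the condition was seen, False if we gave up.
    The caller proceeds (exits) either way, which is what bounds the
    delay per vnode.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    # One last check so a condition met during the final sleep is not missed.
    return bool(predicate())
```

With something like this, firing off N vnodes takes at most about N minutes in the worst case, but far less when each watchdog comes up quickly.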

The child process daemonizes itself (creating a new process group for it and its descendants). Then it sets up to catch signals, informs the testbed via tmcc that the vnode is booting, and populates the vnode configuration directory using libsetup::vnodejailsetup. This config directory (/var/emulab/jails/<vnodeid>) most importantly contains the jailconfig file, which in turn contains the key=value pairs returned by the tmcc "jailconfig" command. Finally the child forks again. The original child process (now called the "parent") just waits around until it receives a signal or its child exits. The former case is how jails are killed off (explained later). The parent's pid is recorded in /var/run/tbvnode-<vnodeid>.pid.

The new child process now just exec's /usr/local/etc/emulab/mkjail.pl (which on Linux is currently symlinked to mkvserver.pl).

So at this point, there is a watchdog vnodesetup process waiting, and a worker child mkjail process doing the rest of the work.

Halting and Killing. To halt or kill a vnode, vnodesetup reads the watchdog pid file (/var/run/tbvnode-<vnodeid>.pid) written when the vnode was started and sends a TERM (halt) or USR1 (kill) signal to that process. Then it waits around for up to 30 seconds for the pid file to be removed, indicating that the vnode was stopped. Currently if the pid file is not removed in that time, we exit(0) anyway.
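The halt/kill handshake above can be sketched in Python as follows. This is an illustration under assumptions, not the real vnodesetup code: the function names, the run_dir and send parameters (there so the sketch can be tried without touching /var/run or real processes), and the 1 second poll interval are all made up. The pid file name, the TERM/USR1 mapping and the 30 second limit come from the text.

```python
import os
import signal
import time

# halt -> TERM, kill -> USR1, as described for the watchdog process.
ACTION_SIGNALS = {"halt": signal.SIGTERM, "kill": signal.SIGUSR1}

def pidfile_path(vnodeid, run_dir="/var/run"):
    return os.path.join(run_dir, f"tbvnode-{vnodeid}.pid")

def stop_vnode(vnodeid, action, run_dir="/var/run",
               timeout=30.0, interval=1.0, send=os.kill):
    """Signal the watchdog, then wait for its pid file to disappear.

    Returns True if the pid file was removed within timeout seconds.
    The real tool exits 0 either way; here the caller decides.
    """
    path = pidfile_path(vnodeid, run_dir)
    with open(path) as f:
        pid = int(f.read().strip())
    send(pid, ACTION_SIGNALS[action])
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not os.path.exists(path):
            return True
        time.sleep(interval)
    return not os.path.exists(path)
```

Returning the outcome rather than always reporting success would let a caller such as bootvnodes notice vnodes that failed to stop.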

When the watchdog process receives the signal, it calls the "cleanup" function (via "fatal"). Cleanup informs the testbed via tmcc that we are going down, and sends a signal on to the worker mkjail process which (as we will see) is still around as well. The signals are different here, however: we send a USR1 if we are just halting the vnode, a HUP if we are destroying it. Cleanup then waits for the mkjail process to die and then sends a TERM signal to the whole process group for good measure [what does this kill off?]. Finally, if this is a kill operation, it removes the whole /var/emulab/jails/<vnodeid> hierarchy (carefully--it will fail if there are any loopback mounts left over) and removes the pid file.
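Since the signal received by the watchdog and the signal it forwards to mkjail differ, it helps to write the translation down as a table. The Python sketch below just restates the mapping from the text; the names WATCHDOG_FORWARD and forward_for are hypothetical.

```python
import signal

# Signal received by the watchdog -> (meaning, signal forwarded to mkjail).
WATCHDOG_FORWARD = {
    signal.SIGTERM: ("halt", signal.SIGUSR1),  # halt: keep the vnode disk
    signal.SIGUSR1: ("kill", signal.SIGHUP),   # kill: destroy the vnode
}

def forward_for(received):
    """Return (action, signal_to_send) for a signal seen by the watchdog."""
    return WATCHDOG_FORWARD[received]
```

Note that USR1 means "kill" when sent to the watchdog but "halt" when sent to mkjail, which is easy to get backwards when signalling these processes by hand.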

3a. BSD: mkjail.pl -p <exppid> -h <hostname> <vnodeid>

mkjail.pl is only called to boot a jail-based vnode. Halting vnodes is handled by catching signals in this creation invocation, not by re-invoking it.

mkjail sucks in the contents of the jailconfig file created by vnodesetup and uses that info to configure the local kernel to allow the jail to be created as necessary (e.g., using a sysctl to allow jails to use BPF) and to build up a command line for actually creating the jail.
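Reading the jailconfig file amounts to parsing key=value lines. The exact file syntax is not described here, so the following Python sketch assumes one pair per line, optional double quotes around the value, and that blank lines and #-comments can be ignored. The function name is hypothetical and no real jailconfig keys are implied.

```python
def parse_jailconfig(text):
    """Parse key=value lines into a dict of strings.

    Assumptions (not confirmed by the text): one pair per line, optional
    surrounding double quotes on the value, blank and '#' lines skipped,
    and lines without '=' silently ignored.
    """
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        config[key.strip()] = value.strip().strip('"')
    return config
```

The resulting dictionary is what would drive both the kernel tweaks (sysctls) and the construction of the jail command line.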

mkjail then either creates or "restores" the filesystem namespace for the vnode. Creation involves setting up a "vnode disk" for mutable local file systems and then loopback mounting /usr and NFS mounted filesystems. Network interfaces are also set up at this point, but these aspects of jail setup are covered in the online docs.

Now mkjail starts up a proxy instance of tmcc which listens on a unix domain socket shared with the soon-to-be jail. The purpose of this proxy is now lost in the mists of time. It is not a performance proxy, as all requests from inside are just forwarded on to boss, there is no caching or aggregation. It appears to be more of a security issue for remote jails (i.e., the never deployed jails on RON nodes), but I am not certain about that.

Finally, mkjail forks and once again the parent sits around waiting for children to die (tmcc proxy or this child) or for a signal, while the child goes off and actually execs the "jail" command which starts up the jail running /etc/jail/injail. Here the parent remembers the pid of both the tmcc proxy and the forked jail. If the tmcc proxy terminates, it restarts it. If the jail terminates, we clean up and exit. If a signal is received, we again differentiate halting (USR1) from destroying (INT or HUP).
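The decisions the mkjail parent makes in its wait loop can be summarized as a small dispatch function. The Python sketch below only models the decision, not the actual forking and waiting; the event encoding and the action names are made up for illustration, while the rules themselves are the ones just described.

```python
import signal

def next_action(event, proxy_pid, jail_pid):
    """Decide what the mkjail parent does for one event.

    event is a hypothetical tuple: ("exit", pid) when a child has died,
    or ("signal", signum) when a signal was caught.
    """
    kind, value = event
    if kind == "exit":
        if value == proxy_pid:
            return "restart_proxy"      # tmcc proxy died: start a new one
        if value == jail_pid:
            return "cleanup_and_exit"   # the jail itself is gone
        return "ignore"                 # some other child
    if kind == "signal":
        if value == signal.SIGUSR1:
            return "halt"               # stop the jail, keep the vnode disk
        if value in (signal.SIGINT, signal.SIGHUP):
            return "destroy"            # stop the jail and destroy its disk
        return "ignore"
    raise ValueError(f"unknown event kind: {kind!r}")
```

For example, next_action(("exit", 101), proxy_pid=101, jail_pid=202) selects the proxy restart, while ("signal", signal.SIGHUP) selects teardown.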

Cleanup consists of killing the tmcc proxy, sending a USR1 to the jail init process (see below), waiting for it to terminate, and then undoing all the mounts and interface setup. Additionally, if the jail is being torn down, the per-vnode disk is destroyed as well.

At this point in the creation process, there are now vnodesetup and mkjail processes both just waiting for vnode termination.

3b. BSD: injail

Finally we are actually running in the jail environment. injail is just a mini-version of /sbin/init whose main job is to fork and fire off /etc/rc in the child. The parent process then--you guessed it--waits around for the child to terminate or for a signal in order to shut down all jailed processes. A signal can be sent either from outside (mkjail.pl) or inside (shutdown).

So at the end of the day there are *three* processes whose sole job is to wait for termination or signals: vnodesetup and mkjail.pl outside the jail, and injail inside.

4a. Linux: mkvserver.pl -p <exppid> -h <hostname> <vnodeid>

mkvserver.pl is derived from mkjail.pl but is for vservers (duh!). It is only called to boot an Emulab cluster vserver-based vnode; it is not used for planetlab or other remote vserver setups. As with mkjail.pl, mkvserver.pl is only used to start vnodes. There is no separate invocation for halting them. Halting is done by sending a signal to the creation invocation.

mkvserver sucks in the contents of the jailconfig file created by vnodesetup and uses that info to configure the local kernel to allow the vserver to be created as necessary. In the vserver case, we need (first time only) to create a control net bridge to which all per-vnode cnet devices (tunnels) are attached. We then create the control net tunnel device for this vnode and attach it.

mkvserver then either creates or "restores" the filesystem namespace for the vnode. Creation currently involves "building" a vserver and then creating a copy of the mutable local file systems and loopback mounting /usr and NFS mounted filesystems.