A quick guide to state transitions in the Emulab boot process
[ Note: we are transitioning from a "BSD-based" boot environment, which uses a modified PXE-booted FreeBSD boot loader and memory-based FreeBSD systems (for admin and disk loading), to a "Linux-based" environment consisting of pxelinux, GRUB and memory-based busybox systems. The following discusses only the BSD environment. ]
PC nodes are configured to boot using PXE on a PXE-enabled NIC. We enable PXE on only the control network interface. PXE uses DHCP to determine what to boot next, and uses TFTP to download that next-level boot program. The Emulab boss node serves as the central DHCP and TFTP server. Nodes are generally configured to boot a modified version of the FreeBSD boot loader called "pxeboot".
Pxeboot queries the boss node to find out what to do next. (The boss node is identified by either the next-server parameter in DHCP or, if next-server is not set, the host that responded to the DHCP request.) This contact is through the Emulab-specific bootinfo protocol. The bootinfo server on boss returns either a disk/partition to boot from or a FreeBSD kernel directory on the TFTP server. In the case of booting from a disk/partition, pxeboot just loads the first sector of the indicated partition and jumps to it. In the case of a kernel directory, it loads the loader.conf file, which tells it which kernel and root filesystem image to load. The filesystem image (the "MFS") is a scaled-back FreeBSD filesystem which runs in RAM.
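The two kinds of bootinfo reply described above can be sketched as follows. This is only an illustration in Python, not the real pxeboot code or the bootinfo wire format: the BootinfoReply class, its field names, and next_boot_step are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootinfoReply:
    """Hypothetical model of a bootinfo reply; exactly one field is set."""
    partition: Optional[int] = None   # disk partition to boot from
    kernel_dir: Optional[str] = None  # FreeBSD kernel directory on the TFTP server


def next_boot_step(reply: BootinfoReply) -> str:
    """Describe what pxeboot does next for the given reply."""
    # A reply names either a partition or a kernel directory, never both or neither.
    if (reply.partition is None) == (reply.kernel_dir is None):
        raise ValueError("reply must name either a partition or a kernel directory")
    if reply.partition is not None:
        return f"load first sector of partition {reply.partition} and jump to it"
    return f"read {reply.kernel_dir}/loader.conf, then load the kernel and MFS it names"
```

For example, next_boot_step(BootinfoReply(partition=1)) describes the disk boot path, while a reply carrying a kernel directory (any path used here is made up) describes the MFS path.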
Free nodes (those not in experiments) sit in pxeboot, waiting for orders from Emulab. These nodes have a default OS image loaded, so that if an experiment allocates a node and wants one of the OSes already loaded on it, the node merely has to boot that OS, giving almost instant startup for such experiments. When a node is freed, it is booted into the disk-reloading MFS, the disk is loaded, and the node is rebooted and reenters pxeboot, where it waits.
When nodes are allocated to, or freed from, experiments, their state is monitored by the central "stated" daemon, which makes sure that each node follows the expected series of steps required to get it to a usable state or get it reloaded. Stated can perform actions on state transitions (correct or not) or when a node fails to make a transition within a reasonable amount of time. A typical action on an invalid transition or a timeout is to reboot the node (i.e., stated assumes the failure is transient). While a node is in an experiment, it continues to be monitored, but the only events of significance are reboots (which trigger no actions), disk reloads (which trigger the state machine described below), or experiment swapouts (which trigger the disk reloading sequence as well).
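The two checks stated makes (is this transition expected, and has the node sat in a state too long) might look roughly like the sketch below. It is not stated's actual code: the function names, the table layout, and the timeout values in the usage note are assumptions made for illustration.

```python
REBOOT = "reboot"  # the typical corrective action named in the text


def check_transition(valid, old_state, new_state):
    """Return an action for a reported transition, or None if it is expected.

    `valid` maps each state to the set of states allowed to follow it.
    """
    if new_state in valid.get(old_state, ()):
        return None
    return REBOOT  # invalid transition: assume the failure is transient


def check_timeout(timeouts, state, entered_at, now):
    """Return an action if the node has been in `state` longer than allowed."""
    limit = timeouts.get(state)  # seconds; states without an entry never time out
    if limit is not None and now - entered_at > limit:
        return REBOOT
    return None
```

With a hypothetical table {"BOOTING": {"TBSETUP"}}, a report of BOOTING followed by ISUP would return "reboot", as would a node that has been in BOOTING for longer than whatever limit is configured for that state.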
The current state of a node is recorded in the database. State transitions may be generated by the node itself using "tmcc" or by the control infrastructure on behalf of a node. Control programs that generate events include: tmcd (events from the node), bootinfo (reboot-related events), and node_reboot (ditto). State transition events are sent from these programs using the event system. Stated listens for all such events (TBNODESTATE). Nodes can query their state through tmcc. Control programs typically query state by accessing it in the DB.
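The flow of state events can be pictured with a toy publish/subscribe model. This is not the Emulab event system API: EventBus, Stated, and the dictionary standing in for the database are hypothetical, and the sketch assumes that the listener is what records the new state. Only the TBNODESTATE event name comes from the text.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process stand-in for the event system (hypothetical)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, node, state):
        # Deliver the event to every handler registered for this type.
        for handler in self._handlers[event_type]:
            handler(node, state)


class Stated:
    """Listens for TBNODESTATE events and records the latest state per node."""

    def __init__(self, bus):
        self.db = {}  # stand-in for the database
        bus.subscribe("TBNODESTATE", self.on_state)

    def on_state(self, node, state):
        self.db[node] = state


bus = EventBus()
stated = Stated(bus)
# A sender such as tmcd or bootinfo would publish on behalf of a node ("pc1" is made up).
bus.publish("TBNODESTATE", "pc1", "SHUTDOWN")
bus.publish("TBNODESTATE", "pc1", "PXEBOOTING")
```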
As mentioned, free nodes sit in pxeboot, looping and waiting for an order. This is the PXEWAIT state. When a node is allocated, the node_reboot script sends a wakeup (via the bootinfo protocol) and changes the node state to PXEWAKEUP. The node will then contact the bootinfo server on boss to see what to do. When the node makes this request, it is moved to the PXEBOOTING state (by the bootinfo server on behalf of the node). After bootinfo determines the response, and if it sends back a reply to boot from either disk or an MFS, it changes the node state to BOOTING. The node stays in this state until the node OS's Emulab client scripts start up. The first such script moves the state to TBSETUP and then the final script moves it to ISUP. If any script fails along the way, it instead moves the state to TBFAILED and aborts setup.
For a normal node reboot, a node will start out in the ISUP state. During the shutdown process, a node sends a shutdown event via tmcc, moving the node to the SHUTDOWN state. When the node reboots and gets into pxeboot, it makes a request to the bootinfo server and is moved to the PXEBOOTING state, and then follows the sequence above (to BOOTING, TBSETUP, and then to ISUP or TBFAILED).
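The allocation and reboot sequences above can be summarized as a transition table. The sketch below lists only the transitions named in the text; the table name and the follows helper are made up for illustration, and the PXEBOOTING to PXEWAIT entry is an inference (a node that is told to keep waiting) rather than something stated explicitly.

```python
# State -> set of states that may follow it, per the prose above.
NORMAL_TRANSITIONS = {
    "PXEWAIT": {"PXEWAKEUP"},             # node_reboot sends a wakeup
    "PXEWAKEUP": {"PXEBOOTING"},          # node contacts the bootinfo server
    "PXEBOOTING": {"BOOTING", "PXEWAIT"}, # PXEWAIT is inferred, see note above
    "BOOTING": {"TBSETUP"},               # first Emulab client script runs
    "TBSETUP": {"ISUP", "TBFAILED"},      # final script succeeds, or a script fails
    "ISUP": {"SHUTDOWN"},                 # shutdown event sent via tmcc
    "SHUTDOWN": {"PXEBOOTING"},           # node reboots back into pxeboot
}


def follows(table, path):
    """Return True if every consecutive pair of states in `path` is allowed."""
    return all(b in table.get(a, ()) for a, b in zip(path, path[1:]))


allocation = ["PXEWAIT", "PXEWAKEUP", "PXEBOOTING", "BOOTING", "TBSETUP", "ISUP"]
reboot = ["ISUP", "SHUTDOWN", "PXEBOOTING", "BOOTING", "TBSETUP", "ISUP"]
```

Both allocation and reboot pass follows(NORMAL_TRANSITIONS, ...), while a path that skips a step, such as PXEWAIT directly to BOOTING, does not.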
When a node is freed via the swapout process, the node is set to use the RELOADING state machine, which drives the node through a different set of transitions. First it is rebooted and told to load and execute the "frisbee MFS." After the BOOTING state, the rc.frisbee client script uses tmcc to report states RELOADSETUP (when it starts), RELOADING (when it actually starts the frisbee client), and RELOADDONEV2 (when post-frisbee tweaks are finished). The final state tells stated to reboot the machine, after which it reenters the PXEBOOTING/PXEWAIT states. [ The reason the node does not reboot itself after finishing is to close a race condition where the node could get back to pxeboot before stated had the chance to receive the DONE message and/or clear the reload-related DB state. ]
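The division of labor in the reload sequence, where the node reports and stated reboots, can be sketched as below. This is not the rc.frisbee script or stated itself: every function and parameter name is hypothetical, and the clear-then-reboot ordering is an assumption drawn from the race condition described above.

```python
def run_reload(report, load_disk, tweak):
    """Node side: report each reload state in order; never reboot locally."""
    report("RELOADSETUP")    # script has started
    report("RELOADING")      # about to start the frisbee client
    load_disk()              # stand-in for running the frisbee client
    tweak()                  # stand-in for the post-frisbee tweaks
    report("RELOADDONEV2")   # finished; the reboot is left to stated


def stated_on_state(state, clear_reload_state, reboot):
    """Server side: on the final state, clear reload DB state, then reboot."""
    if state == "RELOADDONEV2":
        # Clearing first means the node cannot reach pxeboot with stale reload state.
        clear_reload_state()
        reboot()
```

Passing a list's append method as report makes it easy to see the order of reported states, and passing stub callables to stated_on_state shows that nothing happens until RELOADDONEV2 arrives.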