NetBooting

Network booting techniques used in Emulab
How things boot: a brief summary of the various techniques we use to
network boot nodes and load MFSes.

0. What we can boot.

   All of our nodes are configured to network (PXE) boot and, wherever
   possible, to NEVER boot from the hard drive. The latter is to ensure
   that if the network boot fails, we won't boot some potentially unknown
   OS from the hard drive and that we will try the network boot again.
   In the olden days, for nodes that we could not prevent from falling
   back to a disk boot, we would install an MBR boot loader that just
   immediately rebooted (well, it would give the user a prompt for a
   couple of seconds and then reboot--just in case they really wanted
   to boot from disk).

   The first level boot loader, loaded via PXE, is typically specified
   via DHCP as the "filename" argument. However, this is not always true
   as we will see below. This "pxeboot" will then talk to Emulab via
   the bootinfo protocol to see what to do next. The options are:

   * Boot from a disk/partition combo.
     The disk number is a BIOS disk number (e.g., 0x80). If no disk is
     specified, then boot from disk 0.

   * Boot from a disk/partition with a kernel command line.
     This was mostly useful for OSKit kernels "back in the day", but is
     also used to select an alternate kernel or select the HZ rate of
     a delay node. The actual semantics of the command line depend on
     the combination of the boot loader and kernel. At the very least
     an alternate kernel name (the first arg) should work. Additional
     command line arguments may or may not make it to the kernel and
     some may be interpreted as environment settings in the loader and
     those may or may not make it to the kernel. In other words, use
     of a command line for anything other than setting the HZ rate of
     a delay node kernel may or may not work!

   * Boot a kernel using a memory-based root filesystem (MFS).
     MFSes are used typically when loading a disk or creating an image
     of a disk. The former, the so-called "frisbee" MFS, is not intended
     to be a general OS; its sole purpose is to run the frisbee disk
     loader. The latter is known as the "admin" MFS because it can be
     used for purposes other than just capturing a disk image; e.g., if
     you have screwed up the disk somehow and want to get on and look at
     or fix it. This MFS is intended to be more general purpose. There
     is a third "newnode" MFS that is used only when a machine is first
     booted, to report info back to Emulab about this new node.

   Two state transitions are related to the initial boot process:
   PXEBOOTING and BOOTING. The first says that a node has made a PXE
   request, the latter says that the OS is booting. Since the node
   cannot self-report these transitions (as it does all others) we
   have to report them by proxy. Typically, but not always, the
   bootinfo server on boss is the proxy. Whenever a node makes a
   "what do I do next" request, the server reports both transitions.
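
   The three outcomes above can be sketched as a small shell function.
   This is only an illustration: the argument layout and the action names
   it prints are made up for the example and are not the bootinfo wire
   format or the actual pxeboot code.

```sh
# Hypothetical sketch: classify a "what do I do next" reply into one of
# the three boot actions described above. Argument order and the printed
# action names are illustrative only.
#
# usage: boot_action <mfs-path-or-empty> <disk-or-empty> <partition> [cmdline]
boot_action() {
    mfs=${1:-}
    disk=${2:-0}        # no disk specified means disk 0, per the text
    part=${3:-}
    cmdline=${4:-}

    if [ -n "$mfs" ]; then
        # Kernel plus memory-based root filesystem, fetched over the network.
        echo "mfs $mfs"
    elif [ -n "$cmdline" ]; then
        # Disk boot with a kernel command line (e.g. a delay node HZ setting).
        echo "partition-cmdline $disk:$part $cmdline"
    else
        # Plain disk boot.
        echo "partition $disk:$part"
    fi
}
```

   For example, "boot_action '' 0x80 2 kern.hz=10000" prints
   "partition-cmdline 0x80:2 kern.hz=10000".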

1. The "classic" Emulab legacy BIOS boot path.

   The original and still most heavily used bootstrap path involves PXE
   booting via the BIOS, a FreeBSD-based boot loader, and FreeBSD-based
   MFSes. In this path, /tftpboot/pxeboot.emu is specified in the dhcpd.conf
   file as the program to download. pxeboot is a PXE-savvy version of
   the FreeBSD "stage 2" boot loader which has been modified to talk
   to the bootinfo server and handle the boot scenarios above.

   If bootinfo tells the loader to boot from a disk without any additional
   command line arguments, then pxeboot loads the first sector of that
   partition (or MBR) and jumps to it. Job done. If there is a command line,
   and we are booting FreeBSD, then the loader can directly boot the
   kernel (first arg), after first parsing the remaining arguments
   converting key=val strings into loader variables that can be passed to
   the kernel. This is how delay nodes get booted--we specify either a
   custom kernel name or we pass "kern.hz=10000" as an argument.
   However, if it is a Linux kernel being booted, then the presence of a
   command line will likely cause the boot to fail. You cannot even specify
   an alternate kernel. (Something to keep in mind if we ever want a Linux
   delay node!) Note that the only case where a Linux command line works
   is if the on-disk boot loader is LILO. pxeboot has code to pass LILO
   command line arguments. But only very ancient images have LILO boot blocks.

   Otherwise pxeboot uses TFTP to load a kernel and "mfsroot". pxeboot will
   only load FreeBSD-based MFSes because it only knows how to direct boot a
   FreeBSD kernel. Thus the FreeBSD pxeboot must be paired with FreeBSD MFSes.
   MFS booting basically works by fooling pxeboot into thinking that, e.g.,
   boss:/tftpboot/frisbee is the root filesystem. Hence it tries to read the
   normal boot time things out of "/boot", including loader.conf and the
   kernel. loader.conf contains special variables to tell it to use a file as
   an in-memory root filesystem ("mfsroot"). So pxeboot.emu reads all these
   files from "the filesystem" which is handled behind the scenes via the
   libstand TFTP code. Once the kernel is booted, the OS looks pretty normal,
   modulo the fact that the command set is extremely limited and the disk is
   extremely small.
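
   For reference, the stock FreeBSD loader variables for an in-memory root
   filesystem look roughly like what the sketch below prints. The variable
   names are the standard FreeBSD ones, but the actual loader.conf files
   under /tftpboot are not reproduced here; the function name and the
   default path are assumptions made for the example.

```sh
# Hypothetical helper: print a loader.conf fragment that tells the FreeBSD
# loader to preload a file and use it as a memory-based root filesystem.
# The real Emulab loader.conf files may contain different or extra settings.
#
# usage: mfs_loader_conf [path-to-mfsroot]
mfs_loader_conf() {
    cat <<EOF
mfsroot_load="YES"
mfsroot_type="mfs_root"
mfsroot_name="${1:-/boot/mfsroot}"
vfs.root.mountfrom="ufs:/dev/md0"
EOF
}
```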

2. The "Linux MFS" BIOS boot path.

   Back around 2008, Ryan Jackson made a valiant effort to move us out of
   the FreeBSD world and into Linux by creating a boot chain that used
   Linux-related tools. These include a version of Grub (pre-2.0) modified
   to support our bootinfo protocol, a custom grub.cfg to handle booting
   from a partition or an MFS, a busybox-based Linux filesystem that serves
   as the MFS, and a Linux 3.2.7 kernel.

   PXE downloads the custom Grub (/tftpboot/pxeboot_grub2pxe) specified
   in the DHCP config file. It uses /tftpboot/grub2/grub.cfg as its initial
   config. This is the custom script which invokes bootinfo and then either
   boots from a partition or loads an MFS.

   For the MFS case, it reloads a config file from one of
   {admin,frisbee,newnode}_linux and that config file loads the Linux kernel
   and initrd (MFS). The kernel is passed a special "elab_mode=" command
   line parameter set to one of "frisbee", "admin", or "newnode" so that
   only one kernel and MFS is needed for all three uses. This one MFS is a
   much more complete system than what any of the FreeBSD MFSes provide.
   However, it is busybox so many of the standard commands are subsets of
   the real versions.
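
   A minimal sketch of how a startup script inside the shared MFS might
   pick out the mode, assuming the parameter appears on the kernel command
   line as a plain "elab_mode=value" word. The function name is
   hypothetical and the real MFS scripts may do this differently.

```sh
# Hypothetical helper: extract the value of elab_mode= from a kernel
# command line string. Prints the mode and returns 0 if found, else
# returns 1. The argument is deliberately split on whitespace.
#
# usage: elab_mode_from_cmdline "<kernel command line>"
elab_mode_from_cmdline() {
    for arg in $1; do
        case $arg in
            elab_mode=*)
                echo "${arg#elab_mode=}"
                return 0
                ;;
        esac
    done
    return 1
}
```

   On a running Linux system this would be called as
   elab_mode_from_cmdline "$(cat /proc/cmdline)".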

   For booting a FreeBSD system from disk, grub uses the kFreeBSD command
   to load /boot/loader from the disk passing along environment variables
   via kFreeBSD.* variables.

   *** TODO: figure out exactly what command lines we can handle and
       finish this section. ***

   For booting from disk without a command line in Linux, we typically just
   chainload the partition boot block. ...

   One note on chain booting: in order to pass (command line?) arguments to
   the FreeBSD bootloader from Grub, we had to hack a special version of the
   BSD bootloader for some older FreeBSD images. This is /boot/emuboot in those
   images. I am not sure if this is needed for newer versions of the loader.

3. The Moonshot m400 ARM U-boot/pxelinux boot path.

   In 2014, we got the HP Moonshot ARM cartridges that use U-boot and
   PXE boot via a builtin version of pxelinux. Since there is no working
   version of FreeBSD for these boxes, we use Linux-based MFSes--one an
   initramfs (frisbee) and the other an NFS-mounted root filesystem (admin).

   When booted, the nodes still DHCP, but they now ignore the "filename"
   argument that is returned. Instead, the first contact is via a TFTP read
   of a file from /tftpboot/pxelinux.cfg. There is a sequence of files it
   attempts to read starting with the very specific (a file name that is
   the same as the MAC address) up to a generic "default". We use the
   individual files named after the MAC that are cloned from a template
   file /tftpboot/pxelinux.cfg/boot.template. This template has different
   menu entries for booting from the disk (via the boot partition which is
   also the root partition for us), booting from NFS (the admin MFS), or
   booting from an initramfs (frisbee MFS). It also has a special PXEWAIT
   entry, but we won't talk about that. Whenever a node changes its boot
   designation (e.g., set/clear node_admin, during reloading) we create
   a new version of the template with the correct default boot menu item
   (and correct NFS root FS path for NFS booting).
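
   The "most specific to most generic" file search can be sketched as
   follows. This follows the documented PXELINUX convention (MAC address
   prefixed with the ARP type "01", then the IPv4 address in uppercase hex
   with progressively fewer digits, then "default"). The exact list tried
   by the U-boot firmware on these nodes is an assumption; it may try
   additional names.

```sh
# Sketch: list the config file names a PXELINUX-style client would try,
# in order. Paths are relative to the TFTP root.
#
# usage: pxe_candidates <mac, e.g. 88:99:AA:BB:CC:DD> <ipv4 dotted quad>
pxe_candidates() {
    # Lowercase hex with dashes, prefixed by ARP hardware type 01 (Ethernet).
    mac=$(printf '%s' "$1" | tr 'A-F:' 'a-f-')
    echo "pxelinux.cfg/01-$mac"

    # Split the dotted quad into the positional parameters.
    oldifs=$IFS
    IFS=.
    set -- $2
    IFS=$oldifs

    # Eight uppercase hex digits, then drop one trailing digit at a time.
    hex=$(printf '%02X%02X%02X%02X' "$1" "$2" "$3" "$4")
    while [ -n "$hex" ]; do
        echo "pxelinux.cfg/$hex"
        hex=${hex%?}
    done

    echo "pxelinux.cfg/default"
}
```

   Since we always create the per-MAC file, only the first candidate should
   ever be read in practice.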

   One interesting aspect of the PXE boot is that, since bootinfo is not
   called on the boot path, we handle sending the initial PXEBOOTING and
   BOOTING state transition events via dhcpd. The dhcpd.conf file has an
   "on commit" section that allows us to invoke a script whenever a client
   has accepted a lease. In our case we call /usr/testbed/sbin/reportboot
   which will send the appropriate events. Since the Utah Moonshot cluster
   now supports both ARM and x86 (see #5 below), we have to be careful to
   only send these events from dhcpd for U-boot nodes.

   The frisbee MFS is a more conventional Linux initramfs, but it is
   utterly unrelated to the "Linux MFS" (#2) that we use on the x86 nodes.
   Trying to recreate Ryan's build environment for the ARM architecture was
   a non starter. Instead, I started with the default "initrd" and just
   kept adding stuff til rc.frisbee worked! So it is just as, or even more,
   limited than the FreeBSD frisbee MFS.

   I took a different tack for the admin MFS. Rather than trying to build
   up the initramfs further (with perl, etc.) I took advantage of two things:
   1) the fact that we had an NFS bootable version of Ubuntu 14 that we got
   from HP and 2) the fact that we are using ZFS on the fs node making cloned
   filesystems fast and easy. The z/nfsroot/m400 zfs, mounted at /nfsroot/m400,
   is the base volume for the m400 admin FS. z/nfsroot/m400@current is the
   snapshot which is cloned on demand to create filesystems for the individual
   nodes; e.g., z/nfsroot/ms0102 would be the volume for ms0102 when it is
   in admin mode. So the node_admin command will "zfs clone" a new version
   of the snapshot when admin mode is entered, and "zfs destroy" when done.
   This is triggered if the osid for the admin MFS contains the path
   "fs:/nfsroot".

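   The clone/destroy cycle described above can be sketched as two small
   helpers. The function names and the ZFS dry-run variable are invented
   for this example; node_admin itself may do more (setting mount or share
   properties, for instance), which is not shown.

```sh
# Hypothetical helpers mirroring what node_admin is described as doing.
# ZFS defaults to "echo zfs" so the commands are printed rather than run;
# set ZFS=zfs on the fs node to actually execute them.
ZFS=${ZFS:-echo zfs}

# usage: admin_fs_create <node>     (entering admin mode)
admin_fs_create() {
    $ZFS clone z/nfsroot/m400@current "z/nfsroot/$1"
}

# usage: admin_fs_destroy <node>    (leaving admin mode)
admin_fs_destroy() {
    $ZFS destroy "z/nfsroot/$1"
}
```

   With the default dry-run setting, "admin_fs_create ms0102" prints
   "zfs clone z/nfsroot/m400@current z/nfsroot/ms0102".
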
4. The Intel NUC got-a-lame-BIOS boot path.

   In 2015, we got some Intel NUCs for PhantomNet. The plan here was just
   to use the traditional boot path (#1) but we ran into a problem where,
   when configured for legacy BIOS and booting via the network, the SATA
   (AHCI?) option ROM (aka, driver) was not loaded. So we could not access
   the disk from the PXE booted loader. This is fine for the MFSes, which
   never access the disk until the OS is loaded, but made it impossible to
   boot from a disk partition since we could not load the boot sector.

   Fortunately, Grub2 has the option to use "native" device drivers rather
   than the BIOS, so using the Linux MFS (#2) seemed feasible...except that
   the pre-2.0 version of Grub Ryan used did not support native SATA. Thus
   I embarked on a project to figure out how Ryan built pxeboot_grub2pxe
   and transfer his changes into Grub 2.02. The result is now in the
   https://gitlab.flux.utah.edu/emulab/emulab-grub2.git repo. To not clash
   with the older grub-pre-2 install, we set the filename differently in
   dhcpd.conf. For these nodes we use /tftpboot/grub2pxe-native-vga/grub2pxe
   as the filename. That directory also contains a further tweaked version
   of grub.cfg which is configured to use native disk drivers via a variable
   setting at the top of the file. The naming convention for directories:
   grub2pxe--, allows us to have a different grub.cfg in
   each, specifying the drivers to use ("bios" or "native") and the console
   ("vga", "sio1", "sio2") via variables. They share modules, fonts, etc.
   via symlinks to the grub2.0 directory.
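
   As a sketch of how the DHCP "filename" follows from the driver and
   console choice, assuming the directory name is simply "grub2pxe", the
   driver, and the console joined with dashes (as in the
   grub2pxe-native-vga example above). The function name is hypothetical,
   and "efi" is the additional driver value used by the UEFI path in #5.

```sh
# Hypothetical helper: build the DHCP "filename" for a node from its
# driver type and console, rejecting values not mentioned in the text.
#
# usage: grub2pxe_filename <bios|native|efi> <vga|sio1|sio2>
grub2pxe_filename() {
    case $1 in
        bios|native|efi) ;;
        *) echo "unknown driver: $1" >&2; return 1 ;;
    esac
    case $2 in
        vga|sio1|sio2) ;;
        *) echo "unknown console: $2" >&2; return 1 ;;
    esac
    echo "/tftpboot/grub2pxe-$1-$2/grub2pxe"
}
```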

   The MFSes used are just the VGA versions of the ones discussed in #2
   (/tftpboot/admin_linux_vga and /tftpboot/frisbee_linux_vga -- I hope
   you were not looking for a consistent naming convention from me!)

   Booting from disk however is harder. We cannot just chainload the on-disk
   loader since it most likely expects to use the BIOS disk interface.
   So for Linux, we just boot the kernel directly. We load the grub.cfg
   file from the disk and hope the config file is compatible with our
   version of Grub! For FreeBSD...well, we just don't boot FreeBSD on
   these nodes. Grub2 is capable of directly booting FreeBSD kernels,
   so it would be possible.

5. The Moonshot m510 x86 UEFI-only boot path.

   Now we are in 2016 and we have yet another variant! The new HP Moonshot
   x86 cartridges are UEFI only, no legacy BIOS setting. So it is time to
   suck it up and figure out how to handle UEFI. The goal is to have common
   images that boot on both BIOS and UEFI machines. We can actually pull this
   off if we never want to boot directly from the disk. When booting from
   disk with UEFI, you not only have to have a boot loader that speaks the
   UEFI API but also a dedicated boot partition formatted with a FAT
   filesystem that contains the boot loader. Making that work, along with
   having an MBR boot block with a BIOS-speaking boot loader was going to
   be a task. Fortunately, it is simpler if we assume that we always boot
   from the network (we do). Now we just have to load an EFI-savvy boot
   loader via PXE (which UEFI still supports) and then (hopefully) just
   use that to directly load/boot the kernel and MFS. My travails are
   documented in https://gitlab.flux.utah.edu/emulab/emulab-devel/issues/53.

   We wound up needing to use Grub here as well for the first level boot.
   But of course Grub has to be compiled differently to get EFI support.
   So now we have a new "driver" category: "efi". For these particular
   nodes the DHCP "filename" is set to /tftpboot/grub2pxe-efi-sio1/grub2pxe.
   And grub.cfg needed some further tweaks. To boot FreeBSD from disk,
   we just need to chainload /boot/loader.efi from disk. It exists in
   all our existing images for newer versions of FreeBSD.

   Since I just could not stomach the prospect of upgrading the Linux MFS,
   or at least the kernel, to support newer hardware (NVME disk, Mellanox
   10Gb Ethernet), I took advantage of Grub's support for booting FreeBSD
   and figured out how to load a FreeBSD kernel and MFS with Grub.



How to make things.

1. pxeboot.emu, the FreeBSD-based PXE boot loader.

   pxeboot.emu is still built using the FreeBSD 7.2 sources from
   ops.emulab.net:/share/freebsd/7.2/src/sys/boot/i386/emuboot along with  
   a tweaked version of libstand. IMPORTANT: these changes only exist in
   that source tree (and the corresponding RCS directories). They have
   never been isolated and put in a git repo anywhere! To rebuild pxeboot,
   you will have to allocate a node running the FBSD72-STD image (a 32-bit
   version of FreeBSD 7.2--anybody see any weak links in this chain?),
   login and:

   mount -o ro fs:/share/freebsd/7.2/src /usr/src
   cd /usr/src/lib/libstand
   make obj all install
   cd /usr/src/sys/boot
   make obj all
   cd /usr/obj/usr/src/sys/boot/i386/emuboot
   cp pxeboot 

   By default this will create the "sio1" serial port console version.
   To create "null", "vga", "sio[234]" versions you will need to tweak
   /share/freebsd/7.2/src/sys/boot/i386/Makefile.inc as follows:

   "vga" version: comment out:

   FORCE_SERIAL_CONSOLE=      1
   BTX_SERIAL=                1
   BOOT_PXELDR_ALWAYS_SERIAL= 1

   and build.

   "null" version: uncomment:

   #CFLAGS+=               -DNONINTERACTIVE

   To be safe, also comment out the serial console lines as per the "vga"
   build above.

   "sio2" version: make sure the three serial console lines above are
   *not* commented out and then uncomment the "2f8" line from:

   #BOOT_COMCONSOLE_PORT=  0x2f8
   #BOOT_COMCONSOLE_PORT=  0x3e8
   #BOOT_COMCONSOLE_PORT=  0x2e8

   "sio3" version: include serial console lines, uncomment the "3e8" line.

   "sio4" version: include serial console lines, uncomment the "2e8" line.

   That is it! You should now have all six versions of pxeboot.
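
   For quick reference, the serial console names map to the conventional
   PC COM port I/O addresses as sketched below. The function name is made
   up for this example; 0x3f8 is the usual COM1 address and is assumed to
   be what the default "sio1" build uses when no BOOT_COMCONSOLE_PORT line
   is uncommented.

```sh
# Hypothetical helper: map a serial console name to the I/O port that
# BOOT_COMCONSOLE_PORT should be set to for that build.
#
# usage: comconsole_port <sio1|sio2|sio3|sio4>
comconsole_port() {
    case $1 in
        sio1) echo 0x3f8 ;;   # COM1, assumed default
        sio2) echo 0x2f8 ;;   # COM2
        sio3) echo 0x3e8 ;;   # COM3
        sio4) echo 0x2e8 ;;   # COM4
        *) return 1 ;;
    esac
}
```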

2. The FreeBSD MFSes.

   The FreeBSD MFSes were hand-rolled long ago (circa 2000) and have been
   lovingly maintained ever since. They all started life as FreeBSD 6(?)
   filesystems. Since then, they have been hand updated a couple of times.
   Once to move them to FreeBSD 8 binaries and once to create a version
   with 64-bit binaries. They are waaay past due for another update, or
   even better, a reproducible reimplementation using NanoBSD or something.

   The disk-loading ("frisbee") MFS was mercilessly (and non-systematically)
   hacked to remove all packages and standard binaries that were not needed
   to run our rc.frisbee script. It is truly a unicorn--a 12MB filesystem
   (4MB compressed) that is fast to download with TFTP. In fact, it is now
   smaller than the kernel that is downloaded with it. The frisbee MFS does
   NOT include perl or ssh among many other utilities normally thought of as
   indispensable.
   
   The admin ("freebsd") MFS was basically the same, but without taking out
   the more useful things like perl and ssh. It is a 40MB filesystem.

   The "newnode" MFS, used only when adding new nodes to the testbed, is very
   similar to the admin MFS--basically the same with different Emulab apps
   loaded.

   "Rebuilding" these basically means updating the Emulab client-side that
   is installed. Start with a FBSD83-STD or FBSD83-64-STD image and copy over
   the "mfsroot" files from /tftpboot/{frisbee,freebsd,freebsd.newnode}/boot.
   Then set up an Emulab clientside build tree ala:

   mkdir obj; cd obj
   /clientside/configure --with-TBDEFS=/defs-utahclient

   and for each MFS (e.g. frisbee):

   mdconfig -a -t vnode -f mfsroot.frisbee -u 5
   mount /dev/md5 /mnt
   cd 
   setenv DESTDIR /mnt
   make frisbee-mfs-install
   umount /dev/md5
   mdconfig -d -u 5

   And then copy back the updated mfsroot file. There are make targets
   "mfs-install" (admin MFS) and "newnode-mfs-install" (newnode MFS) for the
   others.
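
   The per-MFS steps can be wrapped up as in the following sketch, which
   only prints the commands for review. The function names are invented
   for this example, the clientside object directory is passed in as an
   argument because its location depends on where you configured the
   build, and the printed commands use sh syntax rather than the csh
   "setenv" shown above.

```sh
# Hypothetical helper: map a /tftpboot MFS directory name to the
# clientside make target that installs into that MFS.
mfs_install_target() {
    case $1 in
        frisbee)         echo frisbee-mfs-install ;;
        freebsd)         echo mfs-install ;;
        freebsd.newnode) echo newnode-mfs-install ;;
        *) return 1 ;;
    esac
}

# Print (not run) the update commands for one MFS.
#
# usage: mfs_update_cmds <mfs-dir-name> <mfsroot-file> <clientside-objdir>
mfs_update_cmds() {
    target=$(mfs_install_target "$1") || return 1
    cat <<EOF
mdconfig -a -t vnode -f $2 -u 5
mount /dev/md5 /mnt
cd $3 && env DESTDIR=/mnt make $target
umount /dev/md5
mdconfig -d -u 5
EOF
}
```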

   If you only need to update a shell/perl script, then you don't need
   a FreeBSD 8 node. You can just mount up the MFS and copy in the script
   by hand.

   It is possible to pair one of a FreeBSD 8, 9, or 10 kernel with the
   mfsroot. We have had to use increasingly newer kernels to pick up
   support for newer hardware over time.

   These kernels used to be of the custom unicorn variety, reflecting just
   the hardware we had, but nowadays we use a variant of the GENERIC config
   so we get a broader range of hardware coverage. There should be a
   {i386,amd64}/conf/TESTBED-PXEBOOT-GENERIC configuration file to use.
   This includes a custom Emulab enhancement (ipod) as well as additional
   drivers and settings we need. To build the kernel, start with a node
   running the appropriate image and then:

   mount fs:/share/freebsd//src /usr/src
   cd /usr/src
   make -j8 buildkernel KERNCONF=TESTBED-PXEBOOT-GENERIC
   cd /usr/obj/usr/src/sys/TESTBED-PXEBOOT-GENERIC
   cp kernel 

3. The Linux MFS.

   Swap in Mike's emulab-ops/Linux-MFS experiment.

   Do magic shit.
   (I think there is a README out in the persistent blockstore that has the
   environment...)

4. The HP Moonshot m400 MFSes.

   How to update the initramfs.
     - unwrap from u-boot, unpack, update, re-pack, rewrap for u-boot
     - all in boss.utah.cloudlab.us:hibler/Moonshot/mfs
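
   A rough sketch of the unwrap and rewrap steps, under several
   assumptions that have not been checked against the scripts in the
   directory above: that the image uses the legacy U-Boot format with a
   64-byte header, that the payload is a gzip-compressed cpio archive, and
   that "arm64" is the right mkimage architecture. Function and file names
   are hypothetical.

```sh
# Strip the 64-byte legacy U-Boot image header, leaving the raw payload
# (assumed to be a gzip-compressed cpio archive).
#
# usage: uboot_unwrap <wrapped-image> <payload-out>
uboot_unwrap() {
    dd if="$1" of="$2" bs=64 skip=1 2>/dev/null
}

# Print (not run) commands to repack a root directory and rewrap it.
# The mkimage options are assumptions; compare against the header of the
# existing image ("mkimage -l") before trusting them.
#
# usage: uboot_rewrap_cmds <rootdir> <wrapped-out>
uboot_rewrap_cmds() {
    cat <<EOF
(cd $1 && find . | cpio -o -H newc | gzip -9) > initrd.gz
mkimage -A arm64 -O linux -T ramdisk -C gzip -d initrd.gz $2
EOF
}
```

   Between the two steps, the payload would be unpacked with something
   like "gunzip -c payload | cpio -idm" inside an empty directory, and the
   updated files copied in.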

   How to update the NFS filesystem.
     - install new stuff into /nfsroot/m400 on fs
       (no makefile target yet, I don't think)
     - move old @current snapshot to @old
     - create new @current snapshot
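
   The snapshot rotation in the last two steps might look like the sketch
   below. The function name and the ZFS dry-run variable are invented for
   this example. Note that the rename will fail if an @old snapshot
   already exists, and an existing @old cannot be destroyed while any
   node's admin filesystem is still a clone of it; handling that is left
   out here.

```sh
# Hypothetical helper: rotate the m400 admin FS snapshots. ZFS defaults
# to "echo zfs" so the commands are printed rather than run; set ZFS=zfs
# on the fs node to actually execute them.
ZFS=${ZFS:-echo zfs}

rotate_m400_snapshot() {
    base=z/nfsroot/m400
    # Existing clones keep working; their origin follows the rename.
    $ZFS rename "$base@current" "$base@old"
    # New admin-mode clones will be made from this fresh snapshot.
    $ZFS snapshot "$base@current"
}
```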

   How to update the kernel.
     - don't! I tried once and broke it.
     - need to have NFS (NFSROOT?) built into the kernel

5. Building "grub2pxe", the modified version of Grub for PXE boot.

   Hopefully you will never have to build the 1.97+ version Ryan used
   for the original Linux MFS environment. The source repo is:
       git-public.flux.utah.edu:/flux/git/users/rdjackso/grub
   and I have it checked out in ops:~mike/grub-ryan.

   For the 2.02 version, the repo is
       http://gitlab.flux.utah.edu/emulab/emulab-grub2.git
   and there is a README.md file that describes how to build and install it.