Vnodeimpl

This file describes the implementation of (and a little bit of the rationale for) our FreeBSD-based "virtual node" support.

The summary is that we use a highly customized FreeBSD kernel supporting beefed-up jails, virtual disks, virtual ethernet interfaces, and multiple routing tables to implement reasonably isolated virtual nodes for network-centric activity. We do not yet provide complete resource isolation, in particular to limit CPU or memory consumption.

You may also want to look at the paper describing some of this, including the larger system of which it's a part:

"Large-scale Virtualization in the Emulab Network Testbed" Hibler et al, 2008 Usenix Annual Technical Conference, June 2008. http://www.cs.utah.edu/flux/papers/virt-usenix08-base.html

Rationale

Why jails?

One way to achieve multiplexing of activity on physical nodes is simply to start up multiple copies of an experimenter's desired applications. For example, if they want 10 "nodes" running a background traffic generator, we just start up 10 copies on a single physical node. There are a number of problems with this approach. One is that it will likely require customization of the application, possibly even changing the code itself, to allow it to co-exist with other instances of itself. For example, the application might have a hardwired path in the filesystem for configuration or logging information or require a specific port number. The first step in virtualizing an application's environment is thus to give it its own name space.

BSD jails (jail(2)) serve to restrict a process and all its descendants to a unique slice of the filesystem namespace using chroot. This not only gives each jail a custom, virtual root filesystem (/, /var, etc.) but also insulates it from the filesystem activities of others (and vice versa). Jails also provide a mechanism for virtualizing and restricting access to the network. When a jail is created, it is given a virtual hostname and a set of IP addresses that it can bind to. These IP addresses are associated with network interfaces outside of the jail context and cannot be changed from within a jail. Hence, jails are implicitly limited to a set of interfaces they may use. Further, jails allow processes within them to run as root, albeit a watered-down variant. With root inside a jail, applications can add/change/remove whatever files they want (except for device special files), bind to privileged ports, and kill any other jailed processes.
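
For concreteness, here is roughly what creating a jail looks like with the stock FreeBSD 4.x jail(8) command (the path, hostname and address are made-up examples; our augmented jail takes additional parameters, described later):

  # Root the jail at a directory, give it a hostname and one IP,
  # and run a command inside it.
  jail /var/emulab/jails/v1/root v1.foo.emulab.net 10.1.1.2 /bin/sh /etc/rc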

Why virtual disks?

One potential problem with the filesystem virtualization provided by jails is constraining disk usage. Even though each jail has its own subset of the filesystem name space, that space is likely to be part of a larger filesystem. Jails themselves do nothing to limit how much disk space can be used within the hosting filesystem. Disk quotas aren't useful since, within the jail's name space, files are not restricted to a single uid or even a subset of uids; they can be owned by anyone.

The BSD vnode disk driver (vn(4)) allows us to create a regular file with a fixed size and expose it via a disk interface. These fixed-size virtual disks are used to contain a root filesystem for each jail which is mounted at the root of each jail's name space. Since the virtual disks are contained in regular files, they are also easy and efficient to move or clone.
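
A minimal sketch of creating such a disk with the standard FreeBSD 4.x tools (the size, vn unit, and paths are illustrative):

  dd if=/dev/zero of=root.vnode bs=1m count=64   # fixed-size backing file
  vnconfig -e vn0 root.vnode                     # attach it to a vn device
  newfs /dev/vn0c                                # make a filesystem on the "disk"
  mount /dev/vn0c /var/emulab/jails/v1/root      # becomes the jail's root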

Why virtual ethernet interfaces?

An important part of BSD jails is the ability to restrict them to specific IP addresses. While the jail mechanism does provide some degree of network virtualization as described above, it falls short of what is needed in our environment. In particular, though jails have their own distinct IP addresses, those IP addresses are associated directly with physical interfaces which are shared among all jails. Thus, interface-centric operations such as tcpdump or ipfw/dummynet either are not correctly isolated or require special handling in many different places.

Further, with jails, when packets leave a physical host they lose the identity of the virtual node (jail) that was the most-recent hop of the packet. This identity is essential for handling the "revisitation" problem, where multiple nodes in a topology can be located on the same physical node. The loss occurs because packets from the jails are multiplexed onto physical interfaces. The IP header of a packet does contain a virtual node IP address, but it is that of the originator of the IP packet, not that of the most-recent hop router. The MAC address in the wire packet will be for the physical node, not the virtual node that sent it (since there is no virtual MAC). Preserving the necessary information in the ethernet packet would require help from the switching fabric, either by supporting VLANs or by supporting arbitrary numbers of fake MAC addresses per switch port (where the fake addresses would be derived from the virtual IP addresses). Another problem with the fake MAC address scheme is that broadcast traffic cannot be associated with the correct set of virtual links, since the source virtual MAC address is the tag that is used to multiplex and demultiplex traffic. MAC-level broadcast packets are thus seen by all virtual links sharing a physical link.

We solve all of these problems in one fell swoop, with a virtual ethernet device. The BSD virtual ethernet (veth) driver, which we wrote, is a goofy hybrid of a virtual device, an encapsulating device and a bridging device. It allows us to create lots and lots of ethernet interfaces (virtualization), multiplex them on physical interfaces or tie them together in a loopback fashion (bridging) and have them communicate transparently through our switch fabric (encapsulation).

Virtualization gives us per-jail interfaces above the physical interface to which we can apply jail-specific ipfw/dummynet rules or on which the jail processes can operate. Bridging allows the correct routing of packets at the link level so that virtual interfaces only receive the packets that they should. Two attributes of a virtual ethernet device define the bridged topology. The "parent interface" attribute associates a virtual device with a physical device, allowing for multiplexing/demultiplexing of virtual devices on physical devices. A 16-bit tag, analogous to an 802.1q VLAN tag, identifies a broadcast domain for a set of veths: all veths with the same tag will see each other's traffic. Finally, encapsulation preserves the virtual link information necessary to implement revisitation (the tag) when crossing physical links, without making any assumptions about the switching fabric.
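
To make these attributes concrete, here is a purely hypothetical configuration sketch; the veth driver is custom and this page does not show its real syntax, so the option names below (vethaddr, parent, tag) are assumptions:

  ifconfig veth0 create                          # new virtual interface
  ifconfig veth0 vethaddr 00:00:0a:01:01:02      # virtual MAC (hypothetical option)
  ifconfig veth0 parent fxp1 tag 257             # physical IF + broadcast domain
  ifconfig veth0 inet 10.1.1.2 netmask 255.255.255.0 up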

Why virtual routing tables?

While virtual ethernet devices are sufficient to enable construction of virtual ethernet topologies, they are not sufficient to support arbitrary IP topologies. This is due to shared IP infrastructure, in particular, the routing table. Since the routing table is indexed by destination and returns a next-hop address, it is only possible to have one entry per destination. But with a physical node hosting multiple jails representing different virtual nodes at different points in the topology, we need to be able to support multiple routes to (next hops for) a single destination.

The obvious solution is to virtualize the IP routing table. To do so, we started with the FreeBSD work done by Scandariato and Risso which implements multiple IP routing tables to support multiple VPN end points on a physical node. Routing tables are identified by a small integer routing table id (rtabid). Rtabids are the glue that binds together jails, virtual interfaces and routing tables. Our extended jails each have a unique rtabid, and thus every jail has its own routing table. Virtual ethernet interfaces are likewise associated with an rtabid when they are created, forming a unique set of virtual interfaces per jail. Incoming packets for the same destination can thus be differentiated by the rtabid of the receiving interface and can be routed differently based on the content of the associated routing table.
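
The payoff is that two jails on one physical node can hold different next hops for the same destination. A sketch, with a hypothetical -rtabid flag (the page only says the route command takes an extra argument to select a table):

  route add -rtabid 3 -net 10.1.5.0 -netmask 255.255.255.0 10.1.2.1   # vnode A's next hop
  route add -rtabid 4 -net 10.1.5.0 -netmask 255.255.255.0 10.1.9.1   # vnode B's next hop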

Implementation

All virtual nodes on a physical node will belong to the same experiment. This somewhat eases the immediate burden of providing isolation and also allows us to evade tricky issues of who has what access to the hosting physical node. So a physical node, mapped to an experiment, boots up and eventually runs the bootvnodes script. That script contacts Testbed Central and discovers that it has vnodes to set up. It performs a couple of global "one time" actions: it creates a filesystem on the 4th DOS partition for jail disk space and ensures that sufficient virtual disk (vn) devices exist for all vnodes. It then runs vnodesetup for every vnode.

The vnodesetup script is used to start ("boot"), stop ("halt") and restart ("reboot") vnodes. It is also used to set up vnodes on widearea nodes, but we consider only local cluster nodes here. Vnodesetup forks and runs another script, mkjail.pl, in the child. The parent hangs around and cleans up if the jail dies; it also serves as a focal point for killing the jail, catching signals and forcibly terminating it. It is the parent vnodesetup process that handles informing stated of state transitions in the jail.

In mkjail.pl we finally get down to it. This script builds up the filesystem hierarchy used by the jail, sets up its interfaces (including the virtual control net interface, routes and dummynet delay pipes), and then starts the jail. Note that the filesystem and interfaces are set up outside the jail and passed as parameters into the jail.

The filesystem consists of a per-jail vnode-disk and loopback (null) mounts of various physical node filesystems. The whole shebang is located in /var/emulab/jails/<vnodename>. root.vnode is the regular file which serves as the root disk. It is attached to a vn device that is then mounted at /var/emulab/jails/<vnodename>/root. The mkjail.pl script populates the disk by copying in some directories from the physical node and customizing the content. This loading of the disk happens only on the first boot of the virtual node. Subsequent boots simply mount the vn disk. To save space in the per-jail disk, the binary directories /sbin and /bin are remounted read-only with a loopback mount, as is /usr. The shared /share, /proj, /users and /local directories are also loopback mounted read-write from the physical node. From the perspective of the physical node, there will be at least 8 mounts for every jail: /dev/vn?c (the root), /bin, /sbin, /usr, /proc, /share, /proj/<pid>, and /local/<pid>, plus one in /users for every user in the project (and maybe a /group/<gid> too!).
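
Put together, the per-jail mount sequence looks roughly like this from the physical node (vnode name "v1", vn unit, and project name are hypothetical):

  J=/var/emulab/jails/v1
  mount /dev/vn0c $J/root                        # per-jail root disk
  mount_null -o ro /bin  $J/root/bin             # read-only loopback mounts
  mount_null -o ro /sbin $J/root/sbin
  mount_null -o ro /usr  $J/root/usr
  mount_null /proj/myproj $J/root/proj/myproj    # shared, read-write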

The network setup starts with configuration of the virtual control net interface. This consists of a 172.16/12 address alias assigned to the real control net interface. There will be one such alias per virtual node. These aliases allow us to further isolate jails from each other, as each jail now has a unique address on which to run services. We no longer have to assign port ranges within the primary control net address and run daemons on weird port numbers. We can also use DNS to map symbolic names to the virtual nodes. Note that these addresses are "unroutable" and thus are not exposed outside emulab.net. Accessing services on the nodes from outside requires a proxy on ops.emulab.net, such as an ssh tunnel.
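
Each alias is added to the physical control interface in the usual way, for example for the first two virtual nodes on pc79 (the addressing convention is described under "The virtual control net" below):

  ifconfig fxp0 alias 172.16.79.1 netmask 255.255.255.255
  ifconfig fxp0 alias 172.16.79.2 netmask 255.255.255.255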

For experimental interfaces, mkjail.pl creates an rc file to set up the virtual interfaces and routes, and runs it. A virtual interface's parameters are assigned as follows. We assign veth virtual MACs based on the interface's IP address, in the form: 00:00:IP#0:IP#1:IP#2:IP#3, ensuring uniqueness. We assign veth tags using the subnet part of the interface's IP address. Since we use 10.n.n.h space, where n.n is the subnet, we use the second and third octets. A veth's physical interface is determined by assign. We may use multiple physical interfaces between nodes or we may use no physical interfaces at all. The routing table ID is unique per virtual node on a physical node, so we simply use a per-physical-node counter to assign these when the vnodes are booted. All veths for a virtual node get the same counter value. There is nothing magical in route setup, just an extra argument to the route command to ensure the routes get added to the correct table. Likewise for delay setup, ipfw rules are simply applied to veths rather than physical interfaces. Setting up routes and dummynet outside the jail is largely historical; both could be done from inside.
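
A sketch of those naming conventions for a hypothetical experimental address 10.1.2.7 (we assume the IP octets appear in the VMAC in hex; the page does not say which encoding is used):

  IP=10.1.2.7
  set -- $(echo $IP | tr . ' ')                  # split into octets $1..$4
  VMAC=$(printf "00:00:%02x:%02x:%02x:%02x" $1 $2 $3 $4)
  TAG=$(( $2 * 256 + $3 ))                       # subnet octets -> 16-bit tag
  echo "vmac=$VMAC tag=$TAG"                     # vmac=00:00:0a:01:02:07 tag=258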

Finally, the jail itself is started. Our augmented jail implementation takes some new parameters in addition to a "primary" IP address, the root directory of the jail, and the program to run. The important additional parameter is a list of IP addresses. These addresses, along with the primary, implicitly define which interfaces are accessible to the jail: those to which the IP addresses are assigned. This is analogous to the root directory specification, which determines which mounted filesystems are accessible: those at or below the level of the root directory. The program run by the jail, /etc/injail, is effectively the /sbin/init of the virtual node. Its primary jobs are to fire off /etc/rc to bring up the virtual node and then sit around and wait for a signal to shut down the jail. The startup scripts run by /etc/rc in the jail are scaled-back versions of what would run on a real node. This scaling back reflects the fact that the node has already been partially initialized and also that it usually will not run as many services as a real node. A typical jail runs syslogd, cron and sshd, as well as the Emulab watchdog and optional agents like trafgen or delay_agent. From the perspective of the physical node, each jail has at least 8 processes running: vnodesetup, mkjail.pl and proxy-tmcc outside the jail, as well as injail, syslogd, cron, sshd, and the Emulab watchdog inside the jail. Empirically, it appears that each jail requires 12-16MB of physical memory for its base processes.
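
The final invocation might look something like the sketch below; the -i rendering of the auxiliary IP list is hypothetical, since this page does not show the augmented command-line syntax:

  # Hypothetical syntax: root directory, hostname, primary (control net)
  # IP, the program to run, plus a list of experimental IPs that define
  # which interfaces the jail can see.
  jail -i 10.1.1.2,10.1.2.2 /var/emulab/jails/v1/root \
       v1.myexp.myproj.emulab.net 172.16.79.1 /etc/injail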

Assorted Details

FreeBSD Jails [ lifted from Leigh's jail.html... ]

Jails provide filesystem and network namespace isolation and some degree of superuser privilege restriction. Following is a list of the features we added, and bugs we fixed in FreeBSD jails. All of the new features are optional, controlled by sysctl MIBs and per-jail flags. This new jail implementation is backward compatible with the original implementation, meaning all new features are disabled by default.

1. Allow a jailed process to bind to multiple IP addresses.

The default implementation of jail allows processes inside of a jail to bind to just one IP, the IP that was specified to the jail command. In that implementation, if a process specifies INADDR_ANY, the kernel silently changes it to the jail IP. If, however, there are other interfaces on the node, or if tunnels are being used to construct an overlay for the experiment, it is necessary to allow processes inside the jail to bind to those interfaces. In our modified implementation, when the jail is created, a list of auxiliary IPs can be specified on the command line, telling the kernel to allow processes inside the jail to bind to any of those IPs (including the jail IP). When the bind happens, the kernel checks the jail's list of IPs; this applies to sockets bound for outgoing traffic as well as incoming traffic. Further, the set of accessible IPs determines the list of interfaces that a jail can see, so that, for example, ifconfig inside a jail will only list the interfaces and IPs available to the jail.

2. Allow jails to bind to INADDR_ANY.

The default behavior (and original implementation) of jail maps INADDR_ANY to the jail's main IP address. However, when a jail is allowed to access other IPs, then INADDR_ANY actually means a subset of all the interfaces on the node that the jail is allowed to use (which might also be tunnels). There are two situations in which this matters:

A process is connecting to another address, and has specified its local address as INADDR_ANY (which is typical). Instead of binding the local address of packets to the jail IP, the local address is set to the actual address of the interface that the packet is routed out of. If there are IP aliases on the interface, the list of aliases is searched for a match against one of the allowed prison IPs. If there is a match, the local address is set to that IP. Otherwise the address is set to the main address of the interface (this is not correct; it should be an error). This is to support multiplexing links using IP aliases. If we were to use IP tunnels or some other form of virtual interface, there would be no need to search the list of aliases.

A process is binding a local socket for an incoming connection. In this case, any of the prison IPs can be the local target of the connection, but it is not until the connection is actually made that the address can be checked. This is done in the pcb lookup routine. For each pcb, if the port matches and the local address is INADDR_ANY, and the pcb was created within a jail, then the list of the prison IPs is searched, looking for a match. If no match is found, the pcb is skipped. This behavior improves compatibility with existing server applications which typically specify INADDR_ANY. If the kernel were to continue binding INADDR_ANY sockets to the main IP address of the jail, such applications would only be able to receive packets on the primary jail interface.

3. Allow access to raw sockets.

The jail is allowed to both read and write, but is restricted from accessing the firewall, dummynet, route, and RSVP interfaces. We also ensure that the packet header reflects a source IP address appropriate for the jail: INADDR_ANY is mapped to an appropriate address for the outgoing interface and fixed addresses that are not part of the jail set are rejected. This feature allows ping, traceroute and gated to work in jails.

4. Allow read-only access to BPF devices.

The interface is not put into promiscuous mode, so the jail is not able to see all of the packets on the wire, but only those addressed to the node. However, if the interface is already in promiscuous mode (say, because someone outside the jail is using tcpdump), then the jail will also be able to see any packet that goes by. Even when not in promiscuous mode, a jail will see all packets destined for the interface, whether targeted to a valid jail IP address or not. This could be fixed, and the promiscuous-mode problem avoided, by augmenting the filter given when the bpf device is set up. Allowing BPF access enables use of tcpdump and other packet trace tools within jails.
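
For the common case this is sufficient, e.g. running tcpdump inside a jail without requesting promiscuous mode:

  tcpdump -p -i fxp0      # -p: do not put the interface in promiscuous mode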

5. Restrict the port range to which a jail can bind.

This allows multiple jails on the same node to safely share the port space without stepping on each other in environments where jails cannot be assigned their own IP addresses. Since the ultimate goal is to allow different experiments to coexist in jails on the same node, the port space has to be allocated globally, with the same port range assigned to all jails across an experiment, so as not to conflict with any other experiments. This assignment is done when the experiment is swapped in so that swapped-out experiments are not holding ranges (16 bits of port space does not go very far).

6. Disallow FS unmounts inside a jail unless the mount was done in the jail.

This is a bug fix that prevents a jail from unmounting a filesystem and exposing the underlying mount point to which it likely shouldn't have access.

7. Added per-jail flags to control various existing and new jail features.

These are in addition to sysctls which control the global availability of a given feature. Existing features thus controlled are: access to SYSV IPC facilities, access to routing sockets and ability to turn on and off filesystem quotas. New features controlled are: access to raw sockets, access to read-only BPF and the ability to use INADDR_ANY. Additionally, there is a new global sysctl to allow jails to be configured with multiple IP addresses.

Virtual ethernet devices.

Virtual ethernet devices (veths) are configured with a few parameters: a virtual MAC (VMAC) address, a broadcast domain tag, an associated physical (parent) ethernet interface, and a routing table ID. The VMAC obviously identifies the interface and needs to be unique within a broadcast domain, *not* unique per physical node. Thus, you could have the same VMAC on multiple veths on the same physical node (though we don't do that), and you may have to have distinct VMACs within a set of physical nodes; it depends on the topology. The broadcast domain tag is basically a VLAN tag: it allows us to use the same physical wire for multiple LANs. Unlike VLANs, where you can have up to 4096, we currently use a 16-bit value allowing up to 64K LANs. Again, a tag's uniqueness has nothing to do with physical node boundaries; veths on different physical nodes may need to have the same tag, while veths on the same node may have different values. The parent interface parameter determines which interface the veths send and receive encapsulated packets to and from. All veths with the same parent interface and broadcast domain tag can talk to each other. If such veths are on the same physical node, they talk via a loopback with no encapsulation and without packets going out on the physical interface. Specifying a null parent can be used for a strictly loopback connection. The routing table ID is used with incoming packets to determine which table to use for lookups when forwarding. The route table ID effectively identifies a virtual node: all interfaces associated with a virtual node have the same ID, and every virtual node has its own unique ID. The route table ID is a local-node-only value; different physical nodes can use the same ID for different purposes.

Virtual routing tables.

The virtual routing table code originally came from the FreeBSD multiple routing table work done by Scandariato and Risso:

http://softeng.polito.it/riccardo/docs/paper_eurobsd02.ps.gz

but had to be hacked mightily here to make it reasonably complete and generally functional. The general idea is to have a set of routing tables associated with IPv4. Each is identified by its index, the rtabid, which is used in various places throughout the network subsystem. Network interfaces are associated with a routing table so that incoming packets are tagged with the route table to use when making forwarding decisions. Sockets can similarly be associated so that outgoing packets will use a specific routing table. When applied to routing sockets, this allows routes to be added to specific routing tables. Finally, mbufs themselves are tagged with an rtabid for those contexts in which the originating socket or interface is unavailable.

To this basic framework we added an rtabid to the jail structure so that all sockets created by a jail are tagged with the appropriate rtabid. This ensures that all jail traffic is controlled by a particular routing table. We also made innumerable bug fixes, mostly related to ensuring that the correct rtabid was available in the numerous code paths where routes are created, cloned, looked up, or removed. Many of these places were missed because they were not exercised in the paths the authors used. The ARP code in particular required extensive violence as it didn't work at all (the authors were just using tunnel interfaces, so ARP wasn't a problem).

The code is ifdef'ed in our source tree under MULTI_RTABLES_INET and is pervasive. In part, this was intentional on the part of the authors, as they desired to change the underlying BSD code as little as possible. For example, instead of changing the generic rtalloc routine to accept an extra parameter that would be ignored without the MULTI_RTABLES code, they added a new ifdef'ed routine for handling the extra param. Thus every place that calls rtalloc had to be ifdef'ed to call multi_rtalloc instead. The upside is that it is easy to identify (and isolate) the changes required by multiple tables. But the fact that the changes are so widespread also indicates to me that we are virtualizing in the wrong place. In retrospect, the multiple network stack work done by Zec, which virtualizes the entire network stack, would have been a better approach.

The problem with ARP.

The ARP protocol has proven to be a major pain in the ass, due to a confluence of factors: one is the way BSD implements ARP, another is the way the virtual routing tables work, and a third is how we set up the virtual control net.

In BSD, there is no distinct "ARP table"; instead, ARP entries are just route table entries where the next hop is a link address rather than an IP address. There is some auxiliary info hanging off of the routing entry, however. In particular, when an IP packet is sent and we must first ARP for the next-hop IP, the original "triggering" IP packet is held and associated with the route table entry while the ARP exchange is done. When the ARP reply comes in, that original packet is then sent out on the wire. The point is that a "pending" ARP entry is set up at request time and then, when a reply comes in, a lookup is done to locate that entry so that requests and replies are matched up.

Enter factor two, the multiple routing tables. Since outgoing packets use the rtabid of the socket that sent them (the jail rtabid for us), the pending ARP entry will be in that route table. However, incoming packets, which may need to be further routed if we are an IP forwarder, are assigned the rtabid of the interface on which they arrive. This rtabid could be different from that of packets going out the same interface, and thus an ARP reply could be "tagged" differently than the request that caused it. Note that typically this is not a problem, since interfaces are usually private to a jail and would have the same rtabid as the sockets which produce packets. The problem is with shared interfaces.

Thus we reach the final piece of the puzzle, the virtual control net. This is implemented by assigning each virtual node an IP alias on the control interface of the physical node. That is, we do not use virtual interfaces (veths) for the control net; to do so would require using veths on boss and ops and anything else the vnodes talked to. Anyway, the control net interface is associated with rtabid 0, the main routing table, and is visible in all vnodes. Now we have a case where outgoing packets will be tagged with (and use) their own private routing table, but incoming packets will be tagged with rtabid 0. (As an aside, this means that a vnode cannot forward packets between the control net and its other interfaces.) For ARP specifically, it means that the pending ARP entry will wind up in the vnode's routing table, but when the reply comes in, we will look to match it up with an entry in the main routing table. So a more sophisticated matching is needed. Currently, we do this by looking up the source of the incoming ARP packet in each routing table. The first such table we find that has an outstanding request for that address is matched up with the reply. Note that if multiple vnodes have outstanding ARPs to a machine, we may not match up with the correct one. But that shouldn't matter, as each vnode should get a reply eventually and each reply should have the same info.

The virtual control net.

As mentioned, we implement a "virtual control net" that allows each jail to have its own unique address and port name space. We use the 172.16/12 unroutable range for this, which we do route internally within Emulab. External access to these jails is provided via ssh forwarding on ops (e.g., as encapsulated in the ssh-mime.pl script). The current convention for naming in the 20 available bits of 172.16/12 is:

        12 bits for (up to 4096) physical host id
         8 bits for (up to 256) virtual host id

where we might possibly have to reduce the physical host id by a bit if we want mainbed (say 172.16) and minibed (say 172.24) to exist together. [ Note that this isn't what we do right now; we mistakenly used 172.17 as a second testbed range. But we ignore the existence of a second testbed for now... ]
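
In shell arithmetic, composing an address from the two ids looks like this (physical host 79, virtual host 3):

  PHYS=79 VIRT=3                                 # 12-bit and 8-bit ids
  HOST20=$(( (PHYS << 8) | VIRT ))               # the 20 host bits under 172.16/12
  echo 172.$(( 16 | (HOST20 >> 16) )).$(( (HOST20 >> 8) & 255 )).$(( HOST20 & 255 ))
  # -> 172.16.79.3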

We need 172.16 addresses for the router and for each physical node. The router is 172.16.0.1, as convention would dictate. We could use virtual host id 0 in each net to be the physical host so that, for example, on pc79:

        172.16.79.0     pcvm79-host (i.e., an alias for pc79's control IF)
        172.16.79.1     pcvm79-1
        ...
        172.16.79.254   pcvm79-254

Note that we don't use .255. Even though this is not a real /24 net, we treat it as such for routing internal to the jail (see below).

We set up a number of special routes for a jail, some set up externally, some from inside:

  Destination        Gateway            Flags    Refs      Use  Netif
  default            172.16.0.1         UGSc        2        0   fxp0
  127.0.0.1          127.0.0.1          UH          0        0    lo0
  155.98.36/22       link#5             UCSc        0        0   fxp0
  155.98.36.79       127.0.0.1          UGHS        0        0    lo0
  172.16/12          link#1             UCSc        1        0   fxp0
  172.16.79/24       lo0                USc         0        0    lo0

The "default" route gets out to "the world", which means the testbed servers (boss/ops/tipserv). 172.16.0.1 is an alias on the router for the physical control net. Using a virtual control net address for the router is not necessary for most applications but was added for gated, which checks that next hops are accessible via attached interfaces. Since the control net interface appears internal to the jail as 172.16.x.x/255.255.255.255, this still isn't quite correct, but we use a config file feature of gated to finesse it. We have to apply more finesse when setting up routing, as described below.

The loopback route "127.0.0.1" is not as straightforward as it might appear. Since lo0 is a shared interface, how do we ensure that packets loop back within a jail and are not received by a different jail? The kernel jail code takes care of this by changing the source IP address from the loopback address to the primary jail address. However, since it still uses the shared loopback device, there are a couple of implications. First, any jail can see the traffic with tcpdump. Second, since the interface is not tagged, replies are routed using the primary routing table. Thus there must be a route for reaching 172.16.x.x in that routing table. We ensure this as part of the jail setup process.

The "155.98.36" routes ensure we can reach nodes via their physical control net addresses (e.g., using the canonical "pcXX" names). The first reaches others hosts, the second is a loopback route for the local host. Strictly speaking, this is a violation of the virtualization, but it is a pragmatic one.

The "172.16/12" route is the general virtual control net route. This route might seem redundant given the default route, but it is actually needed to setup the default route. If this route didn't exist first, replies for ARP requests for the gateway would be rejected as "not on local network" since the control net interfaces appears as a /32 net and 172.16.0.1 is technically not reachable via it.

Finally, "172.16.x/24" is a loopback route used to reach the set of vnodes on this pnode. Note that this includes the virtual alias of the physical host (.0). [ Note that in the real testbed the .16 is .17 due to the botched main/mini-bed naming. ]

More about the startup pieces.

vnodesetup hangs around so that you can signal it and easily reboot the vnode. I guess the idea is that it is also jail/vserver independent, as opposed to...

mkjail.pl is jail-specific and hangs around so that it can clean up jail-specific things when the jail exits.

injail is the jail's init process. This is the single point of contact for killing the jail.