The purpose of this document is to share our experience in selecting the hardware in Emulab, to aid others who are considering similar projects. Rather than a recipe for building testbeds, this document gives a set of recommendations, and outlines the consequences of each choice.
Before you start though, you should read this OSDI paper.
The Emulab software can be found on our software page.
The installation instructions are very helpful in determining how much effort is involved. Note that installing Emulab is a fairly invovled process; it might be more work than would be worth it for a small system.
NICs:: At least 2 - one for control net, one for the experimental network. Our attempts to use nodes with only 1 NIC have not met with much success. 3 interfaces lets you have delay nodes, but only linear topologies (no branching) unless you "multiplex" the physical links. 4 lets you have (really simple) routers. We opted to go with 5. The control net interface should have PXE capability and need only be 100Mb. All experimental interfaces should be the same, unless you are purposely going for a heterogenous environment. We use Intel cards (Pro100, Pro1000) almost exclusively, but others have also used Broadcom, 3Com and D-link interfaces. The control net interface can be different. For the control net we have used Intel Pro100 (fxp) and Pro1000 (em) cards, as well as builtin Broadcom (bge), 3Com (xl) and RealTek (re) devices. Note that the older the card/motherboard, the more likely it is that either it will not have PXE or that the PXE implementation will have bugs, so it's best to get a sample first and try it out before committing to a card. Newer (say 2003 or newer) cards will most likely have a correctly functioning PXE. Depending on usage, it may be OK to get a large number of nodes with 2 interfaces to be edge nodes, and a smaller number with more interfaces to be routers.
Case/Chassis:: This will depend on your space requirements. The cheapest option is to buy standard desktop machines, but these are not very space-efficient. 3U or 4U rackmount cases generally have plenty of space for PCI cards and CPU cooling fans, but still may consume too much space for a large-scale testbed. Smaller cases (1U or 2U) have fewer PCI slots (usually 2 for 2U cases, and 1 or 2 in 1U cases), and often require custom motherboards and/or "low profile" PCI cards. Heat is an issue in smaller cases, as they may not have room for CPU fans. This limits the processor speed that can be used. Smaller cases also make for denser wiring. See the back of one of our pc850 racks for example. Denser cabling can affect air flow exacerbating heat issues. For our first round of machines ("pc600"), we bought standard motherboards and 4U cases, and assembled the machines ourselves. For our second round of PCs ("pc850"), we opted for the (now discontinued) Intel ISP1100 server platform, which includes a 1U case, custom motherboard with 2 onboard NICs and serial console redirection, and a power supply. For the third round of PCs ("pc3000") we got Dell 2850s which sport 2U cases and a custom motherboard with 2 builtin NICs and serial console redirection.
CPU:: Take your pick. Note that small case size (ie. 1U) may limit your options, due to size and heat issues. Many experiments will not be CPU bound, but some uses (ie. cluster computing experiments) will appreciate fast CPUs. Our latest round of machines have 3Ghz processors.
Memory:: While many applications don't need much, memory is still cheap and you should get at least 512MB, and preferably 1-2GB. We chose to go with ECC, since it is not much more expensive than non-ECC, and with our large scale, ECC will help protect against failure.
Motherboard:: Serial console redirection is nice, and the BIOS should have the ability to PXE boot from a card or a builtin interface. Health monitoring hardware (temperature, fans, etc.) is good too, but not required. If you are planning to have 1 or more 1000Mb interfaces you should be looking for PCI-X or PCI Express buses. Onboard network interfaces can allow you get get more NICs, something especially valuable for small cases with a limited number of expansion slots.
Hard Drive:: Pretty much any one is OK. With a large number of nodes, you are likely to run into failures, so they should be reasonably reliable. As for size, whatever the current market "sweet spot" is should be more than adequate. Our standard images are at most 6GB. Consider SCSI if you have money and need speed. Consider a second drive if you have money and case room, and need a whole lot of space for logging, etc.
Floppy:: Handy for BIOS updates, diagnostics and possibly for older OS installs. You definitely need one if you don't have a CD-ROM drive.
CD-ROM:: Recommended if you might grow your testbed large (more than a couple hundred machines) some day. For reliable scaling we may someday migrate away from PXE-booting to CD-based booting, as we already do with our widearea and wireless nodes.
USB:: We currently do not use any USB devices, though booting from a write-protected USB flash drive is another alternative to booting via PXE.
VGA:: Only needed if motherboard does not have serial console redirection or if you want to run Windows. Cheap is fine, cheap and builtin is better.
Number of ports:: You'll need one port for each testbed node, plus ports for control nodes, power controllers, etc. 100Mb ports are fine.
VLANs:: Allows partitioning of the control network, which protects control hw (switches, power, etc.) from nodes and world, and private machines. Without them, control hardware can be attached to an additional NIC on the boss node (requires extra cheap switch.)
Router:: Required if VLANs are used. Must have DHCP/bootp forwarding. (Cisco calls this the 'IP Helper') A MSFC card in a Cisco Catalyst supervisor module works well for us, but a PC would probably suffice.
Firewall:: Can be the router, without it VLAN security is pretty much useless, meaning that it may be possible for unauthorized people to reboot nodes and configure experimental network switches.
Port MAC security:: Helps prevent nodes from impersonating each other and control nodes. This is an unlikely attack, since it must be attempted by an experimenter or someone who has compromised an experimental node already.
Multicast:: Support for multicast in some form is required (as we use it for reloading disks). Many cheaper switches simply deal with multicast by broadcasting instead: this will work, but will cause lots of unnecessary packets to be sent on the control net. We highly recommend a switch that supports IGMP snooping, which allows multicast packets to be limited to the set of ports that have subscribed to the group.
Vendor:: Our software currently supports: Cisco Catalyst 65xx, 40xx, 29xx, 35xx, and 55xx series, Nortel 1100 and 5510, Foundry 1500 and 9604, HP Procurve 5400, 3500 and 2800, and Intel 510T switches (although the Intels don't support everything you'd want for a large-scale production testbed). Other switches by the same vendors will likely be easy to support - in particular, we expect that Cisco switches supporting the 'CISCO-VTP-MIB', and the 'CISCO-STACK-MIB' or 'CISCO-VLAN-MEMBERSHIP-MIB' are likely to 'just work'. (You can see a list of which MIBs are supported by which devices at Cisco's MIB page.) If you will be using multiple Ciscos for the experimental net, and trunking between them, your switch should also support the 'CISCO-PAGP-MIB', though this is not required. Switches from other vendors can theoretically be used if they support SNMP for management, but will likely require significant work. We have found that you can often find good deals on new or used equipment on EBay. For used Cisco equipment, we buy from Atantica.
Number of ports:: You'll need as many ports as you have experimental interfaces on your nodes. In addition, you need 1 port to connect the experimental net switch to the control net (for SNMP configuration.) If you're using multiple switches, you need sufficient ports to 'stack' them together - If your switches are 100Mbit, Gigabit ports are useful for this.
VLAN support:: Optimally, configurable via SNMP (we have no tools to configure it otherwise.) If not available, all experimental isolation is lost, delay nodes can't be used well, and method for coming up with globally unique IP addresses will be required. So, VLANs are basically necessary.
Trunking:: If multiple switches, should have a way to trunk them (or experiments are limited to 1 switch.) Ideally, all trunked switches should support VLAN management (like Cisco's VTP) and a VLAN trunking protocol like 802.1q . It's good if the trunk links are at least an order of magnitude faster than node links, and link aggregation (ie. EtherChannel) is desirable.
What we are buying lately
Lately, we have been buying the HP Procurve 5406zl and 5412zl switches, with a J8726A supervisor card, and 24x1Gb J8702A line cards. These have worked very well and are competitively priced.
Two servers are preferable, though one could be made to work if you are willing to invest some time and effort. The NFS server needs enough space for /users and /proj , as well as a few extra gigs for build trees, logs, and images. If you have more than 128 nodes, and plan to use the RocketPort serial ports, you need one "tip server" machine per 128 serial lines (other serial muxes may have similar limitations.) Tip servers do not need to be very powerful, we use old desktop machines that have full-height PCI slots. A 1000Mb network connection is suggested for the disk image distribution machine (usually boss.) though you can get by with 100Mb. The database machine (boss) should have reasonably fast CPU and plenty of RAM.
Network cables:: We use Cat5E, chosen because they are not much more expensive than Cat5, and can be used for gigabit Ethernet. It has been our experience that 'boots' on cables do more harm than good. The main problems are that they make it difficult to disconnect the cables once connected, and that they get in the way on densely-connected switches. Cables with 'molded strain relief' are better than cables with boots, but are often much more extensive. We buy cables in two-foot increments, which keeps slack low without making the order too complicated. Our "standard" so far has been to make control net cables red, experimental net cables yellow, serial cables white, and cables for control hardware (such as power controllers) green. We've bought all of our cables from dataaccessories.com, and have had excellent luck with them. With prices of $3.00 to $4.25 for premade cables, it's not too expensive to by them rather than make your own, and it's much easier.
Serial cables:: We use Cat5E, but with a special pin pattern on the ends to avoid interference between the transmit/receive pairs. We use RJ-45 connectors on both ends, and a custom serial hood to connect to the DB-9 serial ports on the nodes. Unfortunately, the custom pinout is different for different serial mux manufacturers. Contact us to get our custom cable specs for RocketPort serial mux.
Power controllers:: Without them, nodes can not be reliably rebooted. We started out with 8-port SNMP-controlled power controllers from APC. We then switched to using the RPC-27 from BayTech, 20-outlet vertically-mounted, serial-controlled power controllers. Currently, we use the BayTech rackmount RPC14 series. The serial controllers are generally cheaper, and the more ports on each controller, the cheaper. Take note of the power requirements of your machines before buying power controllers! While we can support 20 old 850Mhz machines on a single 30A/120V power controller, we can only support 7 Dell 2850s per 30A/120V controller. So we went with 8-port controllers for the latter. DETER uses APC 7900 series SNMP-controlled power controllers.
Serial (console) ports:: Custom kernels/OSes (specifically, the OSKit) may not support ssh, etc. Also useful if an experimenter somehow scrogs the network. We currently use the Comtrol RocketPort series of serial muxes. In particular, we use the 32-port !RocketPort Universal PCI interfaces (pn 99356-8) and 1U rackmount panels (pn 99380-3). You can only have 4 cards (or 128 ports) per machine, so if you have more than 128 nodes, you will need multiple machines to host the interfaces. Note also that the 32-port RocketPort is a full-height PCI card, so make sure you have a machine with full-height (not low-profile) slots. (In the past we have used the Cyclades Cyclom Ze serial adapters, which allow up to 128 serial ports in a single PC. But those have been discontinued and they also require a 5V PCI slot). Berkeley's half of the DETER testbed have been using Digi's standalone Etherlite32 http://www.digi.com/products/serialservers/etherlite.jsp modules, which require a devoted linux machine, but multiple boxes can be supported by the same server. Only the "capture" process need be compiled for linux.
IPMI and ASF support:: Most server motherboards, and some desktops, support some form of remote management including remote reset and even "Serial Over LAN." To date, we have only experimented with ASF for power cycling on a particular Dell Optiplex box with a Broadcom controller. Getting this working in a non- Windows environment required a Dell-mediated agreement with Broadcom to obtain a Linux configuration client and, in at least this instantiation, the ASF controller can be disabled (confused?) by the host OS driver. So this has not been a cost-effective solution thus far.
"Whack on LAN" power controllers:: One of our students developed a hardware modification to the Wake-on-LAN mechanism to allow us to reset machines via the network. Seethis abstract for a little info.