kb28b

Emulab FAQ: Using the Testbed: I have multiple gigabytes of data for my experiment; where should I store it?

If your experiment requires multiple gigabytes of data, transferring that data into Emulab will take a significant amount of time. Consider the following questions when deciding where to put it.

Is the content of the data meaningful or is it just the volume?
If you don't care about the actual values of the data, the best approach is to generate "fake" data directly on the node or nodes and not transfer anything in at all. You can use dd, reading from /dev/zero or /dev/urandom, to produce data files on the node-local disks.
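As a minimal sketch of that approach, the script below wraps dd in a small helper. The helper name make_fake_data, the file names, and the 4 MiB demonstration size are all assumptions chosen for illustration; on a real node you would pass a path on the node-local disk and a much larger size.

```shell
#!/bin/sh
# make_fake_data PATH SIZE_MB [SOURCE]
# Hypothetical helper: writes SIZE_MB mebibytes read from SOURCE
# (default /dev/zero) to PATH.
make_fake_data() {
    out="$1"
    size_mb="$2"
    src="${3:-/dev/zero}"
    # The block size is given in bytes (1 MiB) rather than with a
    # suffix, so the same command works with both GNU and BSD dd.
    dd if="$src" of="$out" bs=1048576 count="$size_mb" 2>/dev/null
}

# Demonstration with small files in a scratch directory (assumed paths).
demo_dir=$(mktemp -d)
make_fake_data "$demo_dir/zeros.bin" 4                 # highly compressible
make_fake_data "$demo_dir/random.bin" 4 /dev/urandom   # incompressible
ls -l "$demo_dir"
```

Data from /dev/zero is cheap to produce but compresses to almost nothing, so /dev/urandom is probably the safer source if anything in your experiment compresses or deduplicates what it stores.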

How frequent are accesses to the data, and how important are the latency and throughput of those accesses?
Assuming your data is actually meaningful, you will first need to transfer it into Emulab to make it accessible to experiments. If accesses to the data are infrequent and the access characteristics are unimportant, you can consider storing the data on users.emulab.net in one of the shared filesystems and accessing it via NFS. However, since these filesystems are reached over the control network, which is shared by all experiments, there are no guarantees for throughput or latency.

In general, we strongly discourage this practice since it places additional load on our shared control network infrastructure and reduces the repeatability of your experiments. It is much better to transfer the data to one or more of the experiment nodes and access it on the node-local disks.

How many nodes in your experiment need to access the data?
If you are transferring the data to nodes, there are several ways to do this. If you want the same data set on multiple nodes in the same experiment, it makes sense to take advantage of our multicast data distribution mechanism, Frisbee. If all your nodes are running the same OS image, the easiest way to use Frisbee is to create a custom disk image containing your data. You will need to format the extra partition on the boot disk of a node, copy the data onto it, and then create a "whole disk" image from the node. Note, however, that there is a limit (currently 20GB) on how large a saved image can be.

If, however, you need multiple OS images within the experiment, it would be wasteful to embed your data in multiple custom images. If you need the data on all nodes in the experiment, regardless of the OS they run, you can use the Emulab NS extension tb-set-tarfiles, which uses Frisbee to transfer data after the nodes have been loaded with their OS.
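The tarfile-based commands need the data set packaged as a tarfile first. The following is only a sketch of that packaging step: pack_dataset is a hypothetical helper, and the scratch directory and file names stand in for your real data directory and for a destination in your project filesystem space.

```shell
#!/bin/sh
# pack_dataset SRC_DIR TARBALL
# Hypothetical helper: archives the contents of SRC_DIR into a
# gzip-compressed TARBALL.
pack_dataset() {
    src_dir="$1"
    tarball="$2"
    # -C switches into SRC_DIR first, so the archive stores relative
    # paths and can be unpacked under any directory on the node.
    tar -czf "$tarball" -C "$src_dir" .
}

# Demonstration in a scratch directory (assumed paths and file names).
demo=$(mktemp -d)
mkdir -p "$demo/dataset"
echo "sample record" > "$demo/dataset/part-0000.txt"
pack_dataset "$demo/dataset" "$demo/dataset.tar.gz"

# List the archive contents to confirm what would be installed.
tar -tzf "$demo/dataset.tar.gz"
```

The tarball is written outside the directory being archived, which avoids tar trying to include its own output.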

Finally, if you only need the data set on a single node in your experiment, Frisbee is not the best answer as the protocol is UDP-based and not congestion controlled. Using a standard transfer mechanism would be nearly as efficient and much more friendly to the shared control network. In this case you could use the Emulab tb-set-node-tarfiles command to specify the node to transfer data to. This will use HTTP to transfer the data from your project filesystem space on the file server to the indicated node. You could also use tb-set-node-startcmd to run a script at experiment startup, and have that script use scp to transfer the data needed.
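A startup script for the scp approach might look roughly like the sketch below. The helper name fetch_dataset, the marker file, and the example source and destination are assumptions, and the script presumes that scp can authenticate without a password prompt (for example, with an SSH key), since nothing is there to type one at startup.

```shell
#!/bin/sh
# fetch_dataset SRC DEST_DIR
# Hypothetical helper: copies SRC (for example user@host:/path/to/data)
# into DEST_DIR on the node-local disk using scp.
fetch_dataset() {
    src="$1"
    dest="$2"
    marker="$dest/.fetch-complete"

    # If an earlier run already finished the copy, do nothing, so a
    # script that runs more than once does not repeat the transfer.
    if [ -f "$marker" ]; then
        return 0
    fi

    mkdir -p "$dest" || return 1
    # -r copies directories recursively; -q suppresses the progress meter.
    scp -q -r "$src" "$dest/" || return 1

    # Record success only after scp has exited cleanly.
    touch "$marker"
}
```

A script containing this function would end with a call such as `fetch_dataset someuser@somehost:/path/to/dataset /local/data`, where both arguments are placeholders for your own source and a directory on the node-local disk.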

Isn't there any reasonable shared-storage option?
The newly developed Emulab storage model allows for access to large per-project persistent data sets stored on a SAN. The current prototype allows for SAN-based storage, both persistent and ephemeral (for a single experiment swapin only).