The Emulab Event System

Emulab uses a publish/subscribe mechanism to implement its event distribution system. This document attempts to fill in some of the holes in other documents. For an overview of the event system, see:

User documentation in the Wiki.
Event system API.
Chapter 7 (Dynamic Experiment Control) of An Evaluation of Emulab Software and Its Evolution for the National Cyber Range
Design notes for "clusterd"; our plan for, and implementation of, a hierarchical event system.

Major components:

The global event router(s). One instance of the pubsub router (pubsubd) runs on "boss" and one on "ops". The former is used for infrastructure events, the latter for user (experiment) events.

The per-node event router. An instance of pubsubd runs on every physical node. Its purpose is to reduce the number of connections to the ops pubsubd (via the event proxy, described below) from a node.

The event scheduler. There is one per experiment, responsible for queueing and sequencing experiment events. When the time comes to fire an event, the scheduler sends it to the ops pubsubd for distribution. The scheduler runs on "ops".

The event proxy. Not an essential component; it acts as glue between the ops pubsubd and the per-node pubsubd. It subscribes to all events from the local pubsubd and forwards subscriptions on. It receives all messages from the ops pubsubd and pushes them down.

The event API library. Includes language bindings for C, C++, Perl and Python. Used by the various event agents.

Experiment event agents. These are the applications that run on nodes (physical or virtual or ops) and perform functions on behalf of experiment users. These are detailed below.

Infrastructure event agents. Agents that run on boss and perform functions on behalf of the Emulab infrastructure. There are not many pure agents, but there are a number of infrastructure services that generate events to signal activities.

The experiment NS file. Events can be statically scheduled from an experiment's NS file via "at" commands. In addition to single events, there are event groups and event sequences described in the online documentation: http://users.emulab.net/trac/emulab/wiki/eventsystem.

The event command line tool, tevc ((t)estbed (ev)ent (c)lient). Can be used to dynamically generate events to specific agents.

Event semantics:

Events are authenticated and integrity-protected pubsub notifications sent via TCP. Authentication and protection are provided by an HMAC: an opaque attribute added to each pubsub notification by the event system. The HMAC is a SHA1 hash of the complete (other than the hash itself) contents of the pubsub message computed using a per-experiment key stored in a project's shared NFS space.
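The HMAC scheme described above can be sketched as follows. This is only an illustration: the canonical "name=value" serialization, the function names, and the example key are assumptions made for this sketch; the real event library hashes its own encoding of the pubsub message.

```python
import hashlib
import hmac


def sign_notification(key: bytes, attributes: dict) -> str:
    """Return a hex HMAC-SHA1 over every attribute except the HMAC itself."""
    # Hypothetical canonical form: sorted "name=value" lines.
    # The actual wire encoding used by the event library is not shown here.
    payload = "\n".join(
        f"{name}={value}"
        for name, value in sorted(attributes.items())
        if name != "HMAC"
    ).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha1).hexdigest()


def verify_notification(key: bytes, attributes: dict) -> bool:
    """Recompute the HMAC and compare it in constant time."""
    expected = sign_notification(key, attributes)
    return hmac.compare_digest(expected, attributes.get("HMAC", ""))


# Example: a receiver holding the same per-experiment key accepts the
# event, and rejects it once any attribute has been altered.
key = b"example-per-experiment-key"  # hypothetical key material
event = {"EXPT": "myproj/myexp", "OBJNAME": "cbr0", "EVENTTYPE": "START"}
event["HMAC"] = sign_notification(key, event)
accepted = verify_notification(key, event)
event["OBJNAME"] = "cbr1"
rejected = not verify_notification(key, event)
```

Because the key is per-experiment, a node that can read one experiment's key cannot forge events for another experiment.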

All events have a set of Emulab-related attributes that will always be present, and identify common "targets" of events:

SITE: name of an Emulab site? Not currently used.
HOST: DNS name of an Emulab host
EXPT: name of an experiment
OBJTYPE: an Emulab event object type
OBJNAME: an Emulab event object name
EVENTTYPE: an Emulab event type
GROUP: name of an event group
TIMELINE: name of an event timeline

If not explicitly set by the creator of an event, they will be set to a "match any" value. Note that these are merely "convenience" attributes that many Emulab event clients expect and use by convention. The event library does not itself use them.
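A minimal sketch of how clients might use these attributes by convention. The class name, the "*" spelling of the "match any" value, and the rule that a wildcard on either side matches are assumptions for illustration, not the event library's actual definitions.

```python
from dataclasses import dataclass, fields

ANY = "*"  # assumed spelling of the "match any" value


@dataclass(frozen=True)
class AddressTuple:
    """The fixed attribute set; unset fields default to "match any"."""
    site: str = ANY
    host: str = ANY
    expt: str = ANY
    objtype: str = ANY
    objname: str = ANY
    eventtype: str = ANY
    group: str = ANY
    timeline: str = ANY


def matches(subscription: AddressTuple, event: AddressTuple) -> bool:
    """True if every field agrees or either side is the wildcard."""
    for f in fields(AddressTuple):
        want = getattr(subscription, f.name)
        got = getattr(event, f.name)
        if ANY not in (want, got) and want != got:
            return False
    return True


# A subscriber interested in all PROGRAM events of one experiment.
sub = AddressTuple(expt="myproj/myexp", objtype="PROGRAM")
ev = AddressTuple(expt="myproj/myexp", objtype="PROGRAM",
                  objname="prog0", eventtype="START")
delivered = matches(sub, ev)
```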

All events are part of a timeline. A timeline dictates the (relative to the start of the timeline) time at which each event fires. There is a default timeline, but there can also be other explicitly named timelines.

Events may also be part of a sequence. If two events are in a sequence, it is guaranteed that the second will not fire until the first "completes" where completion is defined by the receipt of a completion event (see below). Note that this differs from a timeline, in that timelines only sequence the *start* of an event, and sequences cannot be used to schedule events with greater precision than "after the previous event."
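The difference between the two can be illustrated with a toy model. The event names and durations below are hypothetical, and the model ignores delivery latency; it only shows that a timeline fixes start times while a sequence chains each start to the previous completion.

```python
def timeline_starts(events):
    """events: list of (name, offset, duration).

    On a timeline each event starts at its own offset, regardless of
    whether earlier events have finished.
    """
    return {name: offset for name, offset, _duration in events}


def sequence_starts(events, start=0.0):
    """events: list of (name, duration), in sequence order.

    In a sequence each event starts only once the previous event's
    completion has been received.
    """
    starts = {}
    now = start
    for name, duration in events:
        starts[name] = now
        now += duration  # the completion event arrives after `duration`
    return starts


# "a" runs for 5 seconds. On a timeline, "b" still starts at t=2;
# in a sequence, "b" cannot start before t=5.
on_timeline = timeline_starts([("a", 0.0, 5.0), ("b", 2.0, 1.0)])
in_sequence = sequence_starts([("a", 5.0), ("b", 1.0)])
```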

Events may be part of an event group. Triggering an event group causes all events in the group to be triggered. Event groups are typically used as a programmer convenience to, for example, start the same event on all nodes in an experiment at the same time without needing to individually fire each event. Currently event groups are implemented in the event agent, which does fire each event individually; i.e., event groups do not currently map to any sort of underlying multicast mechanism that might optimize transmission.
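The per-member fan-out described above amounts to something like the following sketch. The function name, the group table, and the member names are hypothetical; this is not the actual implementation.

```python
def expand_group(groups, target, eventtype):
    """Return one (objname, eventtype) pair per group member.

    A target that is not a group name passes through as a single event.
    """
    members = groups.get(target, [target])
    return [(member, eventtype) for member in members]


groups = {"allprogs": ["prog0", "prog1", "prog2"]}  # hypothetical group
fired = expand_group(groups, "allprogs", "START")
```

Each pair in the result corresponds to one individually sent event, which is why the cost of triggering a group grows with the number of members.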

Events can have associated "completion events" which the agent handling a particular event type will generate when the action associated with the event has finished. This is used to ensure synchronous execution of certain actions. Not all events have associated completions (or rather, the agents responsible for enacting certain events don't send them) and this can lead to confusion; e.g., when tevc is invoked with "-w" to wait for an event that does not have a completion.
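One way a caller could guard against a completion that never arrives is to bound the wait. The sketch below is not how tevc works; the queue of completion tokens, the token values, and the function name are all assumptions made for illustration.

```python
import queue
import time


def wait_for_completion(completions, token, timeout):
    """Wait up to `timeout` seconds for `token` to appear on the queue.

    Returns True if the matching completion arrived, False on timeout.
    Completions for other events are skipped.
    """
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        try:
            seen = completions.get(timeout=remaining)
        except queue.Empty:
            return False
        if seen == token:
            return True


# An agent that sends completions: the wait returns promptly.
q = queue.Queue()
q.put("other-event")
q.put("my-event")
got_it = wait_for_completion(q, "my-event", timeout=0.5)

# An agent that never sends one: the wait gives up instead of hanging.
timed_out = not wait_for_completion(queue.Queue(), "my-event", timeout=0.05)
```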

All submitted events are received by the event_scheduler...

Event API

The event system API wraps the pubsub API and is documented here. The additional semantics added by the event system (mostly described above) can be summarized as:

Adds an HMAC attribute to notifications for authentication and integrity-protection.
Fixes the set of attributes that subscriptions can match on. The fixed set is: SITE, EXPT, GROUP, HOST, OBJTYPE, OBJNAME, EVENTTYPE, TIMELINE. That is, the "tuple" argument to event_subscribe is a struct which contains fields for each of these.
Adds the notion of a "scheduled" event.
Adds the notion of a "completion" event.
Language bindings for Perl and Python.

Event limits:

Pubsub messages can contain at most 4076 bytes of data. This is a hardcoded 4096 - 20: PUBSUB_MAX_PACKET (in network.h) minus the basic packet header size (pheader_t in network.h). While this is not a wire format limitation (the packet header has an explicit length field), both the client and server sides would need to be recompiled to support a larger fixed length (or arbitrary length). Thus there are potential compatibility issues associated with changing this constant. Moreover, a number of functions in clientapi.c also allocate stack variables of type max_packet_t which are this size, so raising this constant too high might blow out small thread stacks.

A pubsub subscription expression can contain at most 4072 bytes of data. This is the 4076 bytes for a max pubsub packet minus another four bytes for one additional network header field.

A pubsub notification can contain at most 2048 bytes of data. This is based on the constant PUBSUB_MAX_NOTIFICATION in the public header pubsub.h. This value should not be set higher than PUBSUB_MAX_PACKET - 28 bytes (the code does not check this, either at compile time or run time). Notifications handed out by pubsub_notification_alloc (and _clone) are a fixed PUBSUB_MAX_NOTIFICATION + 8 bytes, so increasing the size of this constant could considerably raise the amount of memory used by a pubsub application.

Pubsubd can support at most 1024 simultaneous clients. This is just a compiled-in constant and could be changed (to any other arbitrary fixed value) pretty easily. The value might also be limited by OS constants such as the number of open file descriptors or the size of a poll/select fd mask.
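The packet-size arithmetic above can be checked with a few lines. Only PUBSUB_MAX_PACKET and PUBSUB_MAX_NOTIFICATION are names from the pubsub headers; the other constant names are made up here to label the numbers quoted in the text. The final assertion is the sanity check that, per the text, the code itself does not perform.

```python
PUBSUB_MAX_PACKET = 4096         # network.h
PUBSUB_MAX_NOTIFICATION = 2048   # pubsub.h

# Hypothetical names for the overheads quoted in the text.
PACKET_HEADER_SIZE = 20          # size of pheader_t
SUBSCRIPTION_EXTRA_HEADER = 4    # one additional network header field
NOTIFICATION_OVERHEAD = 28       # ceiling is PUBSUB_MAX_PACKET - 28
NOTIFICATION_ALLOC_EXTRA = 8     # added by pubsub_notification_alloc

max_message_data = PUBSUB_MAX_PACKET - PACKET_HEADER_SIZE              # 4076
max_subscription = max_message_data - SUBSCRIPTION_EXTRA_HEADER        # 4072
max_notification_ceiling = PUBSUB_MAX_PACKET - NOTIFICATION_OVERHEAD   # 4068
notification_alloc_size = PUBSUB_MAX_NOTIFICATION + NOTIFICATION_ALLOC_EXTRA  # 2056

# The check the pubsub code does not make at compile time or run time.
assert PUBSUB_MAX_NOTIFICATION <= max_notification_ceiling
```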

Experiment event agents:

The event library has C, C++, Perl and Python bindings.

Though the event scheduler (event-sched) is mostly about injecting experiment events at the appropriate time, it also doubles as an agent for various events. As a "simulator agent", event-sched can perform experiment operations such as modify, swapout or terminate as well as experiment log functions such as send, snapshot or reset. See the list of supported events below for details. As a "node agent" it can reboot, create images from, or install images on, groups of nodes within an experiment.

sched, delay-agent, link-agent, program-agent, trafgen, linktest.

Infrastructure event agents:

stated.

Defined events:

The set of supported events is currently static. Adding new event classes or events requires modification of a number of pieces of Emulab state:

The event scheduler must be modified if the new event sends a completion.

(XXX which are...).

List of supported events.

event-sched (as simulator agent):

TBDB_EVENTTYPE_MODIFY::

Performs an experiment modify by invoking the XMLRPC interface. When invoked with mode=stabilize, it uses feedback data to remap virtual nodes (this is currently the only mode supported). Generates a completion event.

TBDB_EVENTTYPE_SWAPOUT, TBDB_EVENTTYPE_HALT::

Swap out or terminate the experiment.

TBDB_EVENTTYPE_DEBUG::

Prints a debug message to the event-sched log.

TBDB_EVENTTYPE_MESSAGE::

Prints a message to the event-sched log and adds it (no timestamp) to the experiment report.

TBDB_EVENTTYPE_LOG::

Adds a timestamped message to the experiment log. Generates a completion event.

TBDB_EVENTTYPE_REPORT::

Creates a report based on logs from all nodes and mails it to the invoking user. Generates a completion event.

TBDB_EVENTTYPE_RESET::

Reset all node logs via loghole. Generates a completion event.

TBDB_EVENTTYPE_SNAPSHOT::

Take a snapshot of the current state of all node logs via loghole. Generates a completion event.

TBDB_EVENTTYPE_STOPRUN::

Stops a template run.

event-sched (as node agent):

TBDB_EVENTTYPE_REBOOT::

Reboot a set of experiment nodes.

TBDB_EVENTTYPE_RELOAD::

Reload the current, or load a new, disk image on a set of nodes. Via the XMLRPC interface, causes the nodes to go through the Emulab disk loading process.

TBDB_EVENTTYPE_SNAPSHOT::

Take a snapshot of (i.e., create a custom image from) a set of nodes. Via XMLRPC, causes the nodes to go through the Emulab disk image creation process.

TBDB_EVENTTYPE_SETDEST, TBDB_EVENTTYPE_MODIFY::

Not sure what these are intended to do. Possibly not implemented?

pcapper (link monitoring agent):

TBDB_EVENTTYPE_START, TBDB_EVENTTYPE_STOP::

Start or stop tracing on a link.

TBDB_EVENTTYPE_KILL::

Terminate the link tracing agent.

TBDB_EVENTTYPE_RELOAD, TBDB_EVENTTYPE_SNAPSHOT::

Both the same at the moment: save off the current trace output file and start a new output file.