Clusterd

An overview of the current clusterd.

0. Definitions

"upstream"::

in the direction of a clusterd's server

"downstream"::

in the direction of a clusterd's clients

"subscription"::

a message requesting any event notification that matches the provided subscription info (a Boolean expression)

"notification"::

a message notifying that an event has occurred

"completion"::

an Emulab event system message implemented as a distinct notification type; sent upstream in response to particular downstream notifications.

1. NCR use

We have a range with O(1,000) physical nodes and up to O(10,000) virtual nodes. So let's assume we have multiple experiments (testbeds + tests) running, consuming those 10,000 virtual nodes. Conceivably, those nodes will be sporting multiple event-based agents each. It is also possible that each of those agents will subscribe to multiple events. Let's figure that there could be 100,000 active subscriptions in the system at once.

What about the events themselves (notifications)? What is the purpose of events in the NCR? Traditionally, we use them for control actions from the infrastructure to enact user-specified behaviors. In other words, they mostly flow from "ops" to the nodes. However, there are optional (and common) completion events, which are sent upward. Events are not typically used between nodes. In the NCR, we are assuming a similar usage pattern, only bigger.

Aside: are there additional infrastructure uses of events in the NCR? You could postulate the instrumentation subsystem using them, but I think events are too heavy-weight and general for this. Instrumentation events would probably be much more frequent, especially if they were used to report data. At the very least, there would probably be a separate "instrumentation plane" using a distinct event system instantiation. Let's limit ourselves to the traditional control plane.

What about event frequency? From the user perspective, individual events are probably sent on the order of once per second, or at most ten times per second. For argument, we will take those 100,000 subscriptions and attempt to send to each 10 times a second, so we could be talking 1,000,000 events per second in the system.

What about event bandwidth? Well, if each of those million events were the maximum 4KB, we would need around 4GB/sec (32Gb/sec) of bandwidth. In reality, events are probably only going to be hundreds of bytes at most, so we are likely talking 1Gb/sec or less of aggregate bandwidth. Unfortunately, if all these events come from ops, even a hierarchical arrangement would require this much BW from ops, but this is still less of a concern than meeting the 1M messages per second.
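
As a sanity check on these numbers, here is a quick back-of-envelope calculation written as a throwaway C program; the 128-byte "typical" event size is an assumed stand-in for "hundreds of bytes at most":

    #include <stdio.h>

    int main(void)
    {
        double subscriptions  = 100000.0;   /* active subscriptions (Section 1)    */
        double rate           = 10.0;       /* events per subscription per second  */
        double events_per_sec = subscriptions * rate;            /* 1,000,000      */

        double max_bytes      = 4096.0;     /* maximum event size (4KB)            */
        double typical_bytes  = 128.0;      /* assumed "hundreds of bytes at most" */

        printf("events/sec:    %.0f\n", events_per_sec);
        printf("worst-case BW: %.1f Gb/sec\n", events_per_sec * max_bytes * 8.0 / 1e9);
        printf("typical BW:    %.2f Gb/sec\n", events_per_sec * typical_bytes * 8.0 / 1e9);
        return 0;
    }

This prints roughly 32.8 Gb/sec for the worst case and about 1 Gb/sec for the typical case, matching the estimates above.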

But this is all still assuming a centralized event system, albeit one that is hierarchically arranged. In the NCR, one would assume that each testbed will have an independent event control framework, and the typical scenario would be to have 10-100 simultaneous testbeds driving those 1M messages. So in reality, the requirements for an instance of the event system would be 1-2 orders of magnitude smaller. There might still be "range level" events, but the bottleneck there would probably be the security gateway between the nodes in a testbed and the central range infrastructure.

And there is always the case of the single-testbed-using-all-range-resources, so we need a plausible story and implementation here.

2. Assumptions

Assumption #0: We are not intending pubsub to be a general group-communication mechanism. By this I just mean that we did not design pubsub to be a general-purpose, many-to-many, highly-optimized, semantically-rich, self-organizing, yadayada... communication mechanism. pubsub has a fairly specific purpose which guided its design.

Assumption #1: The scaling bottleneck for the event system is related to the number of messages (i.e., latency caused by subscription matching and message transmission) and number of clients (i.e., connection overheads), and not the volume of message data (i.e., link bandwidth) between the server and clients.

Assumption #2: The majority of Emulab event traffic is directed "downstream." In other words, events are mostly triggered from the infrastructure, likely acting on behalf of a user. Client nodes mostly request notification of such events and don't generate many events themselves. Implicit in this is that the volume of notification data is larger than the volume of subscription data; i.e., notifications are bigger and/or more frequent than subscriptions. [Hmm...completion events may affect this...]

Assumption #3: All Emulab events should be seen by the infrastructure. In other words, all notifications should be passed up to the top level. One reason for this is so that we can schedule events for the future and maintain a time-ordered list of such events. The other reason is that the majority of events are of interest to the top-level (i.e., the infrastructure or the user). Implicit in this assumption is that clients will not heavily use the event system for local (node-to-node) communication. (An experimenter wanting to send high-volume events between nodes in an experiment could use a local event agent hierarchy.)

Assumption #4: Multiple instantiations of the event system will be independent. Any given event agent/clusterd/pubsubd will only be a part of one event system. Multiple instantiations of the system don't need to be aware of each other; they just should not conflict, e.g., they cannot all listen on the same fixed port.

3. The way it works

A hierarchy of clusterds will be instantiated in the testbed, reducing the degree of fan-out for any one clusterd and thus addressing the scaling issue in Assumption #1.

A clusterd subscribes to all events upstream. This means every clusterd in general receives many more notifications from upstream than its clients care about, which results in more traffic from upstream. It also means less CPU load upstream, since the upstream clusterds don't need to match notifications against subscriptions. It also means that we could multicast notifications from upstream, since all downstream clusterds get all notifications. Alternatives to this approach:

  • Just forward individual subscriptions from clients as they arrive. This trades less upstream traffic for greater upstream message latency. The only advantage of this approach over having no clusterd is a decreased number of connections at the master.
  • Somehow combine individual subscriptions into aggregates. This would most likely mean a special-case clusterd with knowledge of Emulab event messages. Even then it is not clear that we can aggregate in a useful way. Most obviously, we would try to limit subscriptions to only those affecting our client "nodes," where nodes are identified by Emulab name or IP, or more likely by client experiments. The problem is that we need some sort of qualifier that is present in ALL subscriptions from our clients. But again, the more complex the upstream subscription becomes, the more latency is added upstream for processing. Plus we would be constantly changing the set of upstream subscriptions.

When a clusterd client subscribes to an event, clusterd adds the subscription to a list of subscriptions for that client. Whenever a notification comes from upstream, it is compared to all subscriptions for all clients and sent to those that match.
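
A minimal sketch of that bookkeeping follows. All the names here (struct client, notification_matches(), and so on) are hypothetical stand-ins: the real clusterd is built on the pubsub library's own connection and subscription types, and its matcher evaluates the subscription's Boolean expression rather than doing a string comparison.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    struct subscription {
        char expr[256];              /* the client's match expression (opaque here) */
        struct subscription *next;
    };

    struct client {
        int fd;                      /* downstream connection                */
        struct subscription *subs;   /* this client's list of subscriptions  */
        struct client *next;
    };

    /* Placeholder matcher: stands in for evaluating the Boolean expression
       against the notification's attributes. */
    static bool notification_matches(const char *notification, const char *expr)
    {
        return strstr(notification, expr) != NULL;
    }

    static void deliver(struct client *c, const char *notification)
    {
        /* The real code would encode and write the notification to c->fd. */
        printf("deliver to client fd=%d: %s\n", c->fd, notification);
    }

    /* Each notification from upstream is compared against every subscription
       of every client and delivered to those clients that match. */
    void handle_upstream_notification(struct client *clients, const char *notification)
    {
        for (struct client *c = clients; c != NULL; c = c->next) {
            for (struct subscription *s = c->subs; s != NULL; s = s->next) {
                if (notification_matches(notification, s->expr)) {
                    deliver(c, notification);
                    break;           /* at most one copy per client in this sketch */
                }
            }
        }
    }

    int main(void)
    {
        struct subscription s1 = { "node1", NULL };
        struct client c1 = { 7, &s1, NULL };
        handle_upstream_notification(&c1, "START node1");
        return 0;
    }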

A clusterd forwards all notifications from its clients upstream to its parent. So all client notifications propagate up to the top of the hierarchy, even if all interested parties are "local" to a particular clusterd. Alternatives:

  • We could attempt to "short-circuit" and send notifications directly to other clients which have subscribed to the event. However, we must still send the event upstream since we don't know who else is subscribed. This can (will) result in the notification coming back to us from above and us delivering it a second time to clients. We could keep track of outstanding notifications and ignore them if/when they come back, but the "if" part of this makes it a hard problem (i.e., when do we time out these "outstanding" requests?).
  • We could accurately track subscriptions from upstream so we would have a better idea if we need to forward a notification upstream. But if we do have to forward it upstream, it will come back at us unless we are more specific in our subscriptions. So we are not much better off.

A clusterd ignores (drops) all subscription requests from upstream. In part this is because we propagate all client notifications upward.
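
Taken together, the rules in this section reduce to a small dispatch table: what a clusterd does with a message depends only on which direction it came from and whether it is a subscription or a notification. A sketch, again with hypothetical names:

    #include <stdio.h>

    enum direction { FROM_UPSTREAM, FROM_CLIENT };
    enum msg_type  { SUBSCRIPTION, NOTIFICATION };

    /* Summary of the handling rules described above. */
    static void dispatch(enum direction dir, enum msg_type type)
    {
        if (dir == FROM_CLIENT && type == SUBSCRIPTION) {
            /* Record in that client's subscription list; nothing is forwarded
               upstream because we already subscribe to all events there. */
            puts("record client subscription locally");
        } else if (dir == FROM_UPSTREAM && type == NOTIFICATION) {
            /* Match against all client subscriptions and deliver downstream. */
            puts("match and deliver to interested clients");
        } else if (dir == FROM_CLIENT && type == NOTIFICATION) {
            /* Always forwarded to the parent, even if every interested
               subscriber is local to this clusterd. */
            puts("forward notification upstream");
        } else {
            /* Subscription requests from upstream are ignored (dropped). */
            puts("drop upstream subscription request");
        }
    }

    int main(void)
    {
        dispatch(FROM_CLIENT,   SUBSCRIPTION);
        dispatch(FROM_UPSTREAM, NOTIFICATION);
        dispatch(FROM_CLIENT,   NOTIFICATION);
        dispatch(FROM_UPSTREAM, SUBSCRIPTION);
        return 0;
    }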

4. Robustness issues

If a clusterd dies, how do we recover? The only state a clusterd maintains is a list of clients and subscriptions. The pubsub protocol already requires that clients re-register their subscriptions after losing and then regaining connectivity with their server. So everything should Just Work. clusterd itself will re-register upstream when it is restarted. All clients will eventually contact clusterd with their subscriptions, and its state will be recreated.

What about if clusterd stays up but loses contact with its server? clusterd will continue to try to reconnect to its server, re-registering when it does, so there is no problem there. However, clients will still be sending notifications to clusterd that should go upstream even when the upstream server is gone. The easy thing to do is just block on the first message we get that should go upstream. This effectively ties up clusterd, making it unresponsive and presumably triggering timeouts and reconnections. This is probably what we want. But we should probably not then deliver the message we were trying to force upstream, since an arbitrary amount of time has passed and we don't even know if the client that sent the notification is still there. So instead, we try to deliver a message once, and if it times out or fails, we drop all current client/subscription info, drop all existing connections, and go into ping-the-server mode until the server comes back. Then we start taking calls again and building up client state again.
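
A sketch of that failure policy, with hypothetical helpers standing in for the real pubsub calls:

    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical stand-ins for the real pubsub/clusterd operations. */
    static bool send_upstream(const char *n) { printf("try upstream: %s\n", n); return false; }
    static void drop_all_client_state(void)  { puts("drop clients, subscriptions, connections"); }
    static bool upstream_alive(void)         { return true; }   /* stands in for "ping the server" */
    static void resubscribe_upstream(void)   { puts("re-register upstream"); }

    /* Forward a client notification upstream.  On failure we do not retry the
       stalled notification (an arbitrary amount of time may pass and the sender
       may be gone); instead we drop all downstream state and wait for the server
       to return.  Clients will reconnect and re-register their subscriptions. */
    static void forward_upstream(const char *notification)
    {
        if (send_upstream(notification))
            return;

        drop_all_client_state();
        while (!upstream_alive())
            sleep(1);                 /* ping-the-server mode */
        resubscribe_upstream();       /* then start taking client calls again */
    }

    int main(void)
    {
        forward_upstream("example notification");
        return 0;
    }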

If one or more of the clusterd's clients go away, then nothing special needs to be done. We just behave as pubsubd in this case.

5. The difference between pubsubd and clusterd

All differences stem from the fact that clusterd is asymmetric; there is a distinguished server (upstream) connection that is treated differently from client (downstream) connections. Pubsubd, on the other hand, has only clients; making a hierarchy of pubsubds involves inserting a pass-through client at each level.

Clusterd cannot be the root of the hierarchy, as it must have an "upstream" and it cannot forward notifications from clients to other clients. In the Emulab context, this means that it would be unable to accept events from the event schedulers and pass them down to the clients, or to simply propagate events between clients.

Clusterd cannot be the parent of a pubsubd since it does not propagate subscriptions downward. Hence the pubsubd would never see any subscriptions and would never pass notifications upward.

6. Where we plan to use clusterd

In Emulab:

  • Replace per-node pubsubd/evproxy with an instance.
  • Run as a "sub-boss" service for our new cluster nodes.

In NCR:

  • An instance per testbed if there is some centralized event server above it.
  • On sub-boss-style nodes within a testbed at select points (where select points are determined statically during testbed planning and config).