Kb246

Emulab FAQ: Testbed Operations: Problem with nodes getting stuck in an infinite loop under os_load

Our nodes are getting stuck in an infinite loop when we try to os_load them. Rob and I looked at it a bit and the sequence seems to be this:

  Node:
  1. Reboots into frisbee and frisbees fine.
  2. Node says "Waiting for server to reboot us ...".  The reboot never happens.
  3. stated times out in state RELOAD/RELOADDONEV2.
  4. Node times out and says "No response from server, rebooting myself ...".
  5. Because stated timed out, the node boots back into frisbee. GOTO 1.
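
The cycle above can be written down as a toy model, which makes it easier to see that everything hinges on whether the server ever sends the reboot. This is only a sketch of the sequence as described here, not Emulab code; the function name and the `server_sends_reboot` flag are made up for illustration.

```python
def simulate_reload(server_sends_reboot: bool, max_cycles: int = 3) -> list[str]:
    """Toy model of the os_load sequence; returns the events in order."""
    events = []
    for cycle in range(1, max_cycles + 1):
        events.append(f"cycle {cycle}: node boots into frisbee and loads the image")
        events.append(f"cycle {cycle}: node waits for server to reboot it")
        if server_sends_reboot:
            # Healthy case: the server side reacts to RELOADDONEV2 and reboots the node.
            events.append(f"cycle {cycle}: server reboots node into the new image")
            return events
        # Broken case: nothing reacts, so both sides time out and the node starts over.
        events.append(f"cycle {cycle}: stated times out in RELOAD/RELOADDONEV2")
        events.append(f"cycle {cycle}: node gives up and reboots itself into frisbee")
    events.append("giving up: reload loop did not terminate")
    return events


if __name__ == "__main__":
    for line in simulate_reload(server_sends_reboot=False, max_cycles=2):
        print(line)
```

With `server_sends_reboot=False` the model never leaves the loop on its own, which matches the behaviour reported above.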

Any idea what could be causing this? It seems the problem is that the "Waiting for server..." never gets satisfied.


> Make sure your stated is sending an "ipod" for nodes in the
> RELOADDONEV2 state.

How do I check this? There is no mention of ipod in stated.log.


handleCtrlEvent() in /usr/testbed/sbin/stated on Boss should have:

	      if ($event eq $TBRELOADDONEV2) {
		  info("Sending an apod to $node");
		  system("$apod $node") == 0 or
		      notify("Could not apod $node after $TBRELOADDONEV2!");
	      }

If it does, maybe the apod (authenticated ipod) is not being properly triggered. If the IP/mask/key were wrong, the client would print a message on the console about a failed ipod. Since you don't see that, it may be that the ICMP type 6 packets are not making it to the node.


The code is there.

There is no such console message. If I run apod manually it correctly reboots the node.


> Does your stated log have lines like:
>   Mar  7 00:01:17 boss stated[323]: Clearing reload info for pc211
>   Mar  7 00:01:17 boss stated[323]: Sending an apod to pc211
>

No.
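
A quick way to repeat this check for any node is to scan the stated log for the apod line. The sketch below assumes the log lines look like the two quoted above; the function name is made up, and where your stated log actually lives is for you to fill in.

```python
import re


def apod_sent(log_lines, node):
    """Return True if any log line records stated sending an apod to `node`."""
    # Anchor at end of line so that "pc21" does not match a line about "pc211".
    pattern = re.compile(
        r"stated\[\d+\]: Sending an apod to " + re.escape(node) + r"\s*$"
    )
    return any(pattern.search(line) for line in log_lines)


if __name__ == "__main__":
    sample = [
        "Mar  7 00:01:17 boss stated[323]: Clearing reload info for pc211",
        "Mar  7 00:01:17 boss stated[323]: Sending an apod to pc211",
    ]
    print(apod_sent(sample, "pc211"))
```

If this comes back false for a node that just finished reloading, stated never reached the code that sends the apod, so the problem is upstream of the apod itself.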


> Do you have this in your DB:
>
> mysql> select * from state_transitions where state1='RELOADDONEV2' or state2='RELOADDONEV2';
> +---------+-----------+--------------+------------+
> | op_mode | state1    | state2       | label      |
> +---------+-----------+--------------+------------+
> | RELOAD  | RELOADING | RELOADDONEV2 | ReloadDone |
> +---------+-----------+--------------+------------+
> 1 row in set (0.00 sec)
>
> though I am not entirely clear what all "state" state there is in the DB
> and what is important.

Okay, I found the problem.

I had mismerged my scenario additions to database-fill.sql and as a result had missed a single line: a state trigger that triggers a reboot on reloading. After fixing database-fill.sql and refilling, it looks like everything is working.
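
For anyone hitting the same thing, a dropped line like this can be found by comparing the local database-fill.sql against an unmodified copy. The following is a rough sketch, not an Emulab tool: the function name is made up, and it only does a plain line-by-line comparison, so reformatted statements will show up as false positives.

```python
def missing_lines(reference_text, local_text):
    """List non-empty lines present in the reference file but absent locally."""
    local = {line.strip() for line in local_text.splitlines()}
    return [
        line.strip()
        for line in reference_text.splitlines()
        if line.strip() and line.strip() not in local
    ]


if __name__ == "__main__":
    # Placeholder statements only; these are not real database-fill.sql contents.
    reference = "STATEMENT A;\nSTATEMENT B;\nSTATEMENT C;\n"
    local = "STATEMENT A;\nLOCAL ADDITION;\nSTATEMENT C;\n"
    print(missing_lines(reference, local))
```

Local additions are ignored on purpose; only lines that exist in the reference copy and are gone from the local one are reported.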