CheckNode
This page last changed on Feb 7, 2014 by mike.
Checknode
Brief Description

The checknode sub-system is a collection of programs that check the health of a node at reloading time, at boot time, or any time a user wishes to run the nodecheck test(s). An important feature of the checknode system is that at reloading time (aka "in the frisbee-MFS") it logs inventory files on the server "ops" for post-processing. Node checking is run automatically at swapin time and can also be run manually, and non-destructively, at any time during an experiment. Here, non-destructive means that it does not destroy any data or prevent further use of the node. Like linktest, it requires that the nodes be quiescent while the test runs, or else it could affect (or be affected by) whatever else is going on.
User Documentation

Standard Operation

The checknode suite is run automatically at boot time. Users will only see output from the tests if they have a connection to the console, look at the console log file, or look at the local nodecheck.log file. The console output is a condensed form of what can be found in the local nodecheck.log file.

Boot time console output
    Running nodechecks
    Starting timecheck.. offset 0.000019 < 0.005 OK
    Cpucheck..Arch:x86_64 Sockets:1 Cores_socket:4 Threads_core:2 Mhz:2400 HT:1 64bit:1 HV:1 OK
    Starting Memcheck..12GiB OK
    Starting niccheck.. 6 interfaces OK
    Starting diskcheck../dev/sda 500 WD-WMAYP4243392 enabled /dev/sdb 250 9SF16T3N enabled Have 2 drives OK
    Done with nodechecks
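Anyone scripting against a saved copy of the console output may want a quick pass/fail summary. The following is a minimal sketch and not part of the checknode suite: the function name `summarize_nodecheck` is hypothetical, and it assumes only what the console examples in this section show, namely that a run ends with "Done with nodechecks" and that a failing check prints the word FAILED.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of checknode): classify one nodecheck
# console run read from stdin.
#   prints OK          and returns 0 if the run finished with no FAILED marker
#   prints FAILED      and returns 1 if any check reported FAILED
#   prints INCOMPLETE  and returns 2 if the closing "Done with nodechecks" is missing
summarize_nodecheck() {
    local text
    text=$(cat)    # slurp the whole capture; line breaks do not matter
    if [[ $text != *"Done with nodechecks"* ]]; then
        echo "INCOMPLETE"
        return 2
    fi
    if [[ $text == *FAILED* ]]; then
        echo "FAILED"
        return 1
    fi
    echo "OK"
}
```

The function reads standard input, so it can be fed a console log excerpt through a pipe or a redirect. It deliberately matches on whole-text substrings rather than on individual lines, because the exact line layout of the console output is not something this sketch relies on.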
If something is amiss, it will be reported like this:

Missing disk drive console output
    Running nodechecks
    Starting timecheck.. offset 0.000056 < 0.005 OK
    Cpucheck..Arch:x86_64 Sockets:1 Cores_socket:4 Threads_core:2 Mhz:2400 HT:1 64bit:1 HV:1 OK
    Starting Memcheck..12GiB
    Starting niccheck.. 6 interfaces OK
    Starting diskcheck../dev/sda 500 WD-WMAYP4243392 enabled Have 1 drive TBmiss: TB Claims 9SF16T3N FAILED
    Done with nodechecks

In the above example, one of the hard drives is missing according to the testbed database. The database claims we should have a drive with the serial number shown on the TBmiss line.

Manual Operation

Individual checknode tests, or the entire suite, can be run after boot. The test names are:

timecheck - Clock synchronization. Is the clock reasonably in sync with real time?

Individual tests can be run by changing to the directory used in the following example and invoking the test by name.

Example of cpucheck
    $ cd /usr/local/etc/emulab
    $ ./cpucheck
    Cpucheck..Arch:x86_64 Sockets:1 Cores_socket:4 Threads_core:2 Mhz:2400 HT:1 64bit:1 HV:1 OK

All the checks can be run in a batch using the startup rc script.

Example of running rc.nodecheck
    $ cd /usr/local/etc/emulab
    $ rc/rc.nodecheck
    Running nodechecks
    Starting timecheck.. offset -0.000710 < 0.005 OK
    Cpucheck..Arch:x86_64 Sockets:1 Cores_socket:4 Threads_core:2 Mhz:2400 HT:1 64bit:1 HV:1 OK
    Starting Memcheck..12GiB
    Starting niccheck.. 6 interfaces OK
    Starting diskcheck../dev/sda 500 WD-WMAYP4243392 enabled /dev/sdb 250 9SF16T3N enabled Have 2 drives OK
    Done with nodechecks

Node local log file

Stored locally on the node is the nodecheck.log file:

    Tue Nov 12 11:14:48 MST 2013------ Start boottime_nodecheck ------

Administrators Documentation

Emulab Integration

The checknode suite is a reporting system; it does not directly affect Emulab itself. The interface from Emulab is the hwinfo command of the tmcc utility. Other tmcc operations used by the checknode suite are ntpinfo and node_id. The following two tables show the output syntax of the hwinfo operation. If the testbed database does not have a given type of information, the output of 'tmcc hwinfo' does not display it.

As checknode runs during the node reloading phase, it writes inventory files to the ops server. These files can be processed out-of-band on the server to:

1. populate the database with node information, and
2. check for changes in the hardware configurations of the testbed nodes.

calling tmcc hwinfo
    TESTINFO LOGDIR="/proj/emulab-ops/nodecheck" COLLECT=1 CHECK=1
    CPUINFO SOCKETS=1 CORES=1 THREADS=2 SPEED=3000 BITS=64 HV=0
    MEMINFO SIZE=2048
    DISKINFO UNITS=2
    DISKUNIT SN="3KS0WKKN" TYPE="SCSI" SECSIZE=512 SIZE=146815
    DISKUNIT SN="3KS0XP4L" TYPE="SCSI" SECSIZE=512 SIZE=146815
    NETINFO UNITS=6
    NETUNIT TYPE="ETH" ID="001143e49261"
    NETUNIT TYPE="ETH" ID="001143e49262"
    NETUNIT TYPE="ETH" ID="000423b7211e"
    NETUNIT TYPE="ETH" ID="000423b7211f"
    NETUNIT TYPE="ETH" ID="000423b720fe"
    NETUNIT TYPE="ETH" ID="000423b720ff"

tmcc hwinfo output syntax
1. Info about how to run the test (one line):
       TESTINFO LOGDIR="<path>" COLLECT=(0|1) CHECK=(0|1)
2. CPUs (one line):
       CPUINFO SOCKETS=<#> CORES=<#> THREADS=<#> SPEED=<MHz> BITS=<32|64> HV=<1|0>
3. RAM (one line):
       MEMINFO SIZE=<MiB>
4. Disks (one line):
       DISKINFO UNITS=<#>
5. Disk units (one line per unit):
       DISKUNIT SN=<serial> TYPE=<PATA|SATA|SCSI|RAID> SECSIZE=<#> SECTORS=<#> RSPEED=<MBs> WSPEED=<MBs>
6. NICs (one line):
       NETINFO UNITS=<#>
7. NIC ports (one line per port):
       NETUNIT TYPE=<ETH|WIFI|IB> ID=<mac>

Installation

Client side programs

The following files are installed on the client nodes. They are included and installed as part of the standard install of Emulab client OS images.

    /usr/local/etc/emulab/

A local log file is created when checknode is run during the boot of the client OS.

    /var/emulab/logs/nodecheck.log

Inventory Files
Remote logs, referred to as inventory files, are written to the ops server in the location:

    /proj/<pid>/nodecheck/<node_id>

This is the default location; however, the actual pathname is set by the '

When the checknode system saves information for the node being tested, it saves several files in the directory

    /proj/<eid>/nodecheck/<node_id>*
The differences that the client software found between the node and tmcc listings.

When checknode is run during the reloading phase, the inventory files are created in:

    /proj/emulab-ops/nodecheck/<node_id>

The information used to populate the testbed database is gathered at frisbee-MFS node load time. This MFS image likely does not have bash, dd, or smartctl installed, so these programs are run from the server ops directory

    /proj/emulab-ops/nodecheck/`uname -s`/`uname -m`/
    /proj/emulab-ops/nodecheck/FreeBSD/bin-i386 /smartctl

Each of these helper programs is statically linked.
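The per-platform helper directory described above can be computed rather than hard-coded. The sketch below is illustrative only: `helper_dir` and `find_helper` are hypothetical names, not checknode functions, and the layout is assumed to follow the `uname -s`/`uname -m` form given above. The base directory and platform values can be overridden, which is only there to make the functions easy to try out.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of checknode).

# helper_dir [BASE] [OS] [ARCH]
# Print the helper directory, defaulting to the layout described above:
#   /proj/emulab-ops/nodecheck/`uname -s`/`uname -m`
helper_dir() {
    local base=${1:-/proj/emulab-ops/nodecheck}
    local os=${2:-$(uname -s)}
    local arch=${3:-$(uname -m)}
    printf '%s/%s/%s\n' "$base" "$os" "$arch"
}

# find_helper NAME [BASE] [OS] [ARCH]
# Print the path of helper NAME: prefer the copy in the helper directory,
# otherwise fall back to whatever is on PATH. Returns non-zero if neither exists.
find_helper() {
    local name=$1 dir
    dir=$(helper_dir "${@:2}")
    if [[ -x $dir/$name ]]; then
        printf '%s\n' "$dir/$name"
    else
        command -v "$name"
    fi
}
```

For example, `find_helper smartctl` would print the statically linked copy under the ops directory when it is present and executable, and otherwise any `smartctl` already installed in the image.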
Bootstrapping the Checknode system
Enabling the hwinfo operation of tmcc.

The hwinfo operation only returns information for allocated (i.e., not in the free pool), physical nodes. Exactly what hwinfo returns depends on how much state there is for a node in the Emulab database.

tmcc will always return the site-wide TESTINFO information based on the values of the nodecheck/collect and nodecheck/check site variables. The former indicates that the checknode suite should collect current information about the node and create an inventory file as previously described. The latter tells checknode that it should check the current values against those returned by hwinfo and report problems. An Emulab site will probably have nodecheck/check always set non-zero but might only set nodecheck/collect to gather initial information about nodes for populating the database. At Utah, we always collect information and periodically compare the collected reports to check for consistency.

CPUINFO and MEMINFO lines are only returned if there is per-node-type (or per-node) information in the database node_type_attributes (or node_attributes) table. The relevant attributes are:
Typically these are just set per node-type, but if you have an instance of a node type that, say, has less memory than others, you can set a node_attributes value for that node to override the type value. hwinfo will return values for whichever of these attributes are set. Note that there are three related pre-existing node_type attributes (frequency, memory and processor) that are set when a node_type is added to the testbed, but we opted not to use these to avoid any future conflicts with their meanings. It is quite possible that these will be removed in the future.

The values of the DISKINFO fields come from the database storage subsystem tables, and you should use the gen_sql script mentioned below to populate these. The basic entry for a drive is in the blockstores table, and this row contains the unique index, node name, a disk name, and the size. A blockstore_attributes table row should contain the same index and the serial number for the drive. Currently the read and write speeds of the disk are not represented in the database. Likely, they will also be attributes.

Finally, NETINFO information comes from the database interfaces table. This interface information for a node is added when the node is added to the testbed.

operational flags

The first line returned by a call of tmcc hwinfo gives some operational parameters for how the tests should be run.

    TESTINFO LOGDIR="/proj/emulab-ops/nodecheck" COLLECT=1 CHECK=1

LOGDIR   path to persistent storage for saving output results.
COLLECT  run test, output results some place persistent.
CHECK    run test, normalize results, compare to DB info, report any errors.

Database

Insertion of common node_type HW parameters such as CPU characteristics, memory size, etc.

Setting of parameters for individual nodes

Populating the testbed with hard drive information

The population of hard drive information is done with a utility called gen_sql installed in the checknode source directory. It must be run on the boss machine of the testbed.
It takes no arguments and does nothing but output an SQL script.

    boss:./checknode$ ./gen_sql
    #BYHAND mysql -e "insert into interfaces set node_id='gpu1',mac='021fc600a848',card=X,port=X,interface_type='?',iface='ethX',role='?',uuid='1edf31cf-876c-11e3-83eb-001143e453fe';" tbdb
    #BYHAND mysql -e "insert into interfaces set node_id='pc423',mac='0024e87928ba',card=X,port=X,interface_type='?',iface='ethX',role='?',uuid='1f1248a1-876c-11e3-83eb-001143e453fe';" tbdb
    mysql -e "insert into blockstores values (1066, 'pc498', 'disk1', 0, 'sata-generic', 'element', 476940, 1, now());" tbdb
    mysql -e "insert into blockstore_attributes values (1066, 'serialnum', 'WD-WMAYP3579753', 'string');" tbdb
    mysql -e "insert into blockstores values (1067, 'pc498', 'disk2', 0, 'sata-generic', 'element', 238418, 1, now());" tbdb
    mysql -e "insert into blockstore_attributes values (1067, 'serialnum', '9SF16YWS', 'string');" tbdb
    mysql -e "update emulab_indicies set idx=1068 where name='next_bsidx';" tbdb

To update the database, simply run the command like this:

    boss:./checknode$ ./gen_sql | bash

This assumes that your database is called tbdb. Notice that some commands start with the comment "#BYHAND". These are commands that are dangerous, or for which gen_sql does not have all the information it needs. They should be studied and run by hand if appropriate.

Generate state of the testbed using the 'runreport' command

The following assumes that the directory /proj/emulab-ops/nodecheck/nodecheck exists (it could be anywhere) and contains the following:
The runreport command can be executed by hand, or it can be put in a crontab file such as:

    # crontab file to run emulab inventory checks
    SHELL=/usr/local/bin/bash
    PATH=/usr/sbin:/usr/bin:/usr/local/bin:/usr/testbed/bin:/usr/testbed/sbin:/bin
    CHECKNODE_CRONJOB=YES
    CHECKNODE_MTA="sendmail -t"
    CHECKNODE_MAILTO="tbops"
    #
    #
    0 5 * * * (cd /proj/emulab-ops/nodecheck/nodecheck; ./runreport)

Day-to-day operations

Console Line logging in mfs mode

Normal
    Running Hardware Inventory Gather
    Gathering Inventory.. Starting diskcheck.. Cpucheck.. Starting Memcheck.. Starting niccheck..
    Done Running Hardware Inventory

Exception output - disk order
    Running Hardware Inventory Gather
    Gathering Inventory.. Starting diskcheck.. Cpucheck.. Starting Memcheck.. Starting niccheck..
    ERROR DISKs: OUT OF ORDER found 3KS0XJK4 3KS0XJW1 from tbdb 3KS0XJW1 3KS0XJK4
    Done Running Hardware Inventory

Diff reporting

    Diff Report for pc542 @ Tue Nov 12 15:23:37 MST 2013
    Kernel Linux 3.2.7 #8 SMP Fri May 25 14:19:38 MDT 2012 x86_64
    ------------------------------------------------------------------
    ERROR DISK OUT OF ORDER found WD-WMAYP4243392 from tbdb WD-WMAYP4243392 9SF16T3N
    DISKINFO does not match local DISKINFO UNITS=1 tbdb DISKINFO UNITS=2
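The kind of comparison behind the diff report can be illustrated with a few lines of bash. This is a sketch under stated assumptions, not the checknode implementation: `hw_field`, `disk_serials` and `compare_disks` are hypothetical names, the message wording is invented for the example, and the input is assumed to be two files in the 'tmcc hwinfo' format (one gathered locally, one from the testbed database) whose values contain no spaces.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of checknode) for text in 'tmcc hwinfo' format.

# hw_field KEYWORD FIELD: print FIELD's value from the first stdin line that
# starts with KEYWORD (e.g. DISKINFO UNITS), with double quotes stripped.
hw_field() {
    local keyword=$1 field=$2 line toks tok
    while IFS= read -r line; do
        [[ $line == "$keyword "* ]] || continue
        read -r -a toks <<< "$line"        # split on whitespace, no globbing
        for tok in "${toks[@]}"; do
            if [[ $tok == "$field="* ]]; then
                tok=${tok#*=}
                printf '%s\n' "${tok//\"/}"
                return 0
            fi
        done
    done
    return 1
}

# disk_serials: print the SN value of every DISKUNIT line on stdin, in order.
disk_serials() {
    local line toks tok
    while IFS= read -r line; do
        [[ $line == "DISKUNIT "* ]] || continue
        read -r -a toks <<< "$line"
        for tok in "${toks[@]}"; do
            [[ $tok == SN=* ]] || continue
            tok=${tok#SN=}
            printf '%s\n' "${tok//\"/}"
        done
    done
}

# compare_disks LOCAL_FILE TBDB_FILE: report unit-count and serial-order
# mismatches; returns 0 when both listings agree, 1 otherwise.
compare_disks() {
    local local_units tbdb_units local_sn tbdb_sn status=0
    local_units=$(hw_field DISKINFO UNITS < "$1") || local_units=""
    tbdb_units=$(hw_field DISKINFO UNITS < "$2") || tbdb_units=""
    if [[ $local_units != "$tbdb_units" ]]; then
        echo "DISKINFO does not match: local UNITS=$local_units tbdb UNITS=$tbdb_units"
        status=1
    fi
    local_sn=$(disk_serials < "$1" | tr '\n' ' ')
    tbdb_sn=$(disk_serials < "$2" | tr '\n' ' ')
    if [[ $local_sn != "$tbdb_sn" ]]; then
        echo "disk serials differ: found ${local_sn% } from tbdb ${tbdb_sn% }"
        status=1
    fi
    return $status
}
```

Comparing the serial numbers as an ordered list means that a missing drive and a pair of swapped drives are both reported, which mirrors the two error cases shown in the console and diff examples.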
Developer notes
rc.nodecheck

This is the boot-time startup script.

nodecheck

Umbrella script that calls the rest of the scripts.

diskcheck

Gathers information about the hard drives:
cpucheck

Gathers information about the CPU:
timecheck

Checks the clock synchronization with network time. Timecheck only runs in non-mfs mode.

memcheck

Discovers the amount of memory in the system.

niccheck

Gathers information about the NIC adapters on the node:
Information (inventory) logging

Local node logging

Logs containing more information than what is output on the console are saved in /var/emulab/logs/nodecheck.log

Development/bootstrap utilities

checkwce
Search the inventory files and list drives that don't have the WriteCacheEnable bit set.

checkinventorydrift
Compare the inventory files against each other, looking for changes between runs.

checkutils.h
Support functions that all other scripts source, including gen_sql. The following are some useful functions that can be included by other scripts.

    # A calling script needs to first source the utility file and then call initialize
    source checkutils.h
    initialize
    # the following functions then become available:
    #   readtmcinfo      read hw inventory from 'tmcc hwinfo' or a file.
    #                    copy into one of three arrays hwinv, hwinvcopy or tmccinfo
    #   comparetmcinfo   compare arrays hwinv and copyhwinv. diffs are sent to an output file
    #   printhwinv       output a complete listing of all info in the array hwinv
    #   printtcminfo     output from the array hwinv only the data which should be in the
    #                    testbed database, in the same format as 'tmcc hwinfo'

Bash

Bash is used for all the scripts that run in checknode except for rc.nodecheck, which is written in sh. Bash version 4 is required.

    # if var is not set then create var, and optionally set it
    [[ -z "${var-}" ]] && declare var

    # "${a+$a}" use $a if it exists else use nothing
    # example: test to see if an array has something in it, if not then ...
    if [ -z "${hwinvcopy[$1]+${hwinvcopy[$1]}}" ] ; then
        : # ... handle the empty entry here
    fi

    # $1 could hold a log file name or it could be empty, if empty use a default
    # needed since -u is set and we can't reference an unset variable, i.e. we can't
    # test $1 to see if it has a value.
    logfile=${1-"/tmp/logfile"}

sql commands that can be helpful

    # to get HD info from serial number
Document generated by Confluence on Jan 27, 2014 08:42