Punches
Punches are defined in a configuration file, as described in the documentation for clusterpunch.conf. This section documents the punches that come with the distribution. Some of the diagnostic punches are designed to work with Red Hat or a Red Hat-like Linux distribution. You can always alter these punches, or create your own, to suit your environment. The punches shown here are not the best possible punches - let me know if you come up with something better or more interesting. If you create punches for your specific OS or architecture, let me know as well, so that I can post them for others using the same platform.
Example Punches
punch = punch1
This punch makes and times 1,000,000 calls to rand() and populates the bench1 statistic with the time taken to execute the code.
    <punch>
    name = punch1
    statistic = bench1
    valuetype = timer
    format = %6.3f
    function <<CODE
    for (my $i=0;$i<1e6;$i++) {
      rand ();
    }
    CODE
    </punch>
The output below, like the output shown for every other punch in this section, was obtained by running clustersnapshot with the arguments -c PUNCHNAME -s STATISTIC. The command was run on our cluster, and I show the nodes at the top, middle and bottom of the listing, sorted by the punch value.
    host  bench1  live
    2of8   0.406     1
    7of7   0.406     1
    ...
    8of4   0.450     1
    5of3   0.450     1
    ...
    0of0   0.774     1
    0of1   1.162     1
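To make the meaning of valuetype = timer more concrete, here is a minimal sketch of how a piece of code could be timed. It assumes wall-clock timing with the core Time::HiRes module. The helper time_code is a hypothetical name used only for this illustration, and the way clusterpunch itself performs the timing may differ.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper (not part of clusterpunch): run a code reference
# and return the elapsed wall-clock time in seconds.
sub time_code {
    my ($code, @args) = @_;
    my $t0 = [gettimeofday];
    $code->(@args);
    return tv_interval($t0);    # seconds elapsed since $t0, as a float
}

# The body of punch1: 1,000,000 calls to rand().
# The result is assigned only to avoid a void-context call.
my $elapsed = time_code(sub {
    for (my $i = 0; $i < 1e6; $i++) {
        my $r = rand();
    }
});

# Same output format as in the punch definition (format = %6.3f).
printf "%6.3f\n", $elapsed;
```

The printed value depends on the machine, which is exactly what makes it usable as a relative benchmark.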
punch = punch2
This is an example punch which employs a call to a system command. In this case the cat statistic is populated and the value is the time taken to show a directory listing of all entries under /etc.
    <punch>
    name = punch2
    statistic = cat
    valuetype = timer
    system = "/bin/ls -alR /etc &> /dev/null"
    </punch>
The listing below shows the resulting values.
    host       cat  live
    3of8  0.086979     1
    8of8  0.094786     1
    ...
    9of1  0.143933     1
    8of1  0.144877     1
    ...
    0of0  0.657006     1
    0of1   2.62294     1
Benchmark Punches
Benchmark punches contain code which is timed on each node. The execution times are then used to provide a relative ranking of the nodes.
punch = benchmem
This punch is meant to determine the speed of the memory subsystem by repeatedly allocating and deallocating a large array. The punch accepts the arguments M and N, and allocates and deallocates an array of N elements M times. By default N=1,000,000 and M=20. This punch contributes to the cumulative statistic b_all. The value filter defined by valuemap does nothing, and is included as an example of how this feature could be used.
    <punch>
    name = benchmem
    statistic = b_mem
    cumulative = b_all
    valuemap = return $_[0]
    valuetype = timer
    format = %6.3f
    sort = ascending
    function <<CODE
    my ($M,$N) = @_;
    $M ||= 20;
    $N ||= 1e6;
    my @array;
    foreach my $idx (1..$M) {
      $array[$idx] = [];
      $array[$idx]->[$N] = 0;
      $array[$idx] = undef;
    }
    CODE
    </punch>
The listing below shows the resulting values.
    host  b_all  b_mem  live
    7of8  0.791  0.791     1
    1of7  0.791  0.791     1
    ...
    5of1  0.933  0.933     1
    5of0  0.933  0.933     1
    ...
    2of1  1.242  1.242     1
    0of1  2.774  2.774     1
punch = benchio
This is a test of the I/O system using dd. A file of size kbytes is written M times to /tmp. It is assumed that the block size on your system is 512 bytes, so the number of blocks written is 2*kbytes. It is also assumed that you have /bin/dd and that there is enough space in /tmp for the operation. The file has a randomized name and is deleted immediately after creation. By default a single 60MB file is created during this punch.
    <punch>
    name = benchio
    statistic = b_io
    cumulative = b_all
    valuetype = timer
    format = %6.3f
    function <<CODE
    my ($kbytes,$M) = @_;
    $kbytes ||= 60e3;
    $M ||= 1;
    foreach (1..$M) {
      my $randfilename = join("",map {chr(97+int(rand(25)))} (0..20));
      my $count = 2*$kbytes;
      system("/bin/dd if=/dev/zero of=/tmp/$randfilename count=$count &> /dev/null");
      unlink("/tmp/$randfilename");
    }
    CODE
    </punch>
The listing below shows the resulting values.
    host  b_all   b_io  live
    0of0  0.073  0.073     1
    2of2  0.637  0.637     1
    ...
    9of3  1.460  1.460     1
    4of4  1.461  1.461     1
    ...
    1of1  4.077  4.077     1
    9of1  4.123  4.123     1
punch = benchcpu
To benchmark the CPU, the punch makes N passes through a loop of calls to transcendental and power functions. There are multiple entries within the loop to try to minimize the overhead of looping. By default the loop is repeated 100,000 times. I'm aware that this is not a rigorous CPU benchmark, any more than the memory and I/O benchmarks above are rigorous. I've found it sufficient to rank P3 CPUs.
    <punch>
    name = benchcpu
    statistic = b_cpu
    cumulative = b_all
    valuetype = timer
    format = %6.3f
    function <<CODE
    my ($N) = @_;
    $N ||= 1e5;
    foreach (0..$N) {
      sin($_/$N)**2*cos($_/($N+1))**2;
      sin($_/($N+1))**2*cos($_/($N+2))**2;
      sin($_/($N+2))**2*cos($_/($N+3))**2;
      sin($_/($N+3))**2*cos($_/($N+4))**2;
    }
    CODE
    </punch>
In the example below, I've used -c "benchcpu;mhz;load" -s "b_cpu" as the parameters to clustersnapshot to show how the CPU benchmark results relate to the current load and MHz rating.
    host  b_all  b_cpu  live  load   mhz
    5of7  0.514  0.514     1   1.0  2792
    5of8  0.514  0.514     1   0.0  2792
    7of7  0.514  0.514     1   0.0  2792
    ...
    9of3  0.570  0.570     1   1.1  2522
    5of3  0.571  0.571     1   1.0  2522
    6of3  0.571  0.571     1   1.1  2522
    ...
    7of2  0.777  0.777     1   1.0  1992
    3of3  0.791  0.791     1   3.1  2522
    0of1  1.321  1.321     1   2.2  1992
Increasing the number of loops in benchcpu from 100,000 to 1,000,000 by calling -c "benchcpu(1e6);mhz;load" -s "b_cpu" yields similar relative rankings, but now the benchmark takes at least 5 seconds on each node.
    host   b_all   b_cpu  live  load   mhz
    7of7   5.132   5.132     1   0.2  2792  (up from #3)
    0of8   5.132   5.132     1   0.2  2792  (up from #4)
    5of8   5.132   5.132     1   0.1  2792  (down from #2)
    ...
    6of3   5.678   5.678     1   1.2  2522
    1of4   5.679   5.679     1   2.2  2522
    3of4   5.682   5.682     1   2.2  2522
    9of3   5.684   5.684     1   1.2  2522
    8of3   5.688   5.688     1   1.2  2522
    5of4   5.689   5.689     1   0.1  2522
    5of3   5.689   5.689     1   1.2  2522
    ...
    2of1   7.819   7.819     1   2.5  1992
    0of1   8.181   8.181     1   3.1  1992
    0of0  11.606  11.606     1   3.0  1992
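The ranking step itself is easy to picture as a sort on the punch values. The following is a minimal sketch of that idea, not the code clustersnapshot uses: rank_nodes is a hypothetical helper, and it assumes that the sort setting of a punch simply selects ascending order (smaller is better, as for timers) or descending order (larger is better, as for mhz).

```perl
use strict;
use warnings;

# Hypothetical helper: return host names ordered by punch value.
# $order is "ascending" or "descending"; ties are ordered by host name.
sub rank_nodes {
    my ($values, $order) = @_;
    my @hosts = sort { $values->{$a} <=> $values->{$b} || $a cmp $b } keys %$values;
    @hosts = reverse @hosts if $order eq "descending";
    return @hosts;
}

# A few b_cpu timings copied from the first benchcpu listing above.
my %b_cpu = ("5of7" => 0.514, "9of3" => 0.570, "7of2" => 0.777, "0of1" => 1.321);

print join(" ", rank_nodes(\%b_cpu, "ascending")), "\n";    # 5of7 9of3 7of2 0of1
```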
Diagnostic Punches
Diagnostic punches return values pertaining to the hardware profile, status and configuration of the nodes. Their return values can be used to rank the nodes on an absolute scale.
punch = mhz
This punch uses information from the /proc filesystem to provide the sum of the MHz speeds of all CPUs in the node. Each CPU is expected to have an entry in /proc/cpuinfo with a line like
    cpu MHz : 1261.416
showing the MHz speed. The punch uses valuetype = return, since we want the punch value to be the return value of the code rather than its execution time. The sort type of the punch is set to sort = descending because larger values are associated with higher ranks - we want more MHz!
    <punch>
    name = mhz
    statistic = mhz
    valuemap = return $_[0]
    valuetype = return
    format = %5d
    sort = descending
    function <<CODE
    return 0 if ! -e "/proc/cpuinfo";
    my $cpuinfo = `cat /proc/cpuinfo`;
    # start at 0 so that 0 is returned if no MHz line is found
    my $MHz = 0;
    while($cpuinfo =~ /MHz\s*:\s*(\d+)/g) {
      $MHz += $1;
    }
    return $MHz;
    CODE
    </punch>
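The summing step can be tried outside clusterpunch. The sketch below applies the same regular expression as the punch to a made-up two-CPU excerpt of /proc/cpuinfo. The name sum_mhz is hypothetical and the sample text is invented for this illustration.

```perl
use strict;
use warnings;

# Sum the integer part of every "MHz : <number>" entry, as the punch does.
sub sum_mhz {
    my ($cpuinfo) = @_;
    my $mhz = 0;
    while ($cpuinfo =~ /MHz\s*:\s*(\d+)/g) {
        $mhz += $1;
    }
    return $mhz;
}

# Invented /proc/cpuinfo excerpt for a node with two CPUs.
my $sample = <<'END';
processor : 0
cpu MHz : 1261.416
processor : 1
cpu MHz : 1261.416
END

printf "%5d\n", sum_mhz($sample);    # prints " 2522"
```

Note that the pattern captures only the digits before the decimal point, so the fractional part of each MHz value is dropped before summing.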
punch = load
This punch returns the system load average read from /proc/loadavg. By default the 1-minute load average is returned, but you can call the punch as load(N), with N = 1, 5 or 15, to get the 1-, 5- or 15-minute average.
    <punch>
    name = load
    statistic = load
    valuetype = return
    appendargs = true
    format = %4.1f
    function <<CODE
    my ($time) = @_;
    my $loadavg;
    if(-e "/proc/loadavg") {
      $loadavg = `cat /proc/loadavg`;
      chomp $loadavg;
    } else {
      $loadavg = "- - -";
    }
    my %load;
    @load{1,5,15} = split /\s+/,$loadavg;
    if(defined $time && defined $load{$time}) {
      return $load{$time};
    } else {
      return $load{1};
    }
    CODE
    </punch>
Using clustersnapshot -c "load" -s "load":
    host  live  load
    5of4     1   0.0
    8of0     1   0.0
    ...
    4of7     1   1.1
    9of2     1   1.1
    ...
    1of5     1   3.1
    4of3     1   3.1
You'll notice that the punch populates the statistic named load (statistic = load). What if you ask for more than one load average? Using clustersnapshot -c "load;load(5);load(15)" -s "load":
    host  live  load  load15  load5
    5of2     1   0.0    0.00   0.02
    6of0     1   0.0    0.00   0.00
    ...
    6of8     1   1.0    0.90   1.01
    8of8     1   1.0    0.91   1.00
    ...
    2of4     1   3.0    2.91   3.01
    1of5     1   3.0    2.91   3.03
You'll notice that the arguments of the 5- and 15-minute average punches were appended to the statistic name, so that you can distinguish between the loads. Because the 1-minute load punch was called as load and not load(1), the 1-minute statistic keeps the name load (the two calls are equivalent, because the default load returned is the 1-minute average). The arguments are appended to the statistic name when appendargs = true.
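The naming rule can be written down in a few lines. This is only a sketch that reproduces the names seen in the listing above (load, load5, load15). The helper statistic_name is hypothetical, and joining several arguments without a separator is an assumption, since the listing only shows single-argument calls.

```perl
use strict;
use warnings;

# Hypothetical helper: build the statistic name for a punch call.
# With appendargs enabled, the call arguments are appended to the name.
sub statistic_name {
    my ($statistic, $appendargs, @args) = @_;
    return $statistic unless $appendargs && @args;
    return $statistic . join("", @args);    # assumption: no separator
}

print statistic_name("load", 1), "\n";        # load    (called as load)
print statistic_name("load", 1, 5), "\n";     # load5   (called as load(5))
print statistic_name("load", 1, 15), "\n";    # load15  (called as load(15))
print statistic_name("load", 0, 5), "\n";     # load    (appendargs off)
```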
punch = uptime
To get the uptime of the node, use this punch. Your /proc/stat needs to have a line like
    btime 1043184875
for this punch to work. The value returned by the punch is in units of days.
    <punch>
    name = uptime
    statistic = uptime
    valuetype = return
    format = %7d
    sort = descending
    function <<CODE
    my $stat = `cat /proc/stat`;
    if($stat =~ /btime (\d+)/s) {
      my $boottime = $1;
      my $uptime = (time-$boottime)/3600/24;
      return $uptime;
    } else {
      return -1;
    }
    CODE
    </punch>
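The arithmetic is simple enough to check by hand. The sketch below uses a hypothetical helper, uptime_days, which takes the text of /proc/stat and a reference time as arguments so that the result does not depend on the current clock. The sample text around the btime line is invented.

```perl
use strict;
use warnings;

# Hypothetical helper: days elapsed between the boot time found in the
# /proc/stat text and $now (both in seconds since the epoch).
sub uptime_days {
    my ($stat, $now) = @_;
    return -1 unless $stat =~ /^btime\s+(\d+)/m;
    return ($now - $1) / 3600 / 24;
}

# Invented /proc/stat excerpt containing the btime line shown above.
my $stat = "cpu 1 2 3 4\nbtime 1043184875\nprocesses 12345\n";

# 36 hours after boot is 1.5 days.
printf "%7.2f\n", uptime_days($stat, 1043184875 + 36 * 3600);    # prints "   1.50"
```

With format = %7d, as in the punch definition, the same value would be displayed as a whole number of days.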
punch = nusers
Counts the number of users, not necessarily unique, logged into the node as reported by /usr/bin/who.
    <punch>
    name = nusers
    statistic = nusers
    valuetype = return
    format = %6d
    function <<CODE
    return -1 if ! -e "/usr/bin/who";
    my $nwho = `/usr/bin/who | wc | tr -s " " | cut -d " " -f 2`;
    chomp $nwho;
    return $nwho;
    CODE
    </punch>
punch = jobusers
Reports the total CPU time of jobs in the process table on a per-user basis. Any job not owned by one of the system accounts listed in the punch code counts towards the total. It's expected that the command
    ps auxww | tr -s " " | cut -d " " -f 1,10
will return something like
    root 8:06
    bob 0:30
    ntp 0:00
For each user, the CPU times of their jobs are added, and the output format is user:time, where time is in minutes. Thus, in the listing below, phuang's jobs on 7of8 have used 1.98 minutes of CPU time.
    <punch>
    name = jobusers
    statistic = jobusers
    valuetype = return
    format = %25s
    function <<CODE
    my $ps = `ps auxww | tr -s " " | cut -d " " -f 1,10`;
    my @ps = split(/\n/,$ps);
    chomp @ps;
    my %users;
    # accounts whose jobs are ignored (USER is the ps header line)
    my @stopusers = qw(USER bin daemon nobody root xfs rpcuser rpc ntp lp);
    foreach my $psline (@ps) {
      my ($user,$time) = split(/ /,$psline);
      next if grep($user eq $_, @stopusers);
      my ($min,$sec) = split(/:/,$time);
      # CPU time of this job, in seconds
      my $totsec = $min*60+$sec;
      $users{$user} += $totsec;
    }
    # convert each user's total from seconds to minutes
    map {$users{$_} /= 60 } keys %users;
    my @report;
    map { push(@report,join(":",$_,sprintf("%.2f",$users{$_}))) } sort keys %users;
    return join(",",@report);
    CODE
    </punch>
    host  jobusers                                           live
    7of7  acherk:0.03,jliu:0.00,martink:0.00,srusaw:0.18        1
    7of8  jliu:0.02,martink:0.00,phuang:1.98,srusaw:0.20        1
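To see how the per-user totals come about, the aggregation can be run on a handful of invented "user time" pairs instead of live ps output. This is a sketch of the same logic as the punch. The one deliberate difference is that the list of ignored accounts is stored in a hash for lookup rather than searched with grep. The user names and times are made up.

```perl
use strict;
use warnings;

# Invented "user time" pairs in the shape the ps pipeline is expected to produce.
my @ps = ("USER TIME", "root 8:06", "bob 0:30", "bob 1:45", "ntp 0:00", "alice 0:03");

# Accounts to ignore, taken from the punch code (USER is the ps header).
my %stop = map { $_ => 1 } qw(USER bin daemon nobody root xfs rpcuser rpc ntp lp);

# Add up the CPU time of each remaining user, in seconds.
my %seconds;
foreach my $line (@ps) {
    my ($user, $time) = split / /, $line;
    next if $stop{$user};
    my ($min, $sec) = split /:/, $time;
    $seconds{$user} += $min * 60 + $sec;
}

# Report user:minutes pairs, sorted by user name.
my $report = join ",", map { sprintf "%s:%.2f", $_, $seconds{$_} / 60 } sort keys %seconds;
print "$report\n";    # prints "alice:0.05,bob:2.25"
```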
punch = date
Returns the current time on the node in HH:MM:SS format. A different strftime format can be passed as the first argument of the punch.
    <punch>
    name = date
    statistic = date
    valuetype = return
    format = %8s
    function <<CODE
    use POSIX qw(strftime);
    my $format = $_[0] || "%H:%M:%S";
    my $timestamp = strftime $format, localtime;
    return $timestamp;
    CODE
    </punch>
punch = nrunning
Returns the number of currently running processes, as reported by /proc/loadavg.
    0.00 0.00 0.00 1/260 22710
The number before the slash in the fourth field of the line above (1 in this example) is used.
    <punch>
    name = nrunning
    statistic = nrunning
    valuetype = return
    format = %8d
    function <<CODE
    my $loadavg = `cat /proc/loadavg`;
    if($loadavg =~ /(\d+)\/(\d)/) {
      return $1;
    } else {
      return -1;
    }
    CODE
    </punch>
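The extraction can be checked against the sample line shown above. In this sketch, parse_procs is a hypothetical helper that returns both numbers of the running/total field. The punch itself only needs the first one.

```perl
use strict;
use warnings;

# Hypothetical helper: return (running, total) from a /proc/loadavg line,
# or (-1, -1) if the line has no "running/total" field.
sub parse_procs {
    my ($loadavg) = @_;
    return ($1, $2) if $loadavg =~ m{\b(\d+)/(\d+)\b};
    return (-1, -1);
}

my ($running, $total) = parse_procs("0.00 0.00 0.00 1/260 22710");
print "$running of $total\n";    # prints "1 of 260"
```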
punch = kernel
Returns the kernel release of the node, as reported by uname -r.
    <punch>
    name = kernel
    statistic = kernel
    valuetype = return
    format = %25s
    function <<CODE
    chomp(my $kernel = `uname -r`);
    return $kernel;
    CODE
    </punch>
The listing below shows the resulting values.
    host  kernel             live
    0of0  2.2.14-VA.2.1smp      1
    0of1  2.2.18pre11-va2.1     1
    0of2  2.2.14-VA.2.1smp      1
punch = mem
Reports the amount of free or used memory and swap on the node. You can call this punch as mem or mem(free) to see the amount of free memory, or as mem(total), mem(used), mem(swapused) or mem(totalwswap) to see the total memory, used memory, used swap, or total memory including swap. All values are returned in MB.
    <punch>
    name = mem
    statistic = mem
    valuetype = return
    format = %6d
    appendargs = true
    valuemap = $_[0]/1024;
    function <<CODE
    my ($arg) = @_;
    my @lines = map {[split(/\s+/,$_)]} grep ($_ =~ /\d/,split(/\n/,`free -o`));
    if(! $arg || $arg eq "free") {
      return $lines[0]->[3];
    } elsif ($arg eq "total") {
      return $lines[0]->[1];
    } elsif ($arg eq "used") {
      return $lines[0]->[2];
    } elsif ($arg eq "swapused") {
      return $lines[1]->[2];
    } elsif ($arg eq "totalwswap") {
      return $lines[1]->[1]+$lines[0]->[1];
    }
    CODE
    </punch>
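The field positions used by the punch are easier to follow with a concrete input. The sketch below parses an invented sample laid out like the output the punch expects from free -o (a Mem line and a Swap line, values in kB) and converts the results to MB the same way the valuemap does. The lookup table %kb is a hypothetical restructuring of the if/elsif chain, used only for this illustration.

```perl
use strict;
use warnings;

# Invented sample in the layout the punch expects from `free -o` (values in kB).
my $free = <<'END';
             total       used       free     shared    buffers     cached
Mem:       1048576     786432     262144          0      65536     393216
Swap:      2097152     524288    1572864
END

# Keep only the lines containing digits (Mem and Swap) and split them into fields.
my ($mem, $swap) = map { [split /\s+/, $_] } grep { /\d/ } split /\n/, $free;

# Hypothetical lookup table, one entry per argument the punch accepts.
my %kb = (
    free       => $mem->[3],
    total      => $mem->[1],
    used       => $mem->[2],
    swapused   => $swap->[2],
    totalwswap => $mem->[1] + $swap->[1],
);

# Divide by 1024 to convert kB to MB, as the valuemap does.
printf "%-10s %6d\n", $_, $kb{$_} / 1024 for qw(free total used swapused totalwswap);
```

For this sample the five values come out as 256, 1024, 768, 512 and 3072 MB.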