A while ago I wrote a Nagios plugin to monitor CPU load in SmartOS. I’ve now updated that plugin to support monitoring each core individually, which gives you more precise monitoring. In addition, I’ve created a template for PNP4Nagios which you can use with the new script. It handles any number of cores automatically, depending on how many are fed into Nagios by the monitoring script.
Nagios script to monitor disk temperatures in SmartOS
I’ve written a Nagios script for SmartOS to check the mean temperature of the hard drives in your pool.
Use it like this:
check_smarttemp.js [warning value] [critical value] [disks to check, for example c0t0d0 c0t1d0 c0t2d0 c0t3d0]
If you need the smartmontools package it can be found here.
Installing NRPE in the SmartOS global zone
Are you using Nagios, op5 or some other Nagios based monitoring solution? Then you’ll probably want NRPE installed in the global zone of your SmartOS nodes to monitor them. This is how to do it.
Installing NRPE
Since the GZ doesn’t come with pkgin to install packages by default, you have two choices: install pkgin as described in http://wiki.smartos.org/display/DOC/Installing+pkgin, or install NRPE manually (or with a management tool like Chef, CFengine etc.). I had problems using pkgin in the GZ; for some reason it didn’t install all the files in the NRPE package.
I’ll show you how to do it manually. Start by downloading NRPE from the latest pkgsrc available at smartos.org, for example http://pkgsrc.smartos.org/packages/SmartOS/2012Q4-multiarch/All/nagios-nrpe-2.12nb3.tgz. Unpack the file and copy the contents to /opt/custom. /opt/custom is one of the few places in your SmartOS installation where you can have stuff that persists through reboots. Remove the files starting with +; those contain package information needed by pkgin. NRPE should now be installed under /opt/custom/sbin/nrpe.
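The unpack-and-skip step above could be scripted roughly like this. It's a minimal Python 3 sketch, not the procedure actually used in the post: the function name unpack_pkg is made up for illustration, and it assumes the package is a gzip-compressed tar whose metadata files have names starting with +.

```python
import tarfile
from pathlib import Path


def unpack_pkg(tgz_path, dest="/opt/custom"):
    """Extract a package tarball into dest, skipping the +metadata files.

    Hypothetical helper: returns the sorted names of the extracted members.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tgz_path, "r:gz") as tar:
        # Skip every member whose file name starts with "+" (for example
        # +CONTENTS); the rest is laid out relative to the install prefix.
        wanted = [m for m in tar.getmembers()
                  if not Path(m.name).name.startswith("+")]
        tar.extractall(dest, members=wanted)
    return sorted(m.name for m in wanted)
```

Calling unpack_pkg("nagios-nrpe-2.12nb3.tgz") would then leave the package files under /opt/custom without the package information files.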
Configuration of NRPE
Under /opt/custom/share/examples/nagios/nrpe.cfg you can find an example configuration for NRPE. I copied this file to /opt/custom/etc and modified it for my needs. As the GZ doesn’t have a user called “nagios”, I’m running it as “nobody”. Also make sure nrpe.cfg points at the directory holding your scripts; mine are installed under /opt/custom/libexec/nagios. Be warned that the scripts included in the nagios plugins package from smartos.org won’t work in the GZ, since some libraries are missing.
Configure SMF
NRPE should in most cases be started when the SmartOS node starts. To accomplish this, a manifest has to be added to SMF, the Service Management Facility. SMF manifests are written in XML, so create a manifest XML file that looks like this.
Import it with svccfg import manifest.xml and enable it with svcadm enable nrpe. NRPE should now be up and running. Finally, copy your manifest XML file to /opt/custom/smf. This will import the manifest each time the system boots.
Monitoring CPU load in SmartOS global zone via Nagios
I’ve written a script for monitoring the CPU load in the SmartOS global zone using Nagios. The plugin just checks the momentary CPU load, no average over the last minutes or something like that.
You can find it at GitHub.
https://gist.github.com/linuxprofessor/4491960
https://github.com/linuxprofessor/nagios_scripts/blob/master/check_cpuload.sh
Monitoring a SmartOS hypervisor from Nagios
The global zone in SmartOS is, well… It’s not that comprehensive when it comes to programs installed. Since it’s to be regarded as a live image, you shouldn’t really install programs here. With that in mind, how does one monitor SmartOS from Nagios (or in my case op5)? The easiest solution is to use check_by_ssh and put all the monitoring scripts somewhere under /zones (so they don’t disappear if the machine reboots).
check_by_ssh uses, as you’ve probably already guessed, SSH to get the monitoring information from the monitored host to the NMS (Network Monitoring System). To use this method you’ll have to set up pub key auth, so this is step 1.
Setting up pub key auth with SSH
The user running Nagios (or op5) needs an SSH public key; use ssh-keygen to create one on the NMS. The newly created file called ~/.ssh/id_rsa.pub contains the public key. This needs to be transferred and inserted into a file on the SmartOS host.
It’s time to set up the 2nd part of pub key auth, now on the SmartOS host. Create a new directory under /usbkey called config.inc. Put a text file here called authorized_keys containing the contents of id_rsa.pub from the NMS. Add the following to /usbkey/config:
config_inc_dir=config.inc
root_authorized_keys_file=authorized_keys
SSH public key auth is now set up, but you probably need to reboot the SmartOS host first for the config to be read.
Nagios and scripts
Use check_by_ssh to create a new check command, it should look something like this.
I then added my nagios check scripts to a directory called /opt/custom/libexec (which is actually located under /zones) and I’m using check_by_ssh to call them. Works like a charm, although it might not be the best or most secure way to do it. But imho it’s better than installing NRPE and probably breaking a lot of the intended slimness of the SmartOS global zone.
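To make the idea concrete, here is a rough Python 3 sketch of what a check over SSH boils down to: run the plugin on the remote host, hand its output back, and pass its exit code through. This is not how check_by_ssh itself is implemented. The function name run_remote_check, the injectable runner parameter, and the default root user are assumptions for illustration only.

```python
import shlex
import subprocess


def run_remote_check(host, script, args=(), user="root", timeout=30,
                     runner=subprocess.run):
    """Run a plugin on a remote host over SSH; return (exit_code, output)."""
    # ssh hands the remote command to a shell, so quote it as one string.
    remote = shlex.join([script, *(str(a) for a in args)])
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # which is what you want when key auth is supposed to be in place.
    cmd = ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", remote]
    try:
        proc = runner(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return 3, f"UNKNOWN: could not run {script} on {host}: {exc}"
    # ssh passes through the remote exit status; anything outside the
    # Nagios range 0-3 (such as ssh's own 255) is reported as UNKNOWN.
    code = proc.returncode if proc.returncode in (0, 1, 2, 3) else 3
    return code, proc.stdout.strip()
```

The runner parameter only exists so the function can be exercised without a real SSH connection; in normal use you would leave it at its default.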
Nagios script checking hard drive S.M.A.R.T status
A couple of days ago another drive failed in my file server. No big deal. I found out by checking the S.M.A.R.T status of the drive as I thought it was behaving a bit odd. This confirmed my concerns, the disk was about to fail.
Now, wouldn’t it be nice to get some kind of warning if this is about to happen again? To address the problem I wrote a script to be used in Nagios (or op5 in my case) that periodically checks the S.M.A.R.T status of all the disks in the file server. You can download the script here: check_smart.py.
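The check_smart.py script itself isn't reproduced here, so the following is only a sketch of the general approach in Python 3, not the author's code. It assumes you already have the text output of smartctl -H for each disk and that the output contains the usual "overall-health self-assessment test result" line. The names disk_health and summarize are made up for this example.

```python
HEALTH_MARKER = "overall-health self-assessment test result"


def disk_health(smartctl_output):
    """Map the `smartctl -H` output for one disk to a Nagios exit code."""
    for line in smartctl_output.splitlines():
        if HEALTH_MARKER in line:
            result = line.split(":", 1)[1].strip()
            return 0 if result == "PASSED" else 2
    return 3  # no health line found: UNKNOWN


def summarize(outputs):
    """outputs maps disk name -> smartctl -H text; the worst disk wins."""
    if not outputs:
        return 3, "UNKNOWN: no disks given"
    codes = {disk: disk_health(text) for disk, text in outputs.items()}
    # Rank CRITICAL above UNKNOWN above OK when picking the overall state.
    severity = {0: 0, 3: 1, 2: 2}
    worst = max(codes.values(), key=severity.get)
    failing = sorted(d for d, c in codes.items() if c != 0)
    label = {0: "OK", 2: "CRITICAL", 3: "UNKNOWN"}[worst]
    detail = "all disks passed" if not failing else "check " + ", ".join(failing)
    return worst, f"{label}: {detail}"
```

A wrapper would collect the smartctl output per disk, call summarize, print the message and exit with the returned code.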
Custom graphs for services checked via NRPE
Most of us running a Nagios-based monitoring system (like op5) use NRPE (Nagios Remote Plugin Executor). Normally you have a check called check_nrpe which takes one argument: the name of the service to check, as configured in nrpe.cfg on the host being checked. Since there’s no graph template that will suit all needs for NRPE checks, it uses the default template. This template isn’t very exciting; in fact it’s pretty dull. It’s OK for most services, but sometimes you’ll want a custom template here too. This can be achieved by adding a custom check for each type of NRPE check that’s supported.
Here’s an example. I’m using check_load on a couple of my UNIX/Linux servers to check/graph the load. Normally I’d use check_nrpe and use check_load as an argument (since this is what’s configured on each host being checked). To be able to add a custom graph template that pnp4nagios can use to draw the graph, I’ll have to add a new service check. In this case it’s called “check_nrpe_load”.
When this is added, replace your current service check for the host(s) with the newly created check_nrpe_load. At first you shouldn’t see any difference in the graph since we haven’t added a custom template yet.
I’m discussing creating new templates here. In this example I want the same template as used by the local check_load written by op5. I’ll just copy check_load.php from /opt/monitor/op5/pnp/templates.dist to check_nrpe_load.php and put it in /opt/monitor/op5/pnp/templates, and that should be it (modify paths etc. if you’re using some other Nagios-based NMS). Wait a couple of minutes for the service to get checked.
This is what I ended up with, much better than the default template (which, by the way, doesn’t show load1, load5 and load15 in the same graph).
Two new Nagios plugins for checking the EDS 1-wire server
Tonight I’ve completed two new plugins for Nagios (or op5 in my case). They’re used to check DS18B20 and DS2438, i.e. temperature and relative humidity sensors in a 1-wire network. Since I have my 1-wire network(s) connected to a 1-wire server from Embedded Data Systems I wrote the plugins to pull the data from this device. They both use XML parsing and retrieve the data from a file called details.xml on the 1-wire server.
The first plugin is for checking the humidity sensor.
#!/usr/bin/python
# coding: utf-8
#
# Check the humidity probes from a DS2438 sensor on EDS 1-wire server
#
# By Marcus Wilhelmsson
# marcus@nickebo.net
# http://www.nickebo.net
# Licence GPLv2
# Version 0.1
import string, sys, os, argparse, urllib2, xml
from xml.dom.minidom import parseString
# Parse arguments
parser = argparse.ArgumentParser(description='Check humidity probes on EDS 1-wire server')
parser.add_argument('-H', action="store", dest="hostname", help='1-wire server hostname')
parser.add_argument('-p', action="store", dest="probe", type=int, help='Probe number, starting with 0')
parser.add_argument('-w', action="store", dest="warn", type=int, help='Warning humidity')
parser.add_argument('-c', action="store", dest="crit", type=int, help='Critical humidity')
results = parser.parse_args()
# Store parsed arguments in variables and make sure they're not empty
warn = results.warn
crit = results.crit
probe = results.probe
# Exit with 3 (UNKNOWN) on usage errors so Nagios doesn't record them as OK
if (warn is None or crit is None or probe is None or results.hostname is None):
    parser.print_help()
    raise SystemExit(3)
if (crit <= warn):
    print "Critical humidity can't be less or equal to warning humidity"
    parser.print_help()
    raise SystemExit(3)
#Connect to the 1-wire server and download details.xml
hostname = results.hostname
try:
    file = urllib2.urlopen('http://' + hostname + '/details.xml')
except:
    print "Could not connect to " + hostname
    raise SystemExit(3)
#Read the data from the XML file and close it
data = file.read()
file.close()
#Parse XML
try:
    dom = parseString(data)
    xmlTag = dom.getElementsByTagName('Humidity')[probe].toxml()
    sensor = float(xmlTag.replace('<Humidity Units="PercentRelativeHumidity">','').replace('</Humidity>',''))
except:
    print "Error reading 1-wire probe"
    raise SystemExit(3)
# Print status
if (sensor < warn):
    print "OK: Humidity: " + str(sensor) + " %|Humidity=" + str(sensor) + "%;" + str(warn) + ";" + str(crit) + ";" + str(sensor-10) + ";" + str(sensor+10)
    raise SystemExit(0)
elif (sensor >= warn and sensor < crit):
    print "WARNING: Humidity: " + str(sensor) + " %|Humidity=" + str(sensor) + "%;" + str(warn) + ";" + str(crit) + ";" + str(sensor-10) + ";" + str(sensor+10)
    raise SystemExit(1)
else:
    print "CRITICAL: Humidity: " + str(sensor) + " %|Humidity=" + str(sensor) + "%;" + str(warn) + ";" + str(crit) + ";" + str(sensor-10) + ";" + str(sensor+10)
    raise SystemExit(2)
The second script is for checking temperature probes.
#!/usr/bin/python
# coding: utf-8
#
# Check the temperature probes on EDS 1-wire server
#
# By Marcus Wilhelmsson
# marcus@nickebo.net
# http://www.nickebo.net
# Licence GPLv2
# Version 0.1
import string, sys, os, argparse, urllib2, xml
from xml.dom.minidom import parseString
# Parse arguments
parser = argparse.ArgumentParser(description='Check temperature probes on EDS 1-wire server')
parser.add_argument('-H', action="store", dest="hostname", help='1-wire server hostname')
parser.add_argument('-p', action="store", dest="probe", type=int, help='Probe number, starting with 0')
parser.add_argument('-w', action="store", dest="warn", type=int, help='Warning temperature')
parser.add_argument('-c', action="store", dest="crit", type=int, help='Critical temperature')
results = parser.parse_args()
# Store parsed arguments in variables and make sure they're not empty
warn = results.warn
crit = results.crit
probe = results.probe
# Exit with 3 (UNKNOWN) on usage errors so Nagios doesn't record them as OK
if (warn is None or crit is None or probe is None or results.hostname is None):
    parser.print_help()
    raise SystemExit(3)
if (crit <= warn):
    print "Critical temperature can't be less or equal to warning temperature"
    parser.print_help()
    raise SystemExit(3)
#Connect to the 1-wire server
hostname = results.hostname
try:
    file = urllib2.urlopen('http://' + hostname + '/details.xml')
except:
    print "Could not connect to " + hostname
    raise SystemExit(3)
#Read the XML file and close it
data = file.read()
file.close()
#Parse XML
try:
    dom = parseString(data)
    xmlTag = dom.getElementsByTagName('Temperature')[probe].toxml()
    sensor = float(xmlTag.replace('<Temperature Units="Centigrade">','').replace('</Temperature>',''))
except:
    print "Error reading 1-wire probe"
    raise SystemExit(3)
# Print status
if (sensor < warn):
    print "OK: Temperature: " + str(sensor) + " C|Temperature=" + str(sensor) + ";" + str(warn) + ";" + str(crit) + ";" + str(warn-5) + ";" + str(crit+5)
    raise SystemExit(0)
elif (sensor >= warn and sensor < crit):
    print "WARNING: Temperature: " + str(sensor) + " C|Temperature=" + str(sensor) + ";" + str(warn) + ";" + str(crit) + ";" + str(warn-5) + ";" + str(crit+5)
    raise SystemExit(1)
else:
    print "CRITICAL: Temperature: " + str(sensor) + " C|Temperature=" + str(sensor) + ";" + str(warn) + ";" + str(crit) + ";" + str(warn-5) + ";" + str(crit+5)
    raise SystemExit(2)
They’re both licensed under GPL, so feel free to use them.
I’ve sent both plugins to Nagios Exchange, hopefully they’ll be up there in a couple of days tops.
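Both plugins above pull the reading out of the XML by serializing the element and stripping the tags with string replacement. An alternative, shown here only as a sketch in Python 3, is to read the element's text node directly, which keeps working if the Units attribute ever changes. The function name read_probe and the sample document are made up for illustration; only the Temperature and Humidity element names and their Units attributes are taken from the scripts, and the real details.xml layout may differ.

```python
from xml.dom.minidom import parseString

# Hypothetical sample; the wrapper element names are assumptions.
SAMPLE = """<Devices>
  <owd_DS18B20><Temperature Units="Centigrade">21.5</Temperature></owd_DS18B20>
  <owd_DS18B20><Temperature Units="Centigrade">19.25</Temperature></owd_DS18B20>
  <owd_DS2438><Humidity Units="PercentRelativeHumidity">43.0</Humidity></owd_DS2438>
</Devices>"""


def read_probe(xml_text, tag, probe):
    """Return the value of the probe-th <tag> element as a float."""
    dom = parseString(xml_text)
    nodes = dom.getElementsByTagName(tag)
    # firstChild is the text node inside the element; .data is its content.
    return float(nodes[probe].firstChild.data)


temperature = read_probe(SAMPLE, "Temperature", 1)
humidity = read_probe(SAMPLE, "Humidity", 0)
```

As in the original plugins, an out-of-range probe number raises an exception, which a plugin would catch and report as UNKNOWN.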
Mac OS X CPU temperature check script 0.1
I’ve rewritten and overhauled another Nagios plugin. It’s used to check the CPU temperature in Mac OS X.
#!/usr/bin/python
# coding: utf-8
#
# Check the CPU temperature on Mac OS X
# Requires Temperature Monitor by Marcel Bresink Software-Systeme
# Can be downloaded from http://www.bresink.com/osx/TemperatureMonitor.html
#
# By Marcus Wilhelmsson
# marcus@nickebo.net
# http://www.nickebo.net
# Licence GPLv2
# Version 0.1
import string, sys, os, argparse
tempmonitor = "/Applications/TemperatureMonitor.app/Contents/MacOS/tempmonitor"
# Check if above binary really exists
if not os.path.isfile(tempmonitor):
    print "Temperature Monitor not installed in /Applications"
    raise SystemExit(3)
# Parse arguments
parser = argparse.ArgumentParser(description='Check CPU temperature on Mac OS X using Temperature Monitor')
parser.add_argument('-w', action="store", dest="warn", type=int, help='Warning temperature')
parser.add_argument('-c', action="store", dest="crit", type=int, help='Critical temperature')
results = parser.parse_args()
# Store parsed arguments in variables and make sure they're not empty
warn = results.warn
crit = results.crit
# Exit with 3 (UNKNOWN) on usage errors so Nagios doesn't record them as OK
if (warn is None or crit is None):
    parser.print_help()
    raise SystemExit(3)
# Read the CPU temperature
try:
    sensor = int(os.popen(tempmonitor + " | awk '{print $1}'").read())
except:
    print "Error reading CPU temperature"
    raise SystemExit(3)
# Print status
if (sensor < warn):
    print "OK: Temperature: " + str(sensor) + " C|Temperature=" + str(sensor) + ";" + str(warn) + ";" + str(crit) + ";" + str(warn-5) + ";" + str(crit+5)
    raise SystemExit(0)
# Use >= so a reading exactly at the warning threshold is not reported as CRITICAL
elif (sensor >= warn and sensor < crit):
    print "WARNING: Temperature: " + str(sensor) + " C|Temperature=" + str(sensor) + ";" + str(warn) + ";" + str(crit) + ";" + str(warn-5) + ";" + str(crit+5)
    raise SystemExit(1)
else:
    print "CRITICAL: Temperature: " + str(sensor) + " C|Temperature=" + str(sensor) + ";" + str(warn) + ";" + str(crit) + ";" + str(warn-5) + ";" + str(crit+5)
    raise SystemExit(2)
Just as with the HDD temperature script, it’s been submitted to Nagios Exchange. You can download the source file from here.
Hard drive temperature check script version 0.1
I’ve more or less rewritten my Nagios script for checking hard drive temperatures. It still gives you a mean value of all the hard drives checked, but it’s a proper script now with error handling and arguments.
#!/usr/local/bin/python
# coding: utf-8
#
# Check the temperature of hard drives and calculate a mean value
# Requires smartmontools and privileges to check the disk temperatures
#
# By Marcus Wilhelmsson
# marcus@nickebo.net
# http://www.nickebo.net
# Licence GPLv2
# Version 0.1
import string, sys, os, argparse
# Set full directory and name of the smartctl binary, change if needed
smartctlbin="/usr/local/sbin/smartctl"
# Check if above binary really exists
if not os.path.isfile(smartctlbin):
    print "smartctl binary not found at " + smartctlbin
    raise SystemExit(3)
# Parse arguments
parser = argparse.ArgumentParser(description='Check hard drive temperatures using smartmontools')
parser.add_argument('-w', action="store", dest="warn", type=int, help='Warning temperature')
parser.add_argument('-c', action="store", dest="crit", type=int, help='Critical temperature')
parser.add_argument(nargs='*', action='store', dest='disk', help='Disks to check: /dev/sda /dev/sdb /dev/sdc etc.')
results = parser.parse_args()
# Store parsed arguments in variables and make sure they're not empty
warn = results.warn
crit = results.crit
disks = results.disk
# Exit with 3 (UNKNOWN) on usage errors so Nagios doesn't record them as OK
if (warn is None or crit is None or disks == []):
    parser.print_help()
    raise SystemExit(3)
# Do the actual disk temperature checks
total = 0
for x in disks:
    if os.path.exists(x):
        try:
            total = total + int(os.popen(smartctlbin + " -a " + x + "| grep Celsius | awk '{print $10}'").read())
        except:
            print "Error checking " + x + ". Is it a valid hard drive?"
            raise SystemExit(3)
    else:
        print "Disk " + x + " does not exist. Exiting."
        raise SystemExit(3)
try:
    total = total/len(disks)
except:
    print "Error calculating temperature mean value"
    raise SystemExit(3)
# Print status and make sure critical is greater than warning
if warn >= crit:
    print "ERROR: Critical must be greater than warning"
    raise SystemExit(3)
if (total < warn):
    print "OK: Temperature: " + str(total) + " C|Temperature=" + str(total) + ";" + str(warn) + ";" + str(crit) + ";" + str(warn-5) + ";" + str(crit+5)
    raise SystemExit(0)
elif (total >= warn and total < crit):
    print "WARNING: Temperature: " + str(total) + " C|Temperature=" + str(total) + ";" + str(warn) + ";" + str(crit) + ";" + str(warn-5) + ";" + str(crit+5)
    raise SystemExit(1)
else:
    print "CRITICAL: Temperature: " + str(total) + " C|Temperature=" + str(total) + ";" + str(warn) + ";" + str(crit) + ";" + str(warn-5) + ";" + str(crit+5)
    raise SystemExit(2)
I’ve licensed it under GNU GPL version 2.
I’ll post it on Nagios Exchange as soon as I get a confirmation mail for my account…
The script can be downloaded here.
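All the plugins on this page end with the same pattern: compare a value against the warning and critical thresholds, print a status line with performance data after the | character, and exit with 0, 1 or 2. As a sketch of that pattern in one place (Python 3, with a made-up helper name nagios_result, and leaving out the optional min/max fields the scripts above append), it could look like this:

```python
def nagios_result(label, value, warn, crit, unit=""):
    """Return (exit_code, status_line) for a value where higher is worse."""
    if value < warn:
        code, state = 0, "OK"
    elif value < crit:
        code, state = 1, "WARNING"
    else:
        code, state = 2, "CRITICAL"
    # Performance data follows the pattern label=value[unit];warn;crit
    perfdata = f"{label}={value}{unit};{warn};{crit}"
    return code, f"{state}: {label}: {value}{unit}|{perfdata}"


code, line = nagios_result("Temperature", 38, 40, 50)
```

A plugin would then print the line and exit with the code, for example via raise SystemExit(code).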



