A while ago I wrote a Nagios plugin to monitor CPU load in SmartOS. I’ve updated that plugin to support individual monitoring of each core, which gives you more precise monitoring. In addition I’ve created a template for PNP4Nagios which you can use with the new script. It handles any number of cores automatically, depending on how many are fed into Nagios by the monitoring script.
Adding an environment variable to a service in SMF
Since I’m using NRPE to monitor my SmartOS node I want to be able to use Nagios plugins. The plugins, when installed with pkgin or manually, are added to /opt/local/libexec/nagios or /opt/custom/libexec/nagios, and the libraries the plugins depend on are installed to /opt/local/lib or /opt/custom/lib, depending on your preferences. These directories aren’t in the default LD_LIBRARY_PATH in SmartOS, which means NRPE won’t find the libraries.
If you want NRPE to find the libraries you need to set the LD_LIBRARY_PATH environment variable manually. This is done by adding it to the SMF manifest. Hopefully you’ve added your manifest XML file to /opt/custom/smf. Edit it and change from:
<exec_method name='start' type='method' exec='/opt/custom/sbin/nrpe -c /opt/custom/etc/nrpe.cfg -d' timeout_seconds='0'/>
to
<exec_method name='start' type='method' exec='/opt/custom/sbin/nrpe -c /opt/custom/etc/nrpe.cfg -d' timeout_seconds='0'>
  <method_context>
    <method_environment>
      <envvar name='LD_LIBRARY_PATH' value='/lib:/usr/lib:/usr/local/lib:/opt/local/lib'/>
    </method_environment>
  </method_context>
</exec_method>
This will set the variable each time SMF starts the service, and NRPE will find the libraries.
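If you would rather script the change than edit the XML by hand, something along these lines should do it. This is only a sketch: the helper name add_envvar is made up for the example, and it is written for Python 3 with the standard library's ElementTree. Note that ElementTree drops the XML declaration and DOCTYPE when it serialises, so those would have to be put back at the top of the file before importing the manifest.

```python
import xml.etree.ElementTree as ET


def add_envvar(manifest_xml, name, value, method="start"):
    """Return the manifest XML with an <envvar> set on the given exec_method.

    add_envvar is a hypothetical helper, not part of SMF or NRPE.
    """
    root = ET.fromstring(manifest_xml)
    for exec_method in root.iter("exec_method"):
        if exec_method.get("name") != method:
            continue
        # Create <method_context> and <method_environment> if they are missing
        context = exec_method.find("method_context")
        if context is None:
            context = ET.SubElement(exec_method, "method_context")
        env = context.find("method_environment")
        if env is None:
            env = ET.SubElement(context, "method_environment")
        # Update an existing variable, otherwise append a new one
        for var in env.findall("envvar"):
            if var.get("name") == name:
                var.set("value", value)
                break
        else:
            ET.SubElement(env, "envvar", {"name": name, "value": value})
    return ET.tostring(root, encoding="unicode")
```

Calling add_envvar(text, "LD_LIBRARY_PATH", "/lib:/usr/lib:/usr/local/lib:/opt/local/lib") on the manifest text gives the same structure as the hand-edited version above.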
Installing NRPE in the SmartOS global zone
Are you using Nagios, op5 or some other Nagios based monitoring solution? Then you’ll probably want NRPE installed in the global zone of your SmartOS nodes to monitor them. This is how to do it.
Installing NRPE
Since the GZ doesn’t come with pkgin to install packages by default you have two choices. Either install pkgin as described in http://wiki.smartos.org/display/DOC/Installing+pkgin or install NRPE manually (or with a management tool like Chef, CFengine etc.). I had problems using pkgin in the GZ; for some reason it didn’t install all files in the NRPE packages.
I’m showing you how to do it manually. Start by downloading NRPE from the latest pkgsrc available at smartos.org, for example http://pkgsrc.smartos.org/packages/SmartOS/2012Q4-multiarch/All/nagios-nrpe-2.12nb3.tgz. Unpack the file and copy the contents to /opt/custom. /opt/custom is one of the few places in your SmartOS installation where you can have stuff that persists through reboots. Remove the files starting with +, those contain package information needed by pkgin. NRPE should hopefully be installed under /opt/custom/sbin/nrpe.
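The unpack step can also be scripted. Below is a rough Python 3 sketch that extracts a downloaded package into a target directory and skips the metadata files starting with +. The function name unpack_package is an assumption made for this example, and the sketch does no checking of the archive contents, so only use it on packages you trust.

```python
import os
import tarfile


def unpack_package(tgz_path, dest):
    """Extract a pkgsrc package into dest, skipping the +METADATA files.

    unpack_package is a hypothetical helper written for this example.
    """
    extracted = []
    # "r:*" lets tarfile detect the compression by itself
    with tarfile.open(tgz_path, "r:*") as tar:
        for member in tar.getmembers():
            # Files like +CONTENTS or +DESC only carry package information
            if os.path.basename(member.name).startswith("+"):
                continue
            tar.extract(member, dest)
            extracted.append(member.name)
    return extracted
```

For example, unpack_package("nagios-nrpe-2.12nb3.tgz", "/opt/custom") would leave the binary at /opt/custom/sbin/nrpe if the package is laid out as described above.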
Configuration of NRPE
Under /opt/custom/share/examples/nagios/nrpe.cfg you can find an example configuration for NRPE. I copied this file to /opt/custom/etc and modified it for my needs. As the GZ doesn’t have a user called “nagios” I’m running it as “nobody”. Also make sure the commands in nrpe.cfg point at your scripts; mine are installed under /opt/custom/libexec/nagios. Just to warn you: the scripts included in the nagios plugins package from smartos.org won’t work in the GZ, since some libraries are missing.
Configure SMF
NRPE should in most cases be started when the SmartOS node starts. To accomplish this a manifest has to be added to SMF, the Service Management Facility. SMF manifests are written in XML, create a manifest XML-file that looks like this.
Now, import it with svccfg import manifest.xml and enable it with svcadm enable nrpe. NRPE should now be up and running. Finally copy your manifest XML file to /opt/custom/smf. This will import the manifest each time the system boots.
Monitoring refrigerator and freezer temperatures
It’s not a secret, I’m a sucker for monitoring. My latest project is to monitor the temperatures of my fridge and freezer. Mainly for the fun of it, but it’s also a good thing I get a mail if they break and stop cooling.
It’s all integrated into my op5 system, as everything else. I’m using a separate OW-SERVER connected to the rest of the network via WiFi through an Airport Express.
I’ve learned a great deal about my appliances. First of all, they don’t keep the temperature I set them to. They actually go up and down a lot; I guess the drops happen when the temperature gets too high and the compressor kicks in. I’ve also noticed that my freezer raises the temperature a LOT once a day. My guess is that this is the anti-frost mechanism.
Above you can see the temperatures varying. The fridge temp is for the last 24 hrs, the freezer temp is from the last week.
Monitoring a SmartOS hypervisor from Nagios
The global zone in SmartOS is, well… It’s not that comprehensive when it comes to programs installed. Since it’s to be regarded as a live image, you shouldn’t really install programs here. With that in mind, how does one monitor SmartOS from Nagios (or in my case op5)? The easiest solution is to use check_by_ssh and put all the monitoring scripts somewhere under /zones (so they don’t disappear if the machine reboots).
check_by_ssh uses, you’ve probably already guessed, SSH to get the monitoring information from the monitored host to the NMS (Network Monitoring System). To use this method you’ll have to set up pub key auth, so this is step 1.
Setting up pub key auth with SSH
The user running Nagios (or op5) needs an SSH key pair; use ssh-keygen to create it on the NMS. The newly created file called ~/.ssh/id_rsa.pub contains the public key. This needs to be transferred and inserted into a file on the SmartOS host.
It’s time to set up the 2nd part of pub key auth, now on the SmartOS host. Create a new directory under /usbkey called config.inc. Put a textfile here called authorized_keys containing the info from id_rsa.pub from the NMS. Add the following to /usbkey/config
config_inc_dir=config.inc
root_authorized_keys_file=authorized_keys
SSH public key auth is now set up, but you probably need to reboot the SmartOS host first for the config to be read.
Nagios and scripts
Use check_by_ssh to create a new check command, it should look something like this.
I then added my nagios check scripts to a directory called /opt/custom/libexec (which is actually located under /zones) and I’m using check_by_ssh to call them. Works like a charm, although it might not be the best or most secure way to do it. But imho it’s better than installing NRPE and probably breaking a lot of the intended slimness of the SmartOS global zone.
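To make the moving parts a bit more concrete, here is a small Python 3 sketch that builds the command line check_by_ssh ends up being called with. The options used (-H, -l, -t, -i and -C) are the regular check_by_ssh ones, but the function name, the path to check_by_ssh on the NMS and the example script name check_cpu are assumptions for illustration only.

```python
import shlex


def check_by_ssh_argv(host, remote_plugin, plugin_args=(), user="root",
                      identity=None, timeout=30,
                      binary="/usr/lib/nagios/plugins/check_by_ssh"):
    """Build the argument list for running a remote plugin via check_by_ssh.

    The default binary path is an assumption; adjust it to your NMS.
    """
    # Quote every part so the remote shell sees each argument as one word
    remote_cmd = " ".join(shlex.quote(p) for p in (remote_plugin, *plugin_args))
    argv = [binary, "-H", host, "-l", user, "-t", str(timeout)]
    if identity:
        # Private key matching the public key in authorized_keys
        argv += ["-i", identity]
    argv += ["-C", remote_cmd]
    return argv
```

Calling check_by_ssh_argv("smartos01", "/opt/custom/libexec/check_cpu", ["-w", "80"]) returns a list ending in "-C" followed by "/opt/custom/libexec/check_cpu -w 80", which is the same thing a Nagios command definition would spell out with $HOSTADDRESS$ and $ARG1$ macros.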
Problems during update to op5 5.6.0
Yesterday I updated to op5 5.6.0 and it broke. Big time! Every time I tried to view Tactical overview, Services, basically anything using Ninja I just got a stack trace. I tried reinstalling Ninja using yum, but with no luck. However it told me to run a script called /opt/monitor/op5/ninja/install_scripts/ninja_db_init.sh, but this script gave me the following errors:
[root@charon marcus]# /opt/monitor/op5/ninja/install_scripts/ninja_db_init.sh
Installing database tables for Ninja GUI
/opt/monitor/op5/ninja/install_scripts/ninja_db_init.sh: line 42: [: : integer expression expected
Installing database tables for SLA report configuration
/opt/monitor/op5/ninja/op5-upgradescripts/merlin-reports-db-upgrade.sh: line 52: [: : integer expression expected
Installing database tables for AVAIL report configuration
/opt/monitor/op5/ninja/op5-upgradescripts/merlin-reports-db-upgrade.sh: line 90: [: : integer expression expected
Installing database tables for scheduled reports configuration
/opt/monitor/op5/ninja/op5-upgradescripts/merlin-reports-db-upgrade.sh: line 136: [: : integer expression expected
Database upgrade complete.
After some research I concluded that the script couldn’t log into the database, the root account in MySQL had a password set all of a sudden. (op5 uses a blank root password by default).
I used the instructions at http://dev.mysql.com/doc/refman/5.0/en/resetting-permissions.html#resetting-permissions-unix to reset the root password and reinstalled all op5 packages. This didn’t work, so I ran the /opt/monitor/op5/ninja/install_scripts/ninja_db_init.sh script manually. This did the trick!
[root@charon ~]# /opt/monitor/op5/ninja/install_scripts/ninja_db_init.sh
Upgrading ninja db from v4 to v5
Upgrading AVAIL tables from v8 to v9 ... done.
Upgrading scheduled reports tables from v7 to v8.sql ... done.
Importing old scheduled reports
Schedules seems to be already imported.
Database upgrade complete.
As you can see all the errors are gone and the upgrade seemed to be successful, and it was successful!
Now, I’m not sure if it was something I did or a bug in op5. I’m just very pleased that I fixed the problem.
Temperature page up again
My temperature page is now up and running again. Right now it shows you the temperature outside my apartment and in my server cabinet. I’ll try to add some more data, since I’m monitoring a lot more than this. Enjoy.
Monitoring display prototype working
Finally I’m home again after a week taking care of my parents’ house and dog while they were on vacation. I’ve continued work on my Arduino project and today I connected it to my network monitoring software op5 to see if it worked as intended. It did! Maybe you’re wondering what it’s supposed to do?
It’s basically a two-row LCD display, two LEDs and an Arduino. This is connected to the computer running op5 via USB. A program, written in Python, is running on the op5 computer. It transmits the current outdoor and indoor temperatures, which are displayed on the LCD, and in addition the LEDs (a green and a red one) show me the overall status of my network. If the green LED is lit everything’s OK, if the red one is lit something’s wrong. Pretty simple.
The monitoring server
Let’s start with the monitoring server. It has a Python program running which sends data to the Arduino.
#!/usr/bin/python
import os
import time
import serial

# Serial port the Arduino is connected to
serial_name = "/dev/ttyUSB0"
# SNMP settings for the 1-wire server
hostname = "jupiter.nickebo.net"
community = "public"
oid1 = "1.3.6.1.4.1.31440.10.5.1.1.0"
oid2 = "1.3.6.1.4.1.31440.10.5.1.1.2"

def printLCD():
    # configure the serial connection (the parameters differ depending on the device you are connecting to)
    try:
        ser = serial.Serial(serial_name,9600)
    except:
        print "Could not open serial port " + serial_name
        raise SystemExit()
    # give the Arduino time to get ready after the port is opened
    time.sleep(2)
    # row 0 of the LCD: outdoor temperature
    print "Printing: Out: " + str(round(sensor, 1))
    ser.write('0Out: ' + str(round(sensor, 1)) + ' C')
    time.sleep(2)
    # row 1 of the LCD: indoor temperature
    print "Printing: In: " + str(round(sensor2, 1))
    ser.write('1In: ' + str(round(sensor2, 1)) + ' C')
    time.sleep(2)
    # ask op5 for the service status, anything above 0 means trouble
    statusfile = os.popen("/opt/monitor/bin/monitorstats |grep 'Services Ok'|awk '{print $5 $7 $9}'")
    status = int(statusfile.read())
    statusfile.close()
    print "op5 status: " + str(int(status))
    if(int(status) > 0):
        print "alarm"
        ser.write('alarm')
    else:
        print "reset"
        ser.write('reset')
    ser.close()

while 1:
    # Fetch both temperatures from the 1-wire server via SNMP
    try:
        sensorfile = os.popen('/usr/bin/snmpget -c ' + community + ' -v 2c ' + hostname + ' ' + oid1 + ' |cut -d \\" -f 2')
        sensor = str(sensorfile.read())
        sensor = float(sensor)
    except:
        print "Could not connect to " + hostname + " or error reading probe"
        raise SystemExit(3)
    try:
        sensorfile2 = os.popen('/usr/bin/snmpget -c ' + community + ' -v 2c ' + hostname + ' ' + oid2 + ' |cut -d \\" -f 2')
        sensor2 = str(sensorfile2.read())
        sensor2 = float(sensor2)
    except:
        print "Could not connect to " + hostname + " or error reading probe"
        raise SystemExit(3)
    printLCD()
    time.sleep(10)
This is pretty much a dirty hack, but it works. I use a while-loop which tries to fetch the temperature data from my 1-wire server via SNMP. If this succeeds it calls the printLCD function which sends data to the Arduino. The function also checks the status of op5, sending “alarm” if something’s wrong and “reset” if everything’s OK. As you can see the data sent for the temperature readings start with either 0 or 1, this is a simple header which tells the Arduino if the data is to be printed on line 0 or 1 on the LCD. Unfortunately I’ve been too lazy to do proper comments in the code, this will be sorted in the final version.
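The framing described above is simple enough to pull out into two small functions. The following is a Python 3 sketch of that idea and not the script running on the server; the names lcd_frame and status_frame and the 16 column limit taken from the display size are assumptions made for the example.

```python
LCD_COLUMNS = 16  # the display used here has 16 columns per row


def lcd_frame(row, label, value, unit="C"):
    """Build one temperature message: a row header (0 or 1) plus the text."""
    if row not in (0, 1):
        raise ValueError("row must be 0 or 1")
    text = "%s: %.1f %s" % (label, value, unit)
    # Cut the text so it never exceeds one LCD row
    return (str(row) + text[:LCD_COLUMNS]).encode("ascii")


def status_frame(problem_count):
    """Build the LED message: 'alarm' if anything is wrong, else 'reset'."""
    return b"alarm" if problem_count > 0 else b"reset"
```

With pyserial on Python 3 the result can be handed straight to ser.write(), since write() expects bytes there.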
If you’ve been able to decrypt my crappy Python code above you might want to see the other side of the serial line? Namely the code for the Arduino.
The Arduino, where the real magic happens
The Arduino uses C++ (well, it’s VERY similar to C++) for the programming part. I’m using a library called LiquidCrystal to control the HD44780 LCD display, which makes it a whole lot easier. Besides that, the challenge is handling the incoming serial data and sorting it out. Now, the code.
#include <LiquidCrystal.h>
// initialize the library with the numbers of the interface pins
LiquidCrystal lcd(12, 11, 5, 4, 3, 2);

void setup(){
  // set up the LCD's number of columns and rows:
  lcd.begin(16, 2);
  // initialize the serial communications:
  Serial.begin(9600);
  // Green LED
  pinMode(8, OUTPUT);
  // Red LED
  pinMode(9, OUTPUT);
  // Set the green LED to high
  digitalWrite(8, HIGH);
}

void loop()
{
  // when characters arrive over the serial port...
  if (Serial.available()) {
    char inData[18];
    char inChar=-1;
    char cmdVar[2];
    byte index=0;
    // wait a bit for the entire message to arrive
    delay(100);
    // read all the available characters
    while (Serial.available() > 0) {
      if(index < 17) // One less than the size of the array
      {
        inChar = Serial.read(); // Read a character
        inData[index] = inChar; // Store it
        if(index == 0)
        {
          cmdVar[0] = inChar;
          cmdVar[1] = '\0';
        }
        index++; // Increment where to write next
        inData[index] = '\0'; // Null terminate the string
      }
      else
      {
        Serial.read(); // Buffer is full, throw away the rest so the loop can end
      }
    }
    if(strcmp(inData,"alarm") == 0)
    {
      digitalWrite(9, HIGH);
      digitalWrite(8, LOW);
    }
    else if(strcmp(inData,"reset") == 0)
    {
      digitalWrite(9, LOW);
      digitalWrite(8, HIGH);
    }
    else
    {
      byte i=1;
      char dispVar[18];
      // Copy everything after the header character
      for(i=1;i<17;i++)
      {
        dispVar[i-1] = inData[i];
      }
      dispVar[16] = '\0'; // Make sure the display string is always terminated
      Serial.write(cmdVar);
      if(strcmp(cmdVar,"0") == 0)
      {
        //lcd.clear();
        lcd.setCursor(0,0);
        lcd.write(dispVar);
      }
      else if(strcmp(cmdVar,"1") == 0)
      {
        //lcd.clear();
        lcd.setCursor(0,1);
        lcd.write(dispVar);
      }
    }
  }
}
At first glance this code is “WTF? What did he do? This is the crappiest code I’ve ever…” and yes, it IS crappy. But it works, and I haven’t refined it in any way. So, what does it do? At the top I load the library and initialize the LCD. I also set digital pins 8 and 9 as outputs for my LEDs. The code assumes everything is OK and sets the green LED to high, which means it will be lit. Now, if data is received on the serial port (emulated via an FTDI chip over USB) the data is put in inData, and inData is null terminated with '\0'. This is straightforward: receive the data and put it in an array. The first char is also placed in a separate variable called cmdVar. I then check if the data matches “alarm” or “reset”, and if so the red or green LED is lit. If not, I look at cmdVar. If it’s equal to 0 or 1 I print the rest of the data on either row 0 or row 1. If the first char is neither 0 nor 1 nothing happens; the data isn’t valid since I don’t know which line to print it on.
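The decision logic is easier to follow (and to test without hardware) when it is written as a plain function. Here is a Python 3 model of what the sketch does with one received message. It is only an illustration of the logic described above; the function name dispatch and the tuples it returns are made up for this example.

```python
def dispatch(message):
    """Return the action taken for one received message.

    ("led", colour)      for the alarm/reset commands
    ("lcd", row, text)   for a line of text with a valid row header
    ("ignore",)          for anything else
    """
    if message == "alarm":
        return ("led", "red")
    if message == "reset":
        return ("led", "green")
    if message[:1] in ("0", "1"):
        # First character selects the row, up to 16 characters are shown
        return ("lcd", int(message[0]), message[1:17])
    return ("ignore",)
```

For example, dispatch("0Out: 12.3 C") gives ("lcd", 0, "Out: 12.3 C"), while a message starting with any other character is ignored.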
The next step is to order all stuff needed and solder it all together and make a permanent installation. Hopefully I’ll get this done next week.
Above is the Arduino at present, displaying current indoor and outdoor temperatures. As you can see the green LED is lit, everything is OK in my network. I’ll keep it like this until I’ve soldered the new unit.
Status display Arduino project
Since I got the Arduino I’ve decided to get a project going to build a status display for my home network. It will feature an LCD display for showing temperatures, humidity, etc. and two LEDs for showing green (all OK) or red (something’s wrong). Here’s a video of me showing the prototype.
Two new Nagios plugins for checking the EDS 1-wire server
Tonight I’ve completed two new plugins for Nagios (or op5 in my case). They’re used to check DS18B20 and DS2438, i.e. temperature and relative humidity sensors in a 1-wire network. Since I have my 1-wire network(s) connected to a 1-wire server from Embedded Data Systems I wrote the plugins to pull the data from this device. They both use XML parsing and retrieve the data from a file called details.xml on the 1-wire server.
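The XML lookup itself is small. As a rough illustration, here it is as a Python 3 sketch using ElementTree instead of the string replacing the plugins do. The sample document is invented for the example: only the Temperature and Humidity tags and their Units attributes come from the plugins, the surrounding element names are assumptions, and the function name read_probe is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample, the real details.xml contains much more than this
SAMPLE = """<Devices>
  <owd_DS18B20>
    <Temperature Units="Centigrade">21.5</Temperature>
  </owd_DS18B20>
  <owd_DS2438>
    <Temperature Units="Centigrade">22.0</Temperature>
    <Humidity Units="PercentRelativeHumidity">41.3</Humidity>
  </owd_DS2438>
</Devices>"""


def read_probe(xml_text, tag, probe):
    """Return the value of the Nth element called tag, counting from 0."""
    root = ET.fromstring(xml_text)
    # Compare local names only, so an XML namespace would not matter
    values = [el.text for el in root.iter()
              if el.tag.rsplit("}", 1)[-1] == tag]
    return float(values[probe])
```

Here read_probe(SAMPLE, "Temperature", 1) picks the second temperature in the file, in the same way the -p option picks a probe number in the plugins.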
The first plugin is for checking the humidity sensor.
#!/usr/bin/python
# coding: utf-8
#
# Check the humidity probes from a DS2438 sensor on EDS 1-wire server
#
# By Marcus Wilhelmsson
# marcus@nickebo.net
# http://www.nickebo.net
# Licence GPLv2
# Version 0.1
import string, sys, os, argparse, urllib2, xml
from xml.dom.minidom import parseString

# Parse arguments
parser = argparse.ArgumentParser(description='Check humidity probes on EDS 1-wire server')
parser.add_argument('-H', action="store", dest="hostname", help='1-wire server hostname')
parser.add_argument('-p', action="store", dest="probe", type=int, help='Probe number, starting with 0')
parser.add_argument('-w', action="store", dest="warn", type=int, help='Warning humidity')
parser.add_argument('-c', action="store", dest="crit", type=int, help='Critical humidity')
results = parser.parse_args()

# Store parsed arguments in variables and make sure they're not empty
hostname = results.hostname
warn = results.warn
crit = results.crit
probe = results.probe
if (hostname == None or warn == None or crit == None or probe == None):
    parser.print_help()
    # Exit code 3 = UNKNOWN, so a bad invocation isn't reported as OK
    raise SystemExit(3)
if (crit <= warn):
    print "Critical humidity can't be less than or equal to warning humidity"
    parser.print_help()
    raise SystemExit(3)

# Connect to the 1-wire server and download details.xml
try:
    file = urllib2.urlopen('http://' + hostname + '/details.xml')
except:
    print "Could not connect to " + hostname
    raise SystemExit(3)

# Read the data from the XML file and close it
data = file.read()
file.close()

# Parse XML and pick out the value of the requested probe
try:
    dom = parseString(data)
    xmlTag = dom.getElementsByTagName('Humidity')[probe].toxml()
    sensor = float(xmlTag.replace('<Humidity Units="PercentRelativeHumidity">','').replace('</Humidity>',''))
except:
    print "Error reading 1-wire probe"
    raise SystemExit(3)

# Print status and performance data, exit with the matching Nagios code
if (sensor < warn):
    print "OK: Humidity: " + str(sensor) + " %|Humidity=" + str(sensor) + "%;" + str(warn) + ";" + str(crit) + ";" + str(sensor-10) + ";" + str(sensor+10)
    raise SystemExit(0)
elif (sensor >= warn and sensor < crit):
    print "WARNING: Humidity: " + str(sensor) + " %|Humidity=" + str(sensor) + "%;" + str(warn) + ";" + str(crit) + ";" + str(sensor-10) + ";" + str(sensor+10)
    raise SystemExit(1)
else:
    print "CRITICAL: Humidity: " + str(sensor) + " %|Humidity=" + str(sensor) + "%;" + str(warn) + ";" + str(crit) + ";" + str(sensor-10) + ";" + str(sensor+10)
    raise SystemExit(2)
The second script is for checking temperature probes.
#!/usr/bin/python
# coding: utf-8
#
# Check the temperature probes on EDS 1-wire server
#
# By Marcus Wilhelmsson
# marcus@nickebo.net
# http://www.nickebo.net
# Licence GPLv2
# Version 0.1
import string, sys, os, argparse, urllib2, xml
from xml.dom.minidom import parseString

# Parse arguments
parser = argparse.ArgumentParser(description='Check temperature probes on EDS 1-wire server')
parser.add_argument('-H', action="store", dest="hostname", help='1-wire server hostname')
parser.add_argument('-p', action="store", dest="probe", type=int, help='Probe number, starting with 0')
parser.add_argument('-w', action="store", dest="warn", type=int, help='Warning temperature')
parser.add_argument('-c', action="store", dest="crit", type=int, help='Critical temperature')
results = parser.parse_args()

# Store parsed arguments in variables and make sure they're not empty
hostname = results.hostname
warn = results.warn
crit = results.crit
probe = results.probe
if (hostname == None or warn == None or crit == None or probe == None):
    parser.print_help()
    # Exit code 3 = UNKNOWN, so a bad invocation isn't reported as OK
    raise SystemExit(3)
if (crit <= warn):
    print "Critical temperature can't be less than or equal to warning temperature"
    parser.print_help()
    raise SystemExit(3)

# Connect to the 1-wire server and download details.xml
try:
    file = urllib2.urlopen('http://' + hostname + '/details.xml')
except:
    print "Could not connect to " + hostname
    raise SystemExit(3)

# Read the XML file and close it
data = file.read()
file.close()

# Parse XML and pick out the value of the requested probe
try:
    dom = parseString(data)
    xmlTag = dom.getElementsByTagName('Temperature')[probe].toxml()
    sensor = float(xmlTag.replace('<Temperature Units="Centigrade">','').replace('</Temperature>',''))
except:
    print "Error reading 1-wire probe"
    raise SystemExit(3)

# Print status and performance data, exit with the matching Nagios code
if (sensor < warn):
    print "OK: Temperature: " + str(sensor) + " C|Temperature=" + str(sensor) + ";" + str(warn) + ";" + str(crit) + ";" + str(warn-5) + ";" + str(crit+5)
    raise SystemExit(0)
elif (sensor >= warn and sensor < crit):
    print "WARNING: Temperature: " + str(sensor) + " C|Temperature=" + str(sensor) + ";" + str(warn) + ";" + str(crit) + ";" + str(warn-5) + ";" + str(crit+5)
    raise SystemExit(1)
else:
    print "CRITICAL: Temperature: " + str(sensor) + " C|Temperature=" + str(sensor) + ";" + str(warn) + ";" + str(crit) + ";" + str(warn-5) + ";" + str(crit+5)
    raise SystemExit(2)
They’re both licensed under GPL, so feel free to use them.
I’ve sent both plugins to Nagios Exchange, hopefully they’ll be up there in a couple of days tops.
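Both plugins share the same threshold logic: below the warning level is OK, between warning and critical is WARNING, and everything else is CRITICAL. As a sketch only, here is that logic as one Python 3 function returning the Nagios exit code and the output line. The function name evaluate is an assumption, and unlike the plugins it leaves out the min and max fields of the performance data.

```python
def evaluate(label, value, warn, crit, unit=""):
    """Map a reading to a Nagios exit code and a status line with perfdata."""
    if crit <= warn:
        raise ValueError("critical must be greater than warning")
    if value < warn:
        code, state = 0, "OK"
    elif value < crit:
        code, state = 1, "WARNING"
    else:
        code, state = 2, "CRITICAL"
    # Perfdata after the pipe: label=value[unit];warn;crit
    perfdata = "%s=%s%s;%s;%s" % (label, value, unit, warn, crit)
    return code, "%s: %s: %s|%s" % (state, label, value, perfdata)
```

A reading of 21.5 with warn=25 and crit=30 gives exit code 0 and the line "OK: Temperature: 21.5|Temperature=21.5;25;30".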







