Setting Up HP iLO Hardware Monitoring with OMD Labs

Or: How I learned that “community project” sometimes means “bring your own band-aids”

Seven problems. That’s how many obstacles I counted before hardware monitoring for three HP ProLiant servers was finally working. The documentation made it sound so simple: enable SNMP, run check_hpasm, done.

If you’re facing a similar setup – HP ProLiant servers monitored via iLO, OMD Labs as your monitoring system – this walkthrough will save you considerable frustration.

The Setup

Three HP ProLiant DL360p Gen8 servers running VMware ESXi 7.0.3. Each server has an iLO4 management controller with its own network interface:

Server     ESXi IP         iLO IP
esxi-01    192.168.20.x    192.168.21.101
esxi-02    192.168.20.x    192.168.21.102
esxi-03    192.168.20.x    192.168.21.103

Goal: Comprehensive hardware monitoring – fans, temperatures, power supplies, RAM, RAID – via the iLO interface.

Monitoring System: OMD Labs 5.60 (ConSol Edition) with Naemon as the core and PNP4Nagios for graphing.

Sounds straightforward? It wasn’t.

Problem 1: “Failed to create new file: invalid path”

The very first attempt to create a host through the Thruk web interface greeted me with an error:

Failed to create new file: invalid path

What happened? The directory for configuration files simply didn’t exist. OMD Labs is a community project – some basic functionality isn’t perfectly configured out of the box.

The solution:

su - monitoring  # As OMD site user
mkdir -p ~/etc/naemon/conf.d/hosts
mkdir -p ~/etc/naemon/conf.d/services
mkdir -p ~/etc/naemon/conf.d/commands

Lesson learned: For production setups, it’s better to create configurations directly via CLI. The web interface is nice for quick wins but not always reliable.

The ESXi SNMP Detour

Before tackling iLO, we first tried monitoring ESXi itself via SNMP.

Enabling SNMP on ESXi

Via SSH to the ESXi host:

esxcli system snmp set --communities public
esxcli system snmp set --enable true
esxcli network firewall ruleset set --ruleset-id snmp --enabled true

Problem 2: SNMP MIBs Not Loaded

First test:

snmpwalk -v2c -c public esxi-01.example.com sysDescr

Result:

sysDescr: Unknown Object Identifier (Sub-id not found: (top) -> sysDescr)

The SNMP MIBs weren’t loaded on the OMD server. No big deal – numeric OIDs are more universal anyway:

snmpwalk -v2c -c public esxi-01.example.com .1.3.6.1.2.1.1.1.0

Result:

iso.3.6.1.2.1.1.1.0 = STRING: "VMware ESXi 7.0.3 build-24411414 VMware, Inc. x86_64"

Lesson learned: Always use numeric OIDs for Nagios/Naemon checks – they work everywhere regardless of installed MIB files.
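
If you still want readable names in your notes or wrapper scripts, a small lookup keeps the names on your side and sends only numeric OIDs over the wire. The following is a minimal sketch, not something shipped with OMD or net-snmp: oid_for is a hypothetical helper name, and the four entries are the standard MIB-2 system group OIDs.

```shell
# Sketch: translate a few MIB-2 system-group names to numeric OIDs so that
# checks never depend on MIB files being installed.
# oid_for is a hypothetical helper name introduced for this example.
oid_for() {
    case "$1" in
        sysDescr)    echo ".1.3.6.1.2.1.1.1.0" ;;
        sysObjectID) echo ".1.3.6.1.2.1.1.2.0" ;;
        sysUpTime)   echo ".1.3.6.1.2.1.1.3.0" ;;
        sysName)     echo ".1.3.6.1.2.1.1.5.0" ;;
        *)           echo "unknown name: $1" >&2; return 1 ;;
    esac
}

# Usage (needs net-snmp's snmpget on the monitoring server):
#   snmpget -v2c -c public esxi-01.example.com "$(oid_for sysDescr)"
```

Unknown names fail loudly instead of silently querying the wrong OID, which is the behavior you want inside a check command.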

Why iLO Instead of ESXi?

ESXi SNMP provides basic information (uptime, NICs), but for real hardware monitoring, the iLO controller is the better choice:

  • Dedicated hardware management interface – built exactly for this
  • Access even when server is powered off – critical for troubleshooting
  • More detailed sensor data – 41 temperature sensors instead of a handful
  • HP-specific MIBs – CPQHLTH-MIB knows every sensor in the system

Problem 3: SNMP Not Active on iLO

First test against the iLO IP:

snmpwalk -v2c -c public 192.168.21.101 .1.3.6.1.2.1.1.1.0

Timeout

Either SNMP wasn’t enabled on the iLO4, or no read community had been configured.

The solution: In the iLO4 web interface:

  1. Open browser: https://192.168.21.101
  2. Navigate: Administration → Management → SNMP Settings
  3. Enter Read Community (e.g., public – or better, something secure)
  4. Click Apply

After configuration:

snmpwalk -v2c -c public 192.168.21.101 .1.3.6.1.2.1.1.1.0

Result:

iso.3.6.1.2.1.1.1.0 = STRING: "Integrated Lights-Out 4 1.50"
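
With three controllers to configure, it is easy to miss one. A quick loop can confirm that every iLO answers before moving on. This is only a sketch: check_ilo_snmp and the SNMPGET override are names made up for illustration, and the timeout and retry values are arbitrary choices.

```shell
# Sketch: probe sysDescr on each iLO and report which ones answer.
# check_ilo_snmp is a hypothetical helper. SNMPGET may be set to another
# command (e.g. a stub for testing); by default net-snmp's snmpget is used.
check_ilo_snmp() {
    community="$1"; shift
    rc=0
    for ip in "$@"; do
        # -t 2 -r 1: two-second timeout, one retry, so silent hosts fail fast
        if "${SNMPGET:-snmpget}" -v2c -c "$community" -t 2 -r 1 \
               "$ip" .1.3.6.1.2.1.1.1.0 >/dev/null 2>&1; then
            echo "$ip: SNMP ok"
        else
            echo "$ip: no SNMP response"
            rc=1
        fi
    done
    # non-zero if at least one host did not answer
    return "$rc"
}

# Usage:
#   check_ilo_snmp public 192.168.21.101 192.168.21.102 192.168.21.103
```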

check_hpasm: The Right Tool for the Job

We could have written individual SNMP checks for each sensor. But check_hpasm is specifically designed for HP ProLiant hardware:

  • Automatic detection of all components
  • Aggregated health status
  • Performance data for all sensors
  • Human-readable output

Problem 4: Perl Locale Warning

First invocation:

~/lib/nagios/plugins/check_hpasm -H 192.168.21.101 -C public

Warnings appeared:

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LANG = "en_US.UTF-8"
    are supported and installed on your system.

The locale settings in the OMD site environment weren’t properly configured: LANG points to en_US.UTF-8, but judging by the warning, that locale isn’t installed on the server.

The solution:

echo 'export LC_ALL=C' >> ~/.profile
source ~/.profile

After that, the check ran cleanly:

WARNING - system fan overall status is degraded, fan 6 (system) degraded,
System: 'proliant dl360p gen8', S/N: 'XXXXXXXXXX', ROM: 'P71 07/01/2015'

Result: The check immediately found a real hardware issue – Fan 6 is degraded!

Problem 5: --perfdata Option Doesn’t Work

We wanted to enable performance data for graphs:

~/lib/nagios/plugins/check_hpasm -H 192.168.21.101 -C public --perfdata

Error message:

Option perfdata requires an argument

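The OMD build of check_hpasm was configured without --enable-perfdata, so performance data is off by default. On top of that, its --perfdata option expects a value and is rejected when passed as a bare switch.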

The solution: Explicitly use --perfdata=short:

~/lib/nagios/plugins/check_hpasm -H 192.168.21.101 -C public --perfdata=short

Result with performance data:

WARNING - system fan overall status is degraded... | pc_1=112;460;460 pc_2=99;460;460
fan_1=50% fan_2=50% ... temp_1=24;42;42 temp_2=40;70;70 ...
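
To see at a glance what the plugin actually emits, the performance data can be split into one metric per line. This is a rough sketch that assumes the label=value;warn;crit layout shown above and labels without spaces; parse_perfdata is a hypothetical helper, not part of check_hpasm.

```shell
# Sketch: split a plugin's performance-data string into one line per metric.
# Input is everything after the "|" in the plugin output.
# parse_perfdata is a hypothetical helper name used only in this example.
parse_perfdata() {
    printf '%s\n' "$1" | tr ' ' '\n' | awk -F'[=;]' '
        NF >= 2 {
            # $1 = label, $2 = value (possibly with unit), $3 = warn, $4 = crit
            printf "%s value=%s warn=%s crit=%s\n", \
                $1, $2, ($3 == "" ? "-" : $3), ($4 == "" ? "-" : $4)
        }'
}

# Usage:
#   parse_perfdata 'pc_1=112;460;460 fan_1=50% temp_1=24;42;42'
```

Metrics without thresholds, such as the fan percentages, show up with "-" in the warn and crit columns.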

Naemon Configuration

Architecture Decision: Separate iLO Hosts

Important design decision: Define iLO hosts separately from ESXi hosts:

  1. Different IPs – iLO has its own management network
  2. Future-proof – New HP servers without ESXi fit the schema
  3. Clear separation – Hardware monitoring vs. virtualization

File: ~/etc/naemon/conf.d/hosts/ilo-hosts.cfg

# iLO Hostgroup
define hostgroup {
    hostgroup_name          ilo-servers
    alias                   HP iLO Management Interfaces
}

# iLO Host Template
define host {
    name                    ilo-host
    use                     generic-host
    check_command           check-host-alive
    check_interval          5
    register                0
}

# iLO Hosts
define host {
    use                     ilo-host
    host_name               ilo-esxi-01
    alias                   iLO ESXi-01 (DL360p Gen8)
    address                 192.168.21.101
    hostgroups              ilo-servers
}

define host {
    use                     ilo-host
    host_name               ilo-esxi-02
    alias                   iLO ESXi-02 (DL360p Gen8)
    address                 192.168.21.102
    hostgroups              ilo-servers
}

define host {
    use                     ilo-host
    host_name               ilo-esxi-03
    alias                   iLO ESXi-03 (DL360p Gen8)
    address                 192.168.21.103
    hostgroups              ilo-servers
}
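
Three hand-written host blocks are fine, but they are also pure repetition. If more servers are on the horizon, the definitions could be generated from a short list instead. A minimal sketch, assuming the ilo-host template and the ilo-servers hostgroup from this file; gen_ilo_hosts is a hypothetical helper, and the generated alias is simpler than the hand-written ones.

```shell
# Sketch: emit one Naemon host definition per "name ip" pair read from stdin.
# gen_ilo_hosts is a hypothetical helper; the template and hostgroup names
# are the ones used in ilo-hosts.cfg.
gen_ilo_hosts() {
    while read -r name ip; do
        [ -n "$name" ] || continue      # skip blank lines
        printf 'define host {\n'
        printf '    use                     ilo-host\n'
        printf '    host_name               ilo-%s\n' "$name"
        printf '    alias                   iLO %s\n' "$name"
        printf '    address                 %s\n' "$ip"
        printf '    hostgroups              ilo-servers\n'
        printf '}\n\n'
    done
}

# Usage:
#   printf 'esxi-01 192.168.21.101\nesxi-02 192.168.21.102\n' | gen_ilo_hosts
```

Redirect the output into a scratch file first and diff it against the existing configuration before letting it replace anything.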

File: ~/etc/naemon/conf.d/commands/check_hpasm.cfg

define command {
    command_name    check_hpasm
    command_line    $USER1$/check_hpasm -H $HOSTADDRESS$ -C $ARG1$ --perfdata=short
}

File: ~/etc/naemon/conf.d/services/hpasm-services.cfg

define service {
    use                     generic-service
    hostgroup_name          ilo-servers
    service_description     HP Hardware Health
    check_command           check_hpasm!public
    check_interval          5
}

Problem 6: Hostgroup Not Found

omd check

Error message:

Error: Could not find any hostgroup matching 'ilo-servers'

The hostgroup definition was missing or not loaded correctly.

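The solution: Make sure the hostgroup definition actually exists in a .cfg file that Naemon loads, such as ilo-hosts.cfg. Naemon resolves references only after reading all configuration files, so the order of definitions is not what matters here. After correction: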

omd check && omd reload naemon
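
A quick cross-check can catch this kind of error before the core complains. The sketch below lists every hostgroup that is referenced via hostgroup_name outside a hostgroup definition but never defined in any .cfg file. It is deliberately naive (one name per line, no comma lists, no template inheritance), and find_missing_hostgroups is a hypothetical helper name.

```shell
# Sketch: list hostgroups that are referenced but not defined in any .cfg
# file below the given directory.
# find_missing_hostgroups is a hypothetical helper; it only understands
# single-name "hostgroup_name X" lines.
find_missing_hostgroups() {
    dir="${1:-$HOME/etc/naemon/conf.d}"
    find "$dir" -name '*.cfg' -type f -exec cat {} + | awk '
        /^[ \t]*define[ \t]+hostgroup/ { in_hg = 1; next }  # entering a hostgroup block
        /^[ \t]*define[ \t]+/          { in_hg = 0; next }  # entering any other block
        $1 == "hostgroup_name" {
            if (in_hg) defined[$2] = 1   # definition
            else       used[$2] = 1      # reference, e.g. from a service
        }
        END {
            for (g in used) if (!(g in defined)) print g
        }'
}

# Usage:
#   find_missing_hostgroups ~/etc/naemon/conf.d
```

Empty output means every referenced hostgroup has a definition somewhere in the tree, regardless of which file it lives in.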

Check Status via Livestatus

echo "GET services
Filter: description = HP Hardware Health
Columns: host_name description state plugin_output" | unixcat ~/tmp/run/live

Result:

ilo-esxi-01;HP Hardware Health;1;WARNING - system fan overall status is degraded, fan 6 (system) degraded...
ilo-esxi-02;HP Hardware Health;2;CRITICAL - dimm module 0:12 (module 12 @ cartridge 0) needs attention (degraded)...
ilo-esxi-03;HP Hardware Health;0;OK - System: 'proliant dl360p gen8', hardware working fine...

[Figure: Thruk service overview with all three iLO hosts at a glance: WARNING, CRITICAL, and OK]
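
The numeric state column is compact but not exactly friendly when pasted into a ticket. A small filter can translate it into the usual state names. This is a sketch that assumes the four-column, semicolon-separated output of the query above; format_states is a hypothetical helper name.

```shell
# Sketch: turn the numeric Livestatus state column into state names.
# Expects "host;description;state;plugin_output" lines on stdin.
# format_states is a hypothetical helper name.
format_states() {
    awk -F';' '
        BEGIN { name[0] = "OK"; name[1] = "WARNING"; name[2] = "CRITICAL"; name[3] = "UNKNOWN" }
        NF >= 3 {
            state = ($3 in name) ? name[$3] : "?"
            printf "%-12s %-8s %s\n", $1, state, $2
        }'
}

# Usage: append "| format_states" to the unixcat pipeline shown above.
```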

PNP4Nagios Graphing

After a few check cycles:

ls ~/var/pnp4nagios/perfdata/ilo-esxi-01/
HP_Hardware_Health.xml
HP_Hardware_Health_fan_1.rrd
HP_Hardware_Health_fan_2.rrd
...
HP_Hardware_Health_temp_1.rrd
HP_Hardware_Health_temp_2.rrd
...

8 fans, 41 temperature sensors, 2 power consumption values – perfect!

[Figure: Thruk service detail with all performance data at a glance: fans, temperatures, power consumption]

Problem 7: PNP4Nagios Shows Error Instead of Graphs

When opening PNP4Nagios in the browser:

Please check the documentation for information about the following error.

Undefined array key 14

file [line]:
templates.dist/check_hpasm.php [35]:

The bundled PNP4Nagios template for check_hpasm defines only 14 colors:

$colors=array("CC3300","CC3333","CC3366",...); // only 14 entries

But our DL360p Gen8 has 41 temperature sensors! At sensor 15, there’s no color entry → array index error.
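
The arithmetic behind the error is simple. With a palette of 14 entries, valid indexes run from 0 to 13, so the fifteenth sensor (index 14) falls off the end. Wrapping the index with the modulo operator keeps it in range no matter how many sensors show up. A tiny sketch of that idea, where color_index is a hypothetical helper and 14 is the palette size of the bundled template:

```shell
# Sketch: map a zero-based sensor index onto a palette of limited size.
# color_index is a hypothetical helper; the default of 14 mirrors the
# number of colors in the bundled template.
color_index() {
    sensor="$1"
    palette_size="${2:-14}"
    # modulo wraps the index back to 0 once the palette is exhausted
    echo $(( sensor % palette_size ))
}

# Usage:
#   color_index 14      # fifteenth sensor, wraps around to index 0
#   color_index 40 48   # 41st sensor with a 48-color palette
```

Colors repeat once the palette is exhausted, which is an acceptable trade for a graph that renders at all.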

The solution: Custom template with more colors and cyclic usage:

File: ~/share/pnp4nagios/htdocs/templates/check_hpasm.php

<?php
#
# Fixed check_hpasm template with more colors
#
$colors=array(
    "CC3300","CC3333","CC3366","CC3399","CC33CC","CC33FF",
    "336600","336633","336666","336699","3366CC","3366FF",
    "33CC33","33CC66","33CC99","33CCCC","33CCFF","339900",
    "339933","339966","339999","3399CC","3399FF","993300",
    "993333","993366","993399","9933CC","9933FF","996600",
    "996633","996666","996699","9966CC","9966FF","999900",
    "999933","999966","9999CC","9999FF","00CC00","00CC33",
    "00CC66","00CC99","00CCCC","00CCFF","0099FF","0066FF"
);
$max_rpm=5400;
$col_f=0;
$col_t=0;
$num_colors=count($colors);

foreach($DS as $KEY => $VAL){
    if(preg_match('/^fan_/',$NAME[$KEY])){
        $ds_name[1] = "Fan Speed";
        $opt[1] = "-X0 --slope-mode -u $max_rpm --vertical-label \"RPMs\"  --title \"HPASM Fan Speed\" ";
        if(!isset($def[1])){
            $def[1] = "";
        }
        $def[1] .= "DEF:ovar$KEY=$RRDFILE[$KEY]:$DS[$KEY]:AVERAGE " ;
        $def[1] .= "CDEF:var$KEY=ovar$KEY,100,/,$max_rpm,* " ;
        // Modulo operator for cyclic color usage
        $def[1] .= "LINE:var$KEY#".$colors[$col_f % $num_colors].":\"$NAME[$KEY]\" " ;
        $def[1] .= "GPRINT:var$KEY:LAST:\"%6.0lf RPM LAST \" ";
        $def[1] .= "GPRINT:var$KEY:MAX:\"%6.0lf RPM MAX \" ";
        $def[1] .= "GPRINT:var$KEY:AVERAGE:\"%6.2lf RPM AVERAGE \\n\" ";
        $col_f++;
    }
    if(preg_match('/^temp_/',$NAME[$KEY])){
        $ds_name[2] = "Temperature";
        $opt[2] = "--slope-mode --vertical-label \"Celsius\"  --title \"HPASM Temperature\" ";
        if(!isset($def[2])){
            $def[2] = "";
        }
        $def[2] .= "DEF:var$KEY=$RRDFILE[$KEY]:$DS[$KEY]:AVERAGE " ;
        // Modulo operator for cyclic color usage
        $def[2] .= "LINE:var$KEY#".$colors[$col_t % $num_colors].":\"$NAME[$KEY]\\t\" " ;
        $def[2] .= "GPRINT:var$KEY:LAST:\"%6.0lf $UNIT[$KEY] LAST \" ";
        $def[2] .= "GPRINT:var$KEY:MAX:\"%6.0lf $UNIT[$KEY] MAX \" ";
        $def[2] .= "GPRINT:var$KEY:AVERAGE:\"%6.2lf $UNIT[$KEY] AVERAGE \\n\" ";
        $col_t++;
    }
    // Additional: Power Consumption Graph
    if(preg_match('/^pc_/',$NAME[$KEY])){
        $ds_name[3] = "Power Consumption";
        $opt[3] = "--slope-mode --vertical-label \"Watts\"  --title \"HPASM Power Consumption\" ";
        if(!isset($def[3])){
            $def[3] = "";
        }
        $def[3] .= "DEF:var$KEY=$RRDFILE[$KEY]:$DS[$KEY]:AVERAGE " ;
        $def[3] .= "LINE:var$KEY#".$colors[$KEY % $num_colors].":\"$NAME[$KEY]\" " ;
        $def[3] .= "GPRINT:var$KEY:LAST:\"%6.0lf W LAST \" ";
        $def[3] .= "GPRINT:var$KEY:MAX:\"%6.0lf W MAX \" ";
        $def[3] .= "GPRINT:var$KEY:AVERAGE:\"%6.2lf W AVERAGE \\n\" ";
    }
}
?>

After the fix:

omd reload apache

Now three separate graphs are displayed:

  1. Fan Speed – All 8 fans
  2. Temperature – All 41 temperature sensors
  3. Power Consumption – Power usage of both PSUs

[Figure: PNP4Nagios graph of power consumption for both PSUs over time]
[Figure: PNP4Nagios graph of all 41 temperature sensors over time]
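
One detail of the fan graph is easy to miss: check_hpasm reports fan speed as a percentage, and the template's CDEF rescales it to RPM using $max_rpm. The RPN expression ovar,100,/,$max_rpm,* simply means ovar / 100 * max_rpm. The sketch below reproduces that conversion; pct_to_rpm is a hypothetical helper, and 5400 is the ceiling assumed by the template, not a value read from the hardware.

```shell
# Sketch: the template's CDEF is RPN for  ovar / 100 * max_rpm.
# pct_to_rpm is a hypothetical helper; 5400 is the template's max_rpm,
# an assumed ceiling rather than a measured one.
pct_to_rpm() {
    pct="$1"
    max_rpm="${2:-5400}"
    # multiply first so integer division does not truncate small percentages to zero
    echo $(( pct * max_rpm / 100 ))
}

# Usage:
#   pct_to_rpm 50    # a fan at 50% of the assumed 5400 RPM ceiling
```

The RPM values in the graph are therefore estimates, only as accurate as the assumed maximum.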

What Did We Learn?

The Seven Pitfalls at a Glance

Problem                        Cause                       Solution
“Failed to create new file”    Directories missing         mkdir -p ~/etc/naemon/conf.d/{hosts,...}
SNMP OID “Unknown”             MIBs not loaded             Use numeric OIDs
iLO SNMP Timeout               Community not configured    Enter “Read Community” in iLO web UI
Perl Locale Warning            LC_ALL not set              export LC_ALL=C in ~/.profile
--perfdata without argument    Compile option missing      Use --perfdata=short
Hostgroup not found            Definition missing          Add hostgroup to config file
PNP “Undefined array key”      Too few colors              Custom template with 48+ colors

The Real Payoff

Yes, the setup was more involved than expected. But the monitoring immediately paid for itself:

  • Fan 6 on server 1 is degraded – without redundancy, this would have led to a thermal shutdown
  • DIMM 0:12 on server 2 is defective – ECC is still correcting, but the module needs replacement

Both issues would likely have gone unnoticed without active hardware monitoring until it was too late. (I knew about them, of course!)

File Structure for Reference

~/etc/naemon/conf.d/
├── commands/
│   └── check_hpasm.cfg
├── hosts/
│   └── ilo-hosts.cfg
└── services/
    └── hpasm-services.cfg

~/share/pnp4nagios/htdocs/templates/
└── check_hpasm.php  # Custom template

Conclusion

Seven problems, seven solutions. A long evening with SNMP, Perl, and way too many temperature sensors. But in the end, we have solid hardware monitoring that also found the known defects (good proof of concept ;)).

Was it worth the effort? Absolutely. Finding hardware problems before they become critical is priceless.
