Setting Up HP iLO Hardware Monitoring with OMD Labs

Seven problems. That’s how many obstacles I counted before hardware monitoring for three HP ProLiant servers was finally working. The documentation made it sound so simple: enable SNMP, run check_hpasm, done.

Or: How I learned that “community project” sometimes means “bring your own band-aids”

If you’re facing a similar setup – HP ProLiant servers monitored via iLO, OMD Labs as your monitoring system – this walkthrough will save you considerable frustration.

The Setup

Three HP ProLiant DL360p Gen8 servers running VMware ESXi 7.0.3. Each server has an iLO4 management controller with its own network interface:

Server	ESXi IP	iLO IP
esxi-01	192.168.20.x	192.168.21.101
esxi-02	192.168.20.x	192.168.21.102
esxi-03	192.168.20.x	192.168.21.103

Goal: Comprehensive hardware monitoring – fans, temperatures, power supplies, RAM, RAID – via the iLO interface.

Monitoring System: OMD Labs 5.60 (ConSol Edition) with Naemon as the core and PNP4Nagios for graphing.

Sounds straightforward? It wasn’t.

Problem 1: “Failed to create new file: invalid path”

The very first attempt to create a host through the Thruk web interface greeted me with an error:

Failed to create new file: invalid path

What happened? The directory for configuration files simply didn’t exist. OMD Labs is a community project – some basic functionality isn’t perfectly configured out of the box.

The solution:

su - monitoring  # As OMD site user
mkdir -p ~/etc/naemon/conf.d/hosts
mkdir -p ~/etc/naemon/conf.d/services
mkdir -p ~/etc/naemon/conf.d/commands

Lesson learned: For production setups, it’s better to create configurations directly via CLI. The web interface is nice for quick wins but not always reliable.

The ESXi SNMP Detour

Before tackling iLO, we first tried monitoring ESXi itself via SNMP.

Enabling SNMP on ESXi

Via SSH to the ESXi host:

esxcli system snmp set --communities public
esxcli system snmp set --enable true
esxcli network firewall ruleset set --ruleset-id snmp --enabled true

Problem 2: SNMP MIBs Not Loaded

First test:

snmpwalk -v2c -c public esxi-01.example.com sysDescr

Result:

sysDescr: Unknown Object Identifier (Sub-id not found: (top) -> sysDescr)

The SNMP MIBs weren’t loaded on the OMD server. No big deal – numeric OIDs are more universal anyway:

snmpwalk -v2c -c public esxi-01.example.com .1.3.6.1.2.1.1.1.0

Result:

iso.3.6.1.2.1.1.1.0 = STRING: "VMware ESXi 7.0.3 build-24411414 VMware, Inc. x86_64"

Lesson learned: Always use numeric OIDs for Nagios/Naemon checks – they work everywhere regardless of installed MIB files.

Why iLO Instead of ESXi?

ESXi SNMP provides basic information (uptime, NICs), but for real hardware monitoring, the iLO controller is the better choice:

Dedicated hardware management interface – built exactly for this
Access even when server is powered off – critical for troubleshooting
More detailed sensor data – 41 temperature sensors instead of a handful
HP-specific MIBs – CPQHLTH-MIB knows every sensor in the system

Problem 3: SNMP Not Active on iLO

First test against the iLO IP:

snmpwalk -v2c -c public 192.168.21.101 .1.3.6.1.2.1.1.1.0

→ Timeout

SNMP wasn’t enabled on iLO4 or no community was configured.

The solution: In the iLO4 web interface:

Open browser: https://192.168.21.101
Navigate: Administration → Management → SNMP Settings
Enter Read Community (e.g., public – or better, something secure)
Click Apply

After configuration:

snmpwalk -v2c -c public 192.168.21.101 .1.3.6.1.2.1.1.1.0

Result:

iso.3.6.1.2.1.1.1.0 = STRING: "Integrated Lights-Out 4 1.50"

check_hpasm: The Right Tool for the Job

We could have written individual SNMP checks for each sensor. But check_hpasm is specifically designed for HP ProLiant hardware:

Automatic detection of all components
Aggregated health status
Performance data for all sensors
Human-readable output

Problem 4: Perl Locale Warning

First invocation:

~/lib/nagios/plugins/check_hpasm -H 192.168.21.101 -C public

Warnings appeared:

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LANG = "en_US.UTF-8"
    are supported and installed on your system.

The locale settings in the OMD site environment weren’t properly configured.

The solution:

echo 'export LC_ALL=C' >> ~/.profile
source ~/.profile

After that, the check ran cleanly:

WARNING - system fan overall status is degraded, fan 6 (system) degraded,
System: 'proliant dl360p gen8', S/N: 'XXXXXXXXXX', ROM: 'P71 07/01/2015'

Result: The check immediately found a real hardware issue – Fan 6 is degraded!

Problem 5: --perfdata Option Doesn’t Work

We wanted to enable performance data for graphs:

~/lib/nagios/plugins/check_hpasm -H 192.168.21.101 -C public --perfdata

Error message:

Option perfdata requires an argument

The OMD version of check_hpasm was compiled without --enable-perfdata. The option expects a value.

The solution: Explicitly use --perfdata=short:

~/lib/nagios/plugins/check_hpasm -H 192.168.21.101 -C public --perfdata=short

Result with performance data:

WARNING - system fan overall status is degraded... | pc_1=112;460;460 pc_2=99;460;460
fan_1=50% fan_2=50% ... temp_1=24;42;42 temp_2=40;70;70 ...

Naemon Configuration

Architecture Decision: Separate iLO Hosts

Important design decision: Define iLO hosts separately from ESXi hosts:

Different IPs – iLO has its own management network
Future-proof – New HP servers without ESXi fit the schema
Clear separation – Hardware monitoring vs. virtualization

File: ~/etc/naemon/conf.d/hosts/ilo-hosts.cfg

# iLO Hostgroup
define hostgroup {
    hostgroup_name          ilo-servers
    alias                   HP iLO Management Interfaces
}

# iLO Host Template
define host {
    name                    ilo-host
    use                     generic-host
    check_command           check-host-alive
    check_interval          5
    register                0
}

# iLO Hosts
define host {
    use                     ilo-host
    host_name               ilo-esxi-01
    alias                   iLO ESXi-01 (DL360p Gen8)
    address                 192.168.21.101
    hostgroups              ilo-servers
}

define host {
    use                     ilo-host
    host_name               ilo-esxi-02
    alias                   iLO ESXi-02 (DL360p Gen8)
    address                 192.168.21.102
    hostgroups              ilo-servers
}

define host {
    use                     ilo-host
    host_name               ilo-esxi-03
    alias                   iLO ESXi-03 (DL360p Gen8)
    address                 192.168.21.103
    hostgroups              ilo-servers
}

File: ~/etc/naemon/conf.d/commands/check_hpasm.cfg

define command {
    command_name    check_hpasm
    command_line    $USER1$/check_hpasm -H $HOSTADDRESS$ -C $ARG1$ --perfdata=short
}

File: ~/etc/naemon/conf.d/services/hpasm-services.cfg

define service {
    use                     generic-service
    hostgroup_name          ilo-servers
    service_description     HP Hardware Health
    check_command           check_hpasm!public
    check_interval          5
}

Problem 6: Hostgroup Not Found

omd check

Error message:

Error: Could not find any hostgroup matching 'ilo-servers'

The hostgroup definition was missing or not loaded correctly.

The solution: Ensure the hostgroup is defined in the same file or before the services. After correction:

omd check && omd reload naemon

Check Status via Livestatus

echo "GET services
Filter: description = HP Hardware Health
Columns: host_name description state plugin_output" | unixcat ~/tmp/run/live

Result:

ilo-esxi-01;HP Hardware Health;1;WARNING - system fan overall status is degraded, fan 6 (system) degraded...
ilo-esxi-02;HP Hardware Health;2;CRITICAL - dimm module 0:12 (module 12 @ cartridge 0) needs attention (degraded)...
ilo-esxi-03;HP Hardware Health;0;OK - System: 'proliant dl360p gen8', hardware working fine...

Thruk service overview showing three iLO hosts: WARNING, CRITICAL, and OK status — Thruk Service Overview: All three iLO hosts at a glance

PNP4Nagios Graphing

After a few check cycles:

ls ~/var/pnp4nagios/perfdata/ilo-esxi-01/

HP_Hardware_Health.xml
HP_Hardware_Health_fan_1.rrd
HP_Hardware_Health_fan_2.rrd
...
HP_Hardware_Health_temp_1.rrd
HP_Hardware_Health_temp_2.rrd
...

8 fans, 41 temperature sensors, 2 power consumption values – perfect!

Thruk service detail with performance data: fans, temperatures, power consumption — Service Detail: All performance data at a glance

Problem 7: PNP4Nagios Shows Error Instead of Graphs

When opening PNP4Nagios in the browser:

Please check the documentation for information about the following error.

Undefined array key 14

file [line]:
templates.dist/check_hpasm.php [35]:

The bundled PNP4Nagios template for check_hpasm defines only 14 colors:

$colors=array("CC3300","CC3333","CC3366",...); // only 14 entries

But our DL360p Gen8 has 41 temperature sensors! At sensor 15, there’s no color entry → array index error.

The solution: Custom template with more colors and cyclic usage:

File: ~/share/pnp4nagios/htdocs/templates/check_hpasm.php

<?php
#
# Fixed check_hpasm template with more colors
#
$colors=array(
    "CC3300","CC3333","CC3366","CC3399","CC33CC","CC33FF",
    "336600","336633","336666","336699","3366CC","3366FF",
    "33CC33","33CC66","33CC99","33CCCC","33CCFF","339900",
    "339933","339966","339999","3399CC","3399FF","993300",
    "993333","993366","993399","9933CC","9933FF","996600",
    "996633","996666","996699","9966CC","9966FF","999900",
    "999933","999966","9999CC","9999FF","00CC00","00CC33",
    "00CC66","00CC99","00CCCC","00CCFF","0099FF","0066FF"
);
$max_rpm=5400;
$col_f=0;
$col_t=0;
$num_colors=count($colors);

foreach($DS as $KEY => $VAL){
    if(preg_match('/^fan_/',$NAME[$KEY])){
        $ds_name[1] = "Fan Speed";
        $opt[1] = "-X0 --slope-mode -u $max_rpm --vertical-label \"RPMs\"  --title \"HPASM Fan Speed\" ";
        if(!isset($def[1])){
            $def[1] = "";
        }
        $def[1] .= "DEF:ovar$KEY=$RRDFILE[$KEY]:$DS[$KEY]:AVERAGE " ;
        $def[1] .= "CDEF:var$KEY=ovar$KEY,100,/,$max_rpm,* " ;
        // Modulo operator for cyclic color usage
        $def[1] .= "LINE:var$KEY#".$colors[$col_f % $num_colors].":\"$NAME[$KEY]\" " ;
        $def[1] .= "GPRINT:var$KEY:LAST:\"%6.0lf RPM LAST \" ";
        $def[1] .= "GPRINT:var$KEY:MAX:\"%6.0lf RPM MAX \" ";
        $def[1] .= "GPRINT:var$KEY:AVERAGE:\"%6.2lf RPM AVERAGE \\n\" ";
        $col_f++;
    }
    if(preg_match('/^temp_/',$NAME[$KEY])){
        $ds_name[2] = "Temperature";
        $opt[2] = "--slope-mode --vertical-label \"Celsius\"  --title \"HPASM Temperature\" ";
        if(!isset($def[2])){
            $def[2] = "";
        }
        $def[2] .= "DEF:var$KEY=$RRDFILE[$KEY]:$DS[$KEY]:AVERAGE " ;
        // Modulo operator for cyclic color usage
        $def[2] .= "LINE:var$KEY#".$colors[$col_t % $num_colors].":\"$NAME[$KEY]\\t\" " ;
        $def[2] .= "GPRINT:var$KEY:LAST:\"%6.0lf $UNIT[$KEY] LAST \" ";
        $def[2] .= "GPRINT:var$KEY:MAX:\"%6.0lf $UNIT[$KEY] MAX \" ";
        $def[2] .= "GPRINT:var$KEY:AVERAGE:\"%6.2lf $UNIT[$KEY] AVERAGE \\n\" ";
        $col_t++;
    }
    // Additional: Power Consumption Graph
    if(preg_match('/^pc_/',$NAME[$KEY])){
        $ds_name[3] = "Power Consumption";
        $opt[3] = "--slope-mode --vertical-label \"Watts\"  --title \"HPASM Power Consumption\" ";
        if(!isset($def[3])){
            $def[3] = "";
        }
        $def[3] .= "DEF:var$KEY=$RRDFILE[$KEY]:$DS[$KEY]:AVERAGE " ;
        $def[3] .= "LINE:var$KEY#".$colors[$KEY % $num_colors].":\"$NAME[$KEY]\" " ;
        $def[3] .= "GPRINT:var$KEY:LAST:\"%6.0lf W LAST \" ";
        $def[3] .= "GPRINT:var$KEY:MAX:\"%6.0lf W MAX \" ";
        $def[3] .= "GPRINT:var$KEY:AVERAGE:\"%6.2lf W AVERAGE \\n\" ";
    }
}
?>

After the fix:

omd reload apache

Now three separate graphs are displayed:

Fan Speed – All 8 fans
Temperature – All 41 temperature sensors
Power Consumption – Power usage of both PSUs

PNP4Nagios graph: Power consumption of both PSUs over time — Power Consumption

PNP4Nagios graph: 41 temperature sensors over time — Temperature (41 sensors)

What Did We Learn?

The Seven Pitfalls at a Glance

Problem	Cause	Solution
“Failed to create new file”	Directories missing	`mkdir -p ~/etc/naemon/conf.d/{hosts,...}`
SNMP OID “Unknown”	MIBs not loaded	Use numeric OIDs
iLO SNMP Timeout	Community not configured	Enter “Read Community” in iLO web UI
Perl Locale Warning	LC_ALL not set	`export LC_ALL=C` in ~/.profile
–perfdata without argument	Compile option missing	Use `--perfdata=short`
Hostgroup not found	Definition missing	Add hostgroup to config file
PNP “Undefined array key”	Too few colors	Custom template with 48+ colors

The Real Payoff

Yes, the setup was more involved than expected. But the monitoring immediately paid for itself:

Fan 6 on server 1 is degraded – without redundancy, this would have led to a thermal shutdown
DIMM 0:12 on server 2 is defective – ECC is still correcting, but the module needs replacement

Both issues would likely have gone unnoticed without active hardware monitoring until it was too late. (I knew about them, of course!)

File Structure for Reference

~/etc/naemon/conf.d/
├── commands/
│   └── check_hpasm.cfg
├── hosts/
│   └── ilo-hosts.cfg
└── services/
    └── hpasm-services.cfg

~/share/pnp4nagios/htdocs/templates/
└── check_hpasm.php  # Custom template

Conclusion

Seven problems, seven solutions. A long evening with SNMP, Perl, and way too many temperature sensors. But in the end, we have solid hardware monitoring that also found the known defects (good proof of concept ;)).

Was it worth the effort? Absolutely. Finding hardware problems before they become critical is priceless.