XenServer NRPE Monitoring

Problem

Citrix XenServer hypervisors in a customer environment needed monitoring, but installing Zabbix agents on them wasn’t an option. The customer’s XenServer hosts already had NRPE (Nagios Remote Plugin Executor) agents installed from a legacy Nagios deployment. The hypervisors were critical infrastructure — they hosted all customer VMs — and the operations team wasn’t willing to install additional software on them.

I needed a way to monitor 25 XenServer hosts through their existing NRPE agents, integrate the data into Zabbix, and discover new hosts automatically when they appeared on the network. The solution also had to handle the fact that NRPE is fundamentally a pull protocol (you query the agent and it responds), while the monitoring architecture works best with a push model where data is sent to Zabbix.

Constraints

No Zabbix agent on hypervisors. Only NRPE was available on the target hosts. All monitoring data had to come through NRPE queries.
Push, not pull. Zabbix has no native NRPE item type, so there is no built-in way to poll these agents. The cleanest integration path was collecting data externally and pushing it to Zabbix via trapper items using zabbix_sender.
Automatic host discovery. New XenServer hosts should be detected and monitored without manual configuration. The customer added hypervisors regularly during capacity expansions.
Cron-driven, not Zabbix-polled. NRPE queries needed to run on a fixed schedule (every minute) rather than being triggered by Zabbix’s polling mechanism, giving more predictable timing and avoiding Zabbix queue pressure.

Architecture

The system inverts the typical Zabbix monitoring model. Instead of Zabbix pulling data from agents, three Bash scripts running on the Zabbix proxy push data into Zabbix:

Host Discovery (discover_xenapp_hosts.sh, 327 lines). Runs every 24 hours via cron. Uses nmap to scan a configurable IP range for hosts responding on NRPE’s port 5666. For each responding host, it resolves the hostname via reverse DNS and constructs a JSON array in Zabbix’s LLD (Low-Level Discovery) format. The discovery data is sent to Zabbix via zabbix_sender as a trapper item. Zabbix’s host prototype system then automatically creates host entries for newly discovered XenServer nodes, assigns the monitoring template, and configures the correct proxy.
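The scan-to-LLD step described above can be sketched as follows. This is a Python illustration, not the Bash from discover_xenapp_hosts.sh, and it assumes the scan output is in nmap's grepable format (for example from nmap -p 5666 --open -oG - <range>). The macro names {#XENHOST} and {#XENIP} are hypothetical; the real template may use different ones.

```python
import json
import re

# Matches one host line of nmap's grepable (-oG) output, e.g.
# "Host: 10.0.0.5 (xen01.example.com)\tPorts: 5666/open/tcp//nrpe///"
HOST_LINE = re.compile(r"^Host:\s+(\S+)\s+\(([^)]*)\)\s+Ports:\s+(.*)$")


def parse_nmap_grepable(output, port=5666):
    """Return (ip, hostname) pairs for hosts with the given TCP port open."""
    hosts = []
    for line in output.splitlines():
        match = HOST_LINE.match(line)
        if not match:
            continue  # comment lines and "Status:" lines carry no port data
        ip, name, ports = match.groups()
        port_open = any(
            entry.strip().startswith(f"{port}/open/tcp")
            for entry in ports.split(",")
        )
        if port_open:
            # Fall back to the IP when reverse DNS returned nothing.
            hosts.append((ip, name or ip))
    return hosts


def build_host_lld(hosts):
    """Serialize hosts as a Zabbix LLD array (macro names are assumptions)."""
    return json.dumps(
        [{"{#XENHOST}": name, "{#XENIP}": ip} for ip, name in hosts]
    )
```

The resulting string is what would be handed to zabbix_sender as the value of the discovery trapper item.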

Metric Discovery (discover_xenapp_metrics.sh, 179 lines). Runs every 24 hours. For each known host, queries the NRPE agent to discover which metrics are available (not all hosts expose the same check commands). Outputs a JSON LLD array mapping metric names to their Zabbix item keys, units, and display names. This allows the template to dynamically create items only for metrics that each specific host actually supports.
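A minimal Python sketch of the per-host metric discovery idea: try each candidate check and emit an LLD row only for the ones the host answers. The catalogue entries, the macro names, the nrpe.metric[...] key format, and the probe callable (standing in for an actual NRPE query) are all assumptions made for illustration.

```python
import json

# Hypothetical catalogue: NRPE command -> (metric id, unit, display name).
CANDIDATE_CHECKS = {
    "check_cpu": ("cpu", "%", "CPU utilization"),
    "check_mem": ("mem", "%", "Memory utilization"),
    "check_load": ("load1", "", "Load average (1 min)"),
}


def discover_metrics(probe, catalogue=CANDIDATE_CHECKS):
    """Build an LLD array for the checks that probe(command) reports as available."""
    rows = []
    for command, (metric, unit, name) in catalogue.items():
        if not probe(command):
            continue  # host does not expose this command, so no item is created
        rows.append({
            "{#METRIC}": metric,
            "{#KEY}": f"nrpe.metric[{metric}]",  # assumed item key format
            "{#UNIT}": unit,
            "{#NAME}": name,
        })
    return json.dumps(rows)
```

Passing the probe in as a function keeps the LLD construction separate from however the NRPE query itself is performed.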

Data Collection (query_xen_server.sh, 452 lines). Runs every minute via cron. Iterates through all known hosts, queries each one’s NRPE agent for all discovered metrics, parses the results, and sends each host’s combined data to Zabbix as a single JSON blob via zabbix_sender. A single master trapper item (nrpe.master.data) on each host receives the JSON, and dependent items extract individual metrics using JSONPath preprocessing.
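As a rough Python sketch of the last step, the functions below serialize one host's metrics and format them as a line for a zabbix_sender input file (the file passed with -i, one <hostname> <key> <value> entry per line). The metric names are hypothetical, and quoting every field is a conservative choice based on the zabbix_sender man page, which asks for double quotes around entries containing whitespace and backslash escapes for embedded quotes and backslashes.

```python
import json


def build_master_payload(metrics):
    """Serialize one host's metrics as compact JSON for the master item."""
    return json.dumps(metrics, separators=(",", ":"), sort_keys=True)


def sender_input_line(host, key, value):
    """Format one zabbix_sender input-file line: <host> <key> <value>.

    Each field is double-quoted, with backslashes and double quotes
    escaped, so a JSON value stays a single field.
    """
    def quote(field):
        return '"' + field.replace("\\", "\\\\").replace('"', '\\"') + '"'

    return f"{quote(host)} {quote(key)} {quote(value)}"


# Hypothetical metric names, for illustration only.
payload = build_master_payload({"cpu_usage": 45, "mem_usage": 61.5})
print(sender_input_line("xen01", "nrpe.master.data", payload))
```

Writing one such line per host into a single file lets one zabbix_sender invocation deliver the whole collection cycle.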

The shape of the system is an inverted monitoring model: the Zabbix proxy runs the Bash scripts on cron, queries the XenServer hosts over NRPE, and pushes results in via zabbix_sender for dependent-item preprocessing — push from the proxy instead of pull from the server.

Key Engineering Decisions

Single master trapper item with dependent extraction. Sending each metric as a separate zabbix_sender call would mean 17 metrics × 25 hosts = 425 individual sends per minute. Instead, all metrics for a host are combined into a single JSON object and sent once, which brings that down to 25 sends per minute. Dependent items with JSONPath preprocessing extract the individual values. This reduces zabbix_sender overhead and gives Zabbix a single point to check for data staleness: if the master item stops updating, a nodata trigger fires.
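To make the dependent-item idea concrete, here is a toy Python resolver that pulls one value out of a master payload using a dotted path such as $.cpu_usage. It handles only the simplest "$.a.b" form and is not a JSONPath implementation; Zabbix's own preprocessing does the real work. The payload layout and key names are assumptions.

```python
import json


def extract(master_json, path):
    """Resolve a simple dotted path such as "$.dom0.load1".

    Only the "$.a.b" subset is handled; anything else is rejected.
    """
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    node = json.loads(master_json)
    for part in path[2:].split("."):
        node = node[part]  # KeyError means the metric is absent from the payload
    return node


# Hypothetical payload as stored by the master trapper item.
master = '{"cpu_usage": 45, "dom0": {"load1": 1.7}}'
cpu = extract(master, "$.cpu_usage")
dom0_load = extract(master, "$.dom0.load1")
```

Each dependent item corresponds to one such path, so adding a metric means adding a key to the payload and an item with a matching path, with no extra sender traffic.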

File-based host cache with locking. The host list produced by the discovery script is written to a JSON cache file (/tmp/xenapp_hosts_cache.json), and the collection script reads this cache on each run. File locking (via flock) keeps the collection script from reading a half-written cache if discovery and collection happen to overlap. This is a simple approach, but it’s robust: the cache file is human-readable, trivially debuggable, and survives restarts of the Zabbix proxy service.
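The locking pattern can be sketched in Python with fcntl.flock, which uses the same advisory lock as the flock command-line tool. This is an illustration under assumptions, not the scripts' actual code: the writer takes an exclusive lock before rewriting the file, the reader takes a shared lock, and the cache layout (a JSON list of host records) is hypothetical. It runs on Unix-like systems only.

```python
import fcntl
import json
import os


def write_cache(path, hosts):
    """Rewrite the cache while holding an exclusive lock."""
    # Open without truncating so the old content stays intact until locked.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # waits for readers to finish
        handle.seek(0)
        handle.truncate()
        json.dump(hosts, handle)
        handle.flush()
        # Closing the file releases the lock.


def read_cache(path):
    """Read the cache under a shared lock; return [] if missing or empty."""
    try:
        with open(path) as handle:
            fcntl.flock(handle, fcntl.LOCK_SH)  # waits for a writer to finish
            return json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return []
```

Truncating only after the lock is held is the important detail: opening the file in "w" mode would empty it before the lock is acquired, which is exactly the window a concurrent reader could hit.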

nmap for discovery, not DNS or CMDB. I chose network scanning over querying a CMDB or DNS zone because the network scan tells you what’s actually there and responding, not what someone remembered to document. If a XenServer host responds on port 5666, it has NRPE and can be monitored. If it doesn’t respond, there’s nothing to monitor. The source of truth is the network.

Cron scheduling over Zabbix external checks. Running the collection as a Zabbix external check item would tie it to Zabbix’s polling scheduler, which introduces timing variability under load. A cron job fires on its own schedule regardless of how busy Zabbix is, and the trapper/sender pattern decouples data collection from Zabbix’s internal scheduling. The trade-off is that Zabbix has no direct way to know the collection script failed; the only signal is a nodata trigger after a configurable timeout.

What’s Monitored

17 metrics across two categories:

Host-Level Metrics:

CPU utilization (%)
Memory utilization (%)
System load (1-minute, 5-minute, 15-minute averages)
Disk usage per filesystem (%)
Network interface status and throughput

Dom0-Level Metrics (XenServer-specific):

Dom0 CPU utilization (%)
Dom0 memory utilization (%)
Dom0 load averages (1-min, 5-min, 15-min)

The template includes configurable warning and critical thresholds per metric. Some metrics have XenServer-specific thresholds — for example, Dom0 load averages have tighter limits (warning at 2.5, critical at 3.2 for 1-minute load) than general host CPU, because Dom0 performance directly impacts all hosted VMs.
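The threshold logic amounts to a two-level comparison. The Python sketch below uses the Dom0 1-minute load limits quoted above; in the template itself this would live in trigger expressions, and treating a value equal to the limit as a breach is an assumption.

```python
def classify(value, warning, critical):
    """Map a metric value to a severity, checking the critical limit first."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


# Dom0 1-minute load limits from the text above.
DOM0_LOAD1 = {"warning": 2.5, "critical": 3.2}

severity = classify(2.9, **DOM0_LOAD1)
```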

The discovery flow is network-based end to end: nmap scans port 5666, the script parses the responding hosts into a JSON LLD payload, and Zabbix auto-creates hosts from host prototypes linked to the NRPE template. The source of truth is what actually responds on the network.

Challenges and Trade-offs

NRPE response parsing. NRPE returns human-readable strings, not structured data. A CPU check might return OK - CPU Usage: 45% |cpu_usage=45%;80;90;0;100. The collection script parses the performance data section (after the | character) and extracts numeric values. Different NRPE plugins format their output differently, so the parser has to handle multiple formats gracefully and fall back to marking a metric as UNKNOWN if parsing fails.
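A minimal Python sketch of this kind of parser, assuming the common Nagios performance-data layout label=value[UOM];warn;crit;min;max after the | character. It is not the parser from query_xen_server.sh. Warning and critical fields are kept as strings because Nagios allows range syntax there, and output that cannot be parsed yields UNKNOWN.

```python
import re

# One perfdata token: label=value[UOM][;warn[;crit[;min[;max]]]]
PERF_TOKEN = re.compile(
    r"(?P<label>'[^']+'|[^\s=']+)="
    r"(?P<value>-?\d+(?:\.\d+)?)"
    r"(?P<unit>[A-Za-z%]*)"
    r"(?P<rest>(?:;[^;\s]*)*)"
)


def parse_perfdata(output):
    """Return {label: fields} parsed from the text after the first '|'."""
    if "|" not in output:
        return {}  # plugin emitted no performance data
    perf = output.split("|", 1)[1]
    metrics = {}
    for match in PERF_TOKEN.finditer(perf):
        fields = match.group("rest").split(";")[1:]  # warn, crit, min, max
        fields += [""] * (4 - len(fields))           # pad omitted fields
        warn, crit, low, high = fields[:4]
        metrics[match.group("label").strip("'")] = {
            "value": float(match.group("value")),
            "unit": match.group("unit"),
            "warn": warn, "crit": crit, "min": low, "max": high,
        }
    return metrics


def metric_value(output, label):
    """Return the numeric value for label, or "UNKNOWN" if it cannot be parsed."""
    metric = parse_perfdata(output).get(label)
    return metric["value"] if metric else "UNKNOWN"
```

For the CPU example above, metric_value(line, "cpu_usage") yields 45.0, while a response with no performance data section yields "UNKNOWN".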

Discovery timing. Host discovery runs every 24 hours, so a new XenServer host is not discovered and monitored until the next discovery cycle, which can be up to 24 hours after it comes online. For this customer, that delay was acceptable. The interval is configurable in the cron job.

Open-sourced for reuse. I published this on GitHub because the pattern — using NRPE via proxy-side scripts with zabbix_sender — is applicable to any environment with legacy Nagios/NRPE agents that needs to be integrated into Zabbix without installing new agents.

Outcome

25 XenServer hosts discovered and monitored automatically in production
17 metrics per host collected every minute via push architecture
Zero additional software installed on the hypervisor hosts
Host discovery detects new additions within 24 hours
Ran reliably in production, with the master trapper item and nodata triggers flagging a stalled collection once the configured nodata timeout elapsed
Open-sourced and documented for reuse across the organization and publicly

Tech Stack

Monitoring: Zabbix 7.0 (trapper items, LLD host prototypes, dependent items)
Scripts: Bash (discovery, metrics discovery, data collection)
NRPE: Nagios NRPE agent (pre-existing on XenServer hosts)
Discovery: nmap (network scanning on port 5666)
Data Delivery: zabbix_sender (push to trapper items)
Scheduling: Cron (1-minute collection, 24-hour discovery)
Hypervisor: Citrix XenServer