IBM FlashSystem Monitoring

Problem

IBM FlashSystem 9500 and 7300 storage arrays are critical infrastructure — they hold the data for everything else. When a storage pool fills up or a drive fails, the blast radius is enormous. But IBM’s built-in monitoring tools are oriented toward storage administrators, not MSP operations centers that need these arrays integrated into a centralized Zabbix monitoring platform alongside thousands of other devices.

I needed to build monitoring that discovers all storage entities automatically, tracks capacity and performance metrics, alerts on hardware failures, and requires zero maintenance after initial setup. I built it three times. Each version taught me something about where the real complexity hides.

Version 1: Python + systemd + zabbix_sender

The first version was the most obvious approach: a Python script running as a systemd service on the Zabbix proxy, polling the FlashSystem REST API on a schedule and pushing metrics to Zabbix via zabbix_sender.

It worked. The script authenticated to the FlashSystem API, fetched data from multiple endpoints, parsed the responses, and sent formatted metrics to Zabbix trapper items. But the operational overhead was significant:

Per-device systemd instances. Each FlashSystem required its own systemd service unit, its own configuration file, its own credentials file. Adding a new array meant creating new config files, new service units, enabling and starting them, and hoping nothing conflicted with existing instances.
External dependencies. Python 3, the requests library, and zabbix_sender all had to be installed and maintained on the proxy. When the proxy was rebuilt or upgraded, the monitoring scripts had to be reinstalled separately.
No native integration. The metrics appeared in Zabbix as trapper items, disconnected from the discovery and template system. Adding new metrics meant editing both the Python script and the Zabbix template.

Version 1 proved the concept and ran reliably in production, but I knew there was a better way.

Version 2: Zabbix HTTP Agent Items

For version 2, I eliminated all external dependencies by using Zabbix’s built-in HTTP Agent item type. HTTP Agent items make REST API calls natively from the Zabbix server or proxy — no external scripts needed.

The challenge was authentication. IBM FlashSystem uses two-step authentication: POST credentials to /rest/v1/auth, receive a token, then send that token in an X-Auth-Token header on every subsequent request. Zabbix’s HTTP Agent items are stateless: each item makes its own independent HTTP call, and there’s no native way to share an authentication token between items.

I solved this with a dependent item chain: one master item handles authentication, a preprocessing step extracts the token, and dependent items use that token via macro injection. It eliminated the Python dependency entirely, but the template became complex — dozens of items with intricate preprocessing chains, careful timing to ensure the auth item ran before the data items, and fragile dependencies between items.

Version 2 ran reliably but was difficult to maintain and harder to explain to other engineers who needed to troubleshoot it.

The arc across the three versions is a steady move from external scripts to pure Zabbix-native monitoring: v1 (Python + systemd + zabbix_sender) → v2 (HTTP Agent + dependent item chains) → v3 (a single JavaScript Script item).

Version 3: Single JavaScript Script Item (Production Winner)

Version 3 collapsed everything into a single Zabbix Script item containing embedded JavaScript. This is the version running in production.

One item. One script. Every five minutes, Zabbix executes the JavaScript, which:

Authenticates — POSTs credentials to the FlashSystem REST API and receives a token
Fetches data from 8 API endpoints using the token: lsmdiskgrp (pools), lsvdisk (volumes), lsnodestats (nodes), lssystem (system health), lsenclosure (enclosures), lsdrive (drives), lsportfc (FC ports), lsportip (iSCSI ports)
Returns combined JSON — all 8 responses merged into a single JSON object with a timestamp
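
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the production template script. The port (7443), the X-Auth-Username / X-Auth-Password login headers, the parameter names (ip, user, pass), and the injected makeRequest factory are all assumptions; the factory exists only so the sketch can also run outside Zabbix.

```javascript
// Endpoint names are taken from the list above.
var ENDPOINTS = ['lsmdiskgrp', 'lsvdisk', 'lsnodestats', 'lssystem',
                 'lsenclosure', 'lsdrive', 'lsportfc', 'lsportip'];

// POST to a URL and return the parsed JSON body. Any non-200 status throws,
// so a partial collection fails the whole item instead of returning gaps.
function postJson(makeRequest, url, headers) {
    var req = makeRequest();
    req.addHeader('Content-Type: application/json');
    for (var i = 0; i < headers.length; i++) {
        req.addHeader(headers[i]);
    }
    var body = req.post(url, '');
    if (req.getStatus() !== 200) {
        throw new Error('HTTP ' + req.getStatus() + ' from ' + url);
    }
    return JSON.parse(body);
}

function collect(params, makeRequest) {
    // Port 7443 is an assumption; adjust to the array's REST endpoint.
    var base = 'https://' + params.ip + ':7443/rest/v1/';

    // Step 1: authenticate and obtain a token (header names are assumed).
    var auth = postJson(makeRequest, base + 'auth', [
        'X-Auth-Username: ' + params.user,
        'X-Auth-Password: ' + params.pass
    ]);

    // Step 2: call every endpoint with the same token.
    var result = { timestamp: Math.floor(Date.now() / 1000) };
    for (var i = 0; i < ENDPOINTS.length; i++) {
        result[ENDPOINTS[i]] = postJson(makeRequest, base + ENDPOINTS[i],
                                        ['X-Auth-Token: ' + auth.token]);
    }

    // Step 3: return one combined JSON string.
    return JSON.stringify(result);
}
```

Inside a Script item, the body would presumably end with something like return collect(JSON.parse(value), function () { return new HttpRequest(); }); with the item parameters mapped to the three host macros.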

Six Low-Level Discovery rules, all of type DEPENDENT, parse the combined JSON to discover entities. Item prototypes extract individual metrics using JSONPath preprocessing. Trigger prototypes fire when pools exceed utilization thresholds, when drives fail or degrade, when nodes hit CPU limits, or when ports go offline.
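
As a rough illustration of what one dependent discovery rule does, a JavaScript preprocessing step could reshape the pool slice of the combined JSON into LLD rows. The sketch assumes the combined JSON stores each endpoint's response under its command name (lsmdiskgrp here) and that each pool object carries id and name fields. The macro names {#POOL_ID} and {#POOL_NAME} are hypothetical.

```javascript
// Turn the pool slice of the combined JSON into Zabbix LLD rows.
// `value` is the master item's output as a JSON string.
function poolDiscovery(value) {
    var pools = JSON.parse(value).lsmdiskgrp || [];
    var rows = [];
    for (var i = 0; i < pools.length; i++) {
        rows.push({
            '{#POOL_ID}': pools[i].id,      // hypothetical macro name
            '{#POOL_NAME}': pools[i].name   // hypothetical macro name
        });
    }
    return JSON.stringify(rows);
}
```

An item prototype could then pull a single field with a JSONPath expression along the lines of $.lsmdiskgrp[?(@.id=='{#POOL_ID}')].status.first(), again under the same naming assumptions.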

The entire monitoring solution is a single importable YAML file. To monitor a new FlashSystem, you create a host, link the template, set three macros ({$FS_IP}, {$FS_USER}, {$FS_PASS}), and wait five minutes. No scripts to install. No services to configure. No dependencies beyond Zabbix itself.

What’s Actually Monitored

In production, the template discovers and monitors:

Storage Pools: Status, total capacity, used capacity, utilization percentage (with warning at 80% and critical at 90%)
Volumes: 68 volumes discovered — status and capacity for each
Drives: 44 drives — status monitoring with alerts for failed or degraded states
Enclosures: Hardware status monitoring
Nodes: CPU utilization, compression CPU, total cache utilization, VDisk IOPS
FC Ports: 16 ports — status monitoring with alerts for inactive or offline states
iSCSI Ports: 8 ports (including failover interfaces) — state monitoring

Plus ICMP ping monitoring and SNMP trap integration for real-time hardware alerts.
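
The pool utilization thresholds amount to simple arithmetic. In the template they would live in trigger prototypes rather than in script code, so the function below only illustrates the logic. Treating the 80% and 90% boundaries as inclusive is an assumption, as are the function and field names.

```javascript
// Classify a pool from used and total capacity (both in the same unit).
function poolSeverity(used, total) {
    if (!(total > 0)) {
        return { utilization: null, severity: 'unknown' };
    }
    var pct = used / total * 100;
    var severity = 'ok';
    if (pct >= 90) {
        severity = 'critical';   // critical threshold from the list above
    } else if (pct >= 80) {
        severity = 'warning';    // warning threshold from the list above
    }
    // Round to two decimals for display.
    return { utilization: Math.round(pct * 100) / 100, severity: severity };
}
```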

The shape of the final architecture is deliberately flat: a single Zabbix Script item authenticates, fetches all 8 API endpoints, and returns combined JSON, which feeds six dependent discovery rules and their item prototypes — one script item, six discovery rules, zero external dependencies.

Key Engineering Decisions

JavaScript over Python inside Zabbix. Zabbix’s SCRIPT item type supports JavaScript execution using the built-in Duktape engine. The JavaScript runs inside the Zabbix server/proxy process — no external runtime needed. The trade-off is that Duktape is ECMAScript 5.1, not modern JS, and the available APIs are limited (no async/await, no Node.js modules). But for HTTP requests and JSON manipulation, it’s sufficient.

Single master item with dependent discovery. Rather than making 8 separate HTTP calls (which would require 8 separate auth tokens or a shared auth mechanism), one script makes all calls sequentially using a single token. The combined JSON result feeds every discovery rule and metric extraction. This means one API authentication per cycle instead of 8, reducing load on the FlashSystem.

Self-signed certificate handling. FlashSystem management interfaces use self-signed certificates. The script disables SSL peer and host verification via setVerifyPeer(false) and setVerifyHost(false). In an MSP environment where you monitor customer infrastructure whose certificates you don’t control, this is a pragmatic necessity.

Challenges and Trade-offs

The evolution was necessary. Each version taught me something I couldn’t have learned by reading documentation. Version 1 taught me the operational cost of external dependencies. Version 2 taught me the complexity ceiling of Zabbix’s item dependency chains. Version 3 exists because the experience of the first two versions informed its design.

Duktape limitations. The JavaScript engine inside Zabbix doesn’t support fetch() or XMLHttpRequest. You use Zabbix’s HttpRequest object instead, which has a different API. Error handling is limited: if a request fails partway through the 8-endpoint collection, the entire item fails and that cycle produces no data. I bound each run with a 30-second timeout macro and catch prolonged failures with a nodata trigger that fires if no data arrives for 15 minutes.

iSCSI port deduplication. The FlashSystem API returns iSCSI ports with a failover field that creates duplicate port entries. The discovery JavaScript handles this by appending _fo to failover port IDs, ensuring unique discovery keys. This was a production bug — ports were being discovered, lost, and rediscovered on every cycle until I added the failover suffix.
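
A minimal sketch of the deduplication idea, assuming the iSCSI slice of the combined JSON sits under lsportip, each entry has an id field, and failover entries are marked with failover equal to "yes". The macro names {#PORT_ID} and {#PORT_FAILOVER} are hypothetical, and the production discovery script may differ in detail.

```javascript
// Build LLD rows for iSCSI ports, giving failover entries a distinct key
// so the normal and failover rows for the same port id never collide.
function iscsiDiscovery(value) {
    var ports = JSON.parse(value).lsportip || [];
    var rows = [];
    for (var i = 0; i < ports.length; i++) {
        var isFailover = ports[i].failover === 'yes';   // assumed field value
        rows.push({
            '{#PORT_ID}': String(ports[i].id) + (isFailover ? '_fo' : ''),
            '{#PORT_FAILOVER}': isFailover ? 'yes' : 'no'
        });
    }
    return JSON.stringify(rows);
}
```

Without the suffix, two rows would share the same discovery key, which matches the discover, lose, rediscover churn described above.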

Outcome

Pure Zabbix-native monitoring with zero external dependencies
Monitoring FlashSystem arrays across multiple customer environments from one centralized platform
Template import + 3 macros = full monitoring within a single polling cycle
Open for use across the entire MSP — any engineer can deploy it without understanding the internals
The three-version evolution is now my go-to example of iterative simplification in monitoring design

The production Zabbix dashboard surfaces pool utilization, node CPU, volume status, and drive health for each FlashSystem at a glance.

Tech Stack

Monitoring: Zabbix 7.0 (Script items, JavaScript/Duktape engine)
API: IBM FlashSystem REST API v1 (8 endpoints)
Discovery: 6 Zabbix Low-Level Discovery rules (DEPENDENT type)
Alerting: SNMP traps (hardware events) + Zabbix triggers (threshold-based)
Previous iterations: Python 3 + systemd (v1), Zabbix HTTP Agent (v2)