Monitoring Platform Consolidation

Problem

A Nordic MSP operating across Denmark, Sweden, Norway, and Finland was running three separate enterprise monitoring platforms simultaneously: IBM Tivoli Monitoring for infrastructure, IBM Storage Protect (formerly TSM) with custom scripts for backup monitoring, and ScienceLogic SL1 for network and application monitoring. Each platform had its own infrastructure, its own operational procedures, and its own team of specialists. The licensing costs alone were significant, and the operational overhead of maintaining three platforms — each with different upgrade cycles, different alerting mechanisms, and different escalation paths — was unsustainable.

The decision was made to consolidate everything into Zabbix. As the Danish Lead Technical on this project, I was responsible for designing the target architecture, building the migration tooling, executing the migration for Danish customers, and coordinating with the Swedish, Norwegian, and Finnish teams.

Constraints

Scale. 3,000+ Windows servers, plus SQL databases, web applications, network devices, and storage arrays, had to be migrated without monitoring gaps.
Zero tolerance for monitoring blackouts. Every migrated host had to be verified as monitored in Zabbix before the legacy platform could be decommissioned for that host. This meant running parallel monitoring during transitions.
Five countries, different networks. Each country had different network architectures, firewall rules, and customer segmentation. The Zabbix architecture had to accommodate this without creating five independent silos.
Operational continuity. I was concurrently serving as ScienceLogic SL1 System Administrator during the migration — maintaining the very platform I was replacing. Bugs still needed fixing, dashboards still needed updating, and customers still needed support on the old platform while the new one was being built.
Three-year timeline (January 2023 – December 2025). Long enough to do it right, short enough that the business expected steady progress and regular deliverables.

Architecture

Zabbix Proxy Architecture

The core architectural decision was the proxy topology. Each country (and in some cases, each customer segment) gets its own Zabbix proxy. The proxy runs on Debian Linux with PostgreSQL as the local database, pgBouncer for connection pooling, and the full SNMP trap receiver chain (snmptrapd → SNMPTT → Zabbix SNMP trapper).

I wrote automated proxy deployment scripts for Debian 12/13 and RHEL 8/9 (some customers required Red Hat for compliance). The scripts handle everything: PostgreSQL installation with SCRAM-SHA-256 authentication, pgBouncer configuration in transaction pooling mode, Zabbix proxy installation and configuration, SNMP trap receiver setup, Zabbix Agent 2 with auto-registration UserParameters, and firewall rule validation.

The proxy install script configures each proxy with tuned parameters: 20 pollers started at launch, a 256 MB cache, a 5-minute configuration sync interval, and the SNMP trapper enabled. Each proxy registers itself with the central Zabbix server via auto-registration, carrying metadata about which country and customer segment it serves.
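As a rough illustration of the tuned parameters above, the following minimal Python sketch renders them as a zabbix_proxy.conf fragment. It is not the actual install script (which was written in Bash). The hostname, server address, and trap file path are hypothetical, and mapping the "256 MB cache" to CacheSize rather than another cache parameter is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxySettings:
    hostname: str                      # hypothetical proxy name
    server: str                        # hypothetical central server address
    start_pollers: int = 20            # pollers started at launch
    cache_size: str = "256M"           # assumed to mean the configuration cache
    config_frequency_s: int = 300      # 5-minute configuration sync
    snmp_trapper: bool = True
    trapper_file: str = "/var/log/snmptrap/snmptrap.log"  # assumed path


def render_proxy_conf(s: ProxySettings) -> str:
    """Render the tuned subset of zabbix_proxy.conf as text."""
    lines = [
        f"Hostname={s.hostname}",
        f"Server={s.server}",
        f"StartPollers={s.start_pollers}",
        f"CacheSize={s.cache_size}",
        f"ProxyConfigFrequency={s.config_frequency_s}",
        f"StartSNMPTrapper={1 if s.snmp_trapper else 0}",
    ]
    # The trap file only matters when the SNMP trapper process is started.
    if s.snmp_trapper:
        lines.append(f"SNMPTrapperFile={s.trapper_file}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_proxy_conf(ProxySettings("proxy-dk-shared", "zabbix.example.internal")))
```

Keeping the tuned values as defaults in one place means a per-country or per-segment proxy only needs to supply its name and server address.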

Auto-Registration System

Every Zabbix agent deployed in the environment carries a set of UserParameters that identify its organizational context: autoregister.company.id, autoregister.company.name, autoregister.company.service, autoregister.erp.id, and autoregister.templates. When an agent connects to its assigned proxy for the first time, the auto-registration action on the server reads these parameters and automatically places the host in the correct host groups, assigns the correct templates, and populates inventory fields.

This was essential for migration at scale. Rather than manually creating 3,000+ hosts in Zabbix, we deployed agents with the correct UserParameters and let auto-registration handle the rest.
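To make the mechanism concrete, here is a minimal sketch of how a deployment step might render the autoregister.* UserParameters into an agent configuration fragment. The use of a static echo command and the example values are assumptions for illustration. The real deployment may have produced these values differently.

```python
import re

# Values are restricted to a conservative character set because each one
# ends up inside a shell command executed by the agent.
_SAFE_VALUE = re.compile(r"[\w .,-]+")


def render_autoregister_params(values: dict[str, str]) -> str:
    """Render one UserParameter line per autoregister.* key."""
    lines = []
    for key, value in values.items():
        if not key.startswith("autoregister."):
            raise ValueError(f"unexpected key: {key!r}")
        if not _SAFE_VALUE.fullmatch(value):
            raise ValueError(f"unsafe value for {key}: {value!r}")
        # Zabbix syntax is UserParameter=<key>,<command>
        lines.append(f"UserParameter={key},echo {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are made up for the example.
    print(render_autoregister_params({
        "autoregister.company.id": "1001",
        "autoregister.company.name": "Example Customer",
        "autoregister.company.service": "shared",
        "autoregister.erp.id": "ERP-42",
        "autoregister.templates": "Windows by Zabbix agent",
    }))
```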

IBM Storage Protect Monitoring

One of the more complex migration targets was IBM Storage Protect (TSM) backup monitoring. The legacy monitoring used custom scripts scattered across backup servers. I built a complete Zabbix-native monitoring solution:

A discovery template that queries TSM’s SERVERS table to find all instances automatically
A wrapper script (storage_protect_dsmadmc.sh) that handles authentication, timeouts, retries, and output formatting for TSM CLI queries
Specialized collection scripts for pool utilization, volume status, backup job status, and container storage
Cron-driven collection with Zabbix trapper items for reliable data delivery
Host prototypes that auto-create a Zabbix host for each discovered TSM instance
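The timeout-and-retry behaviour of the wrapper can be sketched as follows. This is a Python illustration of the idea, not the contents of storage_protect_dsmadmc.sh. The function name, defaults, and comma-delimited output parsing are assumptions, and a real wrapper would likely need to treat some non-zero exit codes (for example an empty query result) as non-failures.

```python
import csv
import subprocess
import time


def run_with_retries(cmd, timeout_s=60.0, attempts=3, backoff_s=5.0):
    """Run cmd, retrying on timeout or non-zero exit.

    Returns stdout parsed as comma-delimited rows. Raises RuntimeError
    once all attempts are used up.
    """
    last_error = "no attempts made"
    for attempt in range(1, attempts + 1):
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            last_error = f"timed out after {timeout_s}s"
        else:
            if proc.returncode == 0:
                # Drop blank lines, which csv.reader yields as empty rows.
                return [row for row in csv.reader(proc.stdout.splitlines()) if row]
            last_error = f"exit code {proc.returncode}: {proc.stderr.strip()}"
        if attempt < attempts:
            time.sleep(backoff_s)
    raise RuntimeError(f"command failed after {attempts} attempts: {last_error}")


if __name__ == "__main__":
    import sys
    # Stand-in command; a real call would invoke dsmadmc with its
    # credentials and a query, which are deliberately not shown here.
    print(run_with_retries([sys.executable, "-c", "print('POOL1,42.5')"]))
```

Passing the command as an argument keeps the retry logic testable without a Storage Protect server at hand.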

This replaced a fragmented collection of Nagios checks and manual monitoring that had accumulated over the years.
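For the cron-driven trapper delivery, collected values have to be handed to zabbix_sender, which accepts an input file of whitespace-delimited "host key value" entries. A minimal sketch of building those lines is shown below. The host name and item keys are hypothetical, not the keys used in the actual templates.

```python
def _quote_if_needed(field: str) -> str:
    """Quote a field for a zabbix_sender input file when it needs it.

    Fields with whitespace, quotes, or backslashes are wrapped in
    double quotes, with backslash escaping inside the quotes.
    """
    if field and not any(c in field for c in ' \t\n"\\'):
        return field
    escaped = (
        field.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def sender_lines(host: str, values: dict[str, object]) -> list[str]:
    """Build one '<host> <key> <value>' line per collected value."""
    return [
        " ".join(_quote_if_needed(part) for part in (host, key, str(value)))
        for key, value in values.items()
    ]


if __name__ == "__main__":
    # Hypothetical host and keys, for illustration only.
    for line in sender_lines("tsm-instance-01", {
        "tsm.pool.util[BACKUPPOOL]": 71.4,
        "tsm.job.status[NIGHTLY]": "completed with warnings",
    }):
        print(line)
```

The resulting lines can be written to a file and passed to zabbix_sender with its input-file option from the cron job.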

Key Engineering Decisions

Proxy-per-segment, not proxy-per-customer. I considered a proxy for every customer but decided against it — the overhead of managing hundreds of proxy instances wasn’t justified. Instead, proxies are deployed per country or per customer segment (e.g., all Danish shared-service customers on one proxy, large dedicated customers on their own proxy). This keeps the proxy count manageable while maintaining network isolation where required.
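The assignment rule can be sketched in a few lines. The proxy naming scheme and the dedicated flag below are hypothetical, chosen only to show why the proxy count stays small under this policy.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    name: str
    country: str     # two-letter country code, e.g. "dk"
    dedicated: bool  # True for large customers that get their own proxy


def _slug(name: str) -> str:
    """Lowercase the name and collapse non-alphanumeric runs into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def proxy_for(customer: Customer) -> str:
    """Pick a proxy per segment rather than per customer."""
    if customer.dedicated:
        return f"proxy-{customer.country}-{_slug(customer.name)}"
    return f"proxy-{customer.country}-shared"


def proxy_count(customers: list[Customer]) -> int:
    """Number of distinct proxies needed for the given customers."""
    return len({proxy_for(c) for c in customers})


if __name__ == "__main__":
    # Made-up customers for the example.
    fleet = [
        Customer("Acme Shipping", "dk", dedicated=False),
        Customer("Nordic Retail", "dk", dedicated=False),
        Customer("Big Bank A/S", "dk", dedicated=True),
    ]
    for c in fleet:
        print(c.name, "->", proxy_for(c))
    print("proxies needed:", proxy_count(fleet))
```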

PostgreSQL over MySQL for proxy databases. Zabbix supports both, but PostgreSQL handles the concurrent write patterns of proxy data better, especially with pgBouncer in transaction pooling mode. The proxy databases sustain the steady high-frequency value stream from thousands of monitored hosts without connection exhaustion.

Automated agent deployment via BigFix. The MSP already had BigFix deployed across all managed endpoints. Rather than building a separate agent deployment pipeline, I used BigFix actions to deploy and configure Zabbix agents. This meant we could deploy agents to hundreds of servers in a single BigFix action, with the auto-registration UserParameters baked into the deployment.

SNMPTT for trap processing. I chose SNMPTT over raw snmptrapd logging because SNMPTT provides structured trap translation, variable binding extraction, and configurable formatting. This matters when you’re processing traps from Aruba APs, Avaya phone systems, storage arrays, and network switches — each with different MIB structures.
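Zabbix matches traps to hosts by looking for the ZBXTRAP marker followed by the source address in the trap file, which is why the SNMPTT output format matters. The sketch below groups trap file lines by source address the same way, which can be handy when checking that translated traps look right. The sample log lines and timestamp format are made up.

```python
import re
from collections import defaultdict

# A new trap starts on any line containing "ZBXTRAP <address>".
_ZBXTRAP = re.compile(r"ZBXTRAP\s+(\S+)\s*(.*)")


def group_traps_by_source(log_lines):
    """Group trap texts by source address.

    Lines without the marker are treated as continuations of the
    most recent trap; blank lines are ignored.
    """
    traps = defaultdict(list)
    current = None
    for raw in log_lines:
        line = raw.rstrip("\n")
        match = _ZBXTRAP.search(line)
        if match:
            current = match.group(1)
            traps[current].append(match.group(2))
        elif current is not None and line.strip():
            traps[current][-1] += "\n" + line.strip()
    return dict(traps)


if __name__ == "__main__":
    sample = [
        "2025-01-01T10:00:00+0100 ZBXTRAP 10.0.0.5 linkDown ifIndex=3",
        "  ifDescr=GigabitEthernet0/3",
        "2025-01-01T10:00:05+0100 ZBXTRAP 10.0.0.9 coldStart",
    ]
    for source, texts in group_traps_by_source(sample).items():
        print(source, texts)
```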

Challenges and Trade-offs

Verification at scale. The hardest part of any monitoring migration is proving that the new system catches everything the old system caught. I built verification frameworks that compared alert coverage between the legacy platforms and Zabbix: for each legacy check, verify that a Zabbix item exists, is collecting data, and has equivalent trigger thresholds. This was largely manual for the first wave and increasingly automated for subsequent waves.
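The three checks described above (item exists, item is collecting, thresholds match) can be expressed as a small comparison routine. This is a simplified sketch rather than the actual verification framework: the record types are hypothetical, and it assumes legacy checks and Zabbix items have already been exported and normalized to shared metric names, which in practice is the hard part.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LegacyCheck:
    host: str
    metric: str
    threshold: float


@dataclass(frozen=True)
class ZabbixItem:
    host: str
    metric: str
    last_value_age_s: Optional[float]  # None means no value ever collected
    threshold: Optional[float]         # None means no trigger on the item


def coverage_gaps(legacy, zabbix, max_age_s=900.0):
    """Return (check, reason) pairs for legacy checks not covered in Zabbix."""
    index = {(item.host, item.metric): item for item in zabbix}
    gaps = []
    for check in legacy:
        item = index.get((check.host, check.metric))
        if item is None:
            gaps.append((check, "no matching item"))
        elif item.last_value_age_s is None or item.last_value_age_s > max_age_s:
            gaps.append((check, "item not collecting data"))
        elif item.threshold is None:
            gaps.append((check, "no trigger"))
        elif item.threshold != check.threshold:
            gaps.append((check, "threshold mismatch"))
    return gaps


if __name__ == "__main__":
    legacy = [LegacyCheck("srv01", "cpu.util", 90.0), LegacyCheck("srv01", "disk.free", 10.0)]
    zabbix = [ZabbixItem("srv01", "cpu.util", 30.0, 95.0)]
    for check, reason in coverage_gaps(legacy, zabbix):
        print(check.host, check.metric, "->", reason)
```

A host would only be cleared for legacy decommissioning once this kind of comparison returns no gaps for it.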

Template sprawl. With 3,000+ servers across different operating systems, application stacks, and customer requirements, the Zabbix template library grew quickly. I established naming conventions and a hierarchical template structure (base OS template → application template → customer-specific overrides), but keeping templates organized across five countries with different contributors required constant governance.
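The layering idea behind the hierarchy is that a more specific level overrides a more general one. A minimal sketch of resolving effective threshold macros across the three levels is shown below. Zabbix applies its own macro precedence rules internally, so this only illustrates the intended ordering, and the macro names and values are examples.

```python
from collections import ChainMap


def effective_macros(base_os: dict, application: dict, customer: dict) -> dict:
    """Resolve macros with customer overrides winning over application,
    and application winning over the base OS template."""
    # ChainMap looks keys up in the first mapping that contains them.
    return dict(ChainMap(customer, application, base_os))


if __name__ == "__main__":
    base_os = {"{$CPU.UTIL.CRIT}": "90", "{$MEMORY.UTIL.MAX}": "90"}
    application = {"{$CPU.UTIL.CRIT}": "95"}    # e.g. a busy database tier
    customer = {"{$MEMORY.UTIL.MAX}": "85"}     # stricter customer requirement
    print(effective_macros(base_os, application, customer))
```

Keeping overrides this narrow, one or two macros per level, is part of what makes the template library governable as it grows.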

Running legacy and new in parallel. During migration waves, some hosts were monitored by both the legacy platform and Zabbix. This meant double the alert volume during transitions. We mitigated this by suppressing legacy alerts for hosts that had passed verification in Zabbix, but the coordination was manual and error-prone.

Outcome

Migrated a large fleet of servers, databases, applications, network devices, and storage arrays off three separate platforms into one unified Zabbix infrastructure
Retired the recurring licensing and operational overhead of running three parallel enterprise monitoring platforms
Collapsed three disjoint sets of dashboards, alerting mechanisms, and escalation paths into a single pane of glass
Designed and deployed Zabbix proxy infrastructure across five Nordic countries
Built automated proxy and agent deployment tooling used across the entire organization
Created the IBM Storage Protect monitoring solution now used for all TSM instances
Established template governance and auto-registration patterns adopted org-wide

Tech Stack

Monitoring: Zabbix 7.0 (server, proxies, agents)
Databases: PostgreSQL 17 + pgBouncer (proxy), PostgreSQL (server)
Operating Systems: Debian 12/13, RHEL 8/9 (proxies), Windows Server (agents)
SNMP: snmptrapd + SNMPTT (trap processing)
Deployment: BigFix (agent deployment), Bash (proxy automation scripts)
Legacy platforms replaced: IBM Tivoli Monitoring, IBM Storage Protect monitoring, ScienceLogic SL1