automation

Disk Cleanup Automation

Zabbix-triggered, change-governed disk remediation across the MSP fleet — .NET webhook receiver, ServiceNow Flow Designer, BigFix-delivered PowerShell, polling-based result fetch.

C#, .NET 8, ASP.NET Core, Polly, Serilog, ServiceNow Flow Designer, BigFix Client Query API, PowerShell, Zabbix, Change Management

Problem

Disk-space alerts were the most common ticket the MSP’s overnight on-call dealt with. Zabbix would fire, an engineer would log into a server, run a handful of cleanup commands, watch the free-space graph come back up, and write a couple of lines into the change record (or, more often, skip the change record because “it’s just temp files”). It was repetitive, error-prone, and quietly built up a debt of un-audited remediations across hundreds of customer servers.

The brief was simple in shape and unforgiving in detail: take the same alert, do the same cleanup, but make it auditable, governed, and unattended. Every cleanup needs to land in a change_request — created, approved through Standard-change auto-approval, executed, and closed — with a worknote chain a manager could read in the morning. The change record has to capture not just success/failure but a real diagnostic of what was using the disk, so the next time the alert fires we know whether it’s a deeper problem.

This is also the second automation built on the same ServiceNow + BigFix stack as the Hybrid Identity Automation work — different trigger, different work, but a chance to validate that the stack generalises.

Constraints

The interesting constraints came from the seams between systems:

  • The trigger is external. Zabbix has to call into ServiceNow without ServiceNow having any awareness of Zabbix. The webhook target needs to be reachable from the customer’s monitoring VLAN, authenticate without long-lived ServiceNow credentials, and never block the Zabbix template — Zabbix’s media-type retry budget is tiny.
  • No callback from BigFix. Same constraint as the AD work: BigFix accepts an action and runs it eventually. There is no webhook, no completion event, no return channel. The result has to be retrieved from a file the script writes on the endpoint.
  • MID Server bridging. ServiceNow runs in the cloud, BigFix sits on-prem behind firewalls. The MID Server is the only sanctioned bridge. The webhook receiver lives on the same Windows host as the MID Server (separate Windows Service, port 8080) — same host policy, different process.
  • BigFix’s ActionScript parser substitutes braces. Anywhere { and } appear in an embedded script, BigFix tries to evaluate them as relevance expressions. Escaping them by doubling ({{ for a literal {) works for short JSON payloads but fails at the scale of a 1,000+ line PowerShell script with hashtables, string interpolation and splatting on every other line.
  • 1,024-character ceiling on BigFix property values. The BigFix Client Query API returns the result file content via a relevance expression that reads the file. That value is capped — anything past 1,024 characters gets truncated. A real cleanup result with a useful diagnostic is 50–200 KB.
  • Audit trail is the deliverable, not a side effect. This automation exists because the prior pattern (engineer logs in, runs a script, doesn’t open a change) was failing audit. Every step has to be defensible after the fact: who triggered it, what payload, which script version ran, what was deleted, what was left, who closed the change.
  • Multi-tenant by design. The same .NET binary, the same ServiceNow scoped app, and the same BigFix root all serve every customer. Customer scope flows through the payload (erp_id → customer lookup) and through per-customer BigFix sites.
  • Idempotent against alert storms. A flapping disk fires the trigger every couple of minutes. The pipeline cannot create one CHG per fire — the on-call would wake to a hundred change records before the first cleanup finished.

Architecture

The pipeline has five layers, left to right:

  1. Zabbix template. A standard Windows-disk template carries the trigger last(/Host/vfs.fs.size[C:\,pused]) > 85 and a media type that POSTs JSON to the .NET receiver. Single-action policy, no recurring fires.
  2. .NET webhook receiver. Self-contained .NET 8 Windows Service running alongside the Java MID Server on the MID host. Listens on port 8080 with IP allowlist + X-API-Key middleware, validates payload shape, runs in-memory deduplication, transforms Zabbix’s PascalCase media-type fields into ServiceNow’s snake_case shape, and forwards to ServiceNow with a Polly retry/circuit-breaker policy.
  3. ServiceNow Scripted REST API. Endpoint /api/x_msp_disk_cleanup/zbx_bf_api/disk_cleanup in the scoped app. Resolves the customer from erp_id, looks up the target CI from CMDB, runs the layer-4 idempotency check (no active CHG with the same hostname + drive in the last 15 min), creates the change_request, and triggers the subflow inBackground(). Returns 202 Accepted with the change number.
  4. Flow Designer flow + governed CHG. A 17-step flow (triggered as a subflow by the REST API) that fetches the versioned PowerShell from a custom table, escapes JSON braces for the BigFix payload, Base64-encodes the entire PowerShell, resolves the BigFix credential through the ServiceNow Credential Resolver (backed by Azure Key Vault), builds the BES XML, POSTs via the MID Server, advances the CHG state through Standard-change auto-approval (-5 → -4 → -2 → -1 → 0), then closes the CHG (state → 3) with the cleanup summary in the worknotes. Result-fetch is delegated to a shared BigFix – Poll for Result (AKV) subflow rather than polled inline.
  5. BigFix-delivered execution. BigFix Root dispatches the action to the target’s BigFix agent. The agent stages payload.json and DiskCleanup.ps1.b64, runs certutil -decode to materialise the script, and waithidden powershell.exe runs it. The script writes a condensed result.json (under 1,024 chars, fetched via Client Query) and a full result.json + execution.log (kept on disk for the audit trail).

Live walkthrough — Zabbix alert → .NET webhook → ServiceNow CHG → BigFix → endpoint cleanup → Client Query result fetch → CHG closed
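
The receiver's field-name transformation in layer 2 can be illustrated with a small sketch. It is written in Python for brevity rather than the receiver's C#, and the payload keys other than MessageType and erp_id are hypothetical; the write-up does not list the real media-type fields.

```python
import re


def pascal_to_snake(name: str) -> str:
    """Convert a PascalCase key such as 'HostName' to 'host_name'."""
    # Underscore between a lowercase letter or digit and the capital that follows it.
    step = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Underscore between an acronym and the word that follows it: 'ERPId' -> 'ERP_Id'.
    step = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", step)
    return step.lower()


def transform_payload(payload: dict) -> dict:
    """Rename every top-level key; values pass through untouched."""
    return {pascal_to_snake(key): value for key, value in payload.items()}


# Hypothetical Zabbix payload, for illustration only.
example = {"HostName": "srv01", "DriveLetter": "C:", "ErpId": "10042", "MessageType": 1}
print(transform_payload(example))
```

Keeping the mapping purely mechanical means a new media-type field needs no receiver change, at the cost of the ServiceNow endpoint having to tolerate keys it does not recognise.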

Key Engineering Decisions

Base64-encode the PowerShell instead of escaping braces. BigFix’s ActionScript parser does relevance substitution on every {...} it sees. The legacy approach in the AD work escaped braces by doubling them ({{ for a literal {), which is fine for short JSON but fragile at the scale of a 1,000+ line PowerShell script. The fix is to never let the parser see the script content at all: encode the entire PowerShell as Base64 in the subflow, drop the .b64 file via createfile, and run certutil -decode on the endpoint to materialise the script. The BES XML carries an opaque blob; the parser is satisfied; the PowerShell on disk is byte-identical to what the developer wrote. This is the headline pattern of the project, and it’s reusable for every future automation that needs to ship a real script.
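
A minimal Python sketch of why this works: the Base64 alphabet contains no braces, so the encoded blob gives the parser nothing to substitute, and decoding restores the original bytes. The function names are illustrative, the cp1252 text encoding is an assumption, and the decode function only stands in for certutil -decode on the endpoint.

```python
import base64


def encode_script_for_bigfix(script_text: str) -> str:
    """Return the script as Base64 text (alphabet: A-Z, a-z, 0-9, '+', '/', '=')."""
    # Assumption: the script is treated as windows-1252 text before encoding.
    return base64.b64encode(script_text.encode("cp1252")).decode("ascii")


def decode_on_endpoint(encoded: str) -> bytes:
    """Stand-in for `certutil -decode`: recover the raw script bytes."""
    return base64.b64decode(encoded)


# A brace-heavy PowerShell fragment of the kind that breaks brace doubling.
script = '$params = @{ Path = "C:\\Windows\\Temp"; Recurse = $true }\nGet-ChildItem @params\n'

encoded = encode_script_for_bigfix(script)
print("{" in encoded or "}" in encoded)
print(decode_on_endpoint(encoded) == script.encode("cp1252"))
```

The trade-off is size: Base64 grows the payload by roughly a third, which is the price of keeping the script opaque to the parser.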

Two-tier result files. The Client Query API caps property values at 1,024 characters, but a useful disk-cleanup result with a diagnostic summary is 50–200 KB. The script writes two files: a condensed result.json at C:\_Masters\BigFix\DiskCleanup\result.json (fits the cap, fetched live by ServiceNow) and a full result.json + execution.log inside the per-CHG folder (C:\_Masters\BigFix\DiskCleanup\{CHG}\). The condensed file carries Success, SpaceFreedGB, a one-line DiagnosticSummary, and timestamps — enough to close the CHG. The full file is the audit artefact: every cleanup phase, every error, every diagnostic recommendation. ServiceNow records the path to it in the change worknotes.

Versioned PowerShell in a database table, not in the flow. The PowerShell body lives in x_msp_disk_cleanup_scripts rows, keyed by script_id and version with an active flag. The subflow’s first step queries this table at runtime. New script versions are inserted alongside the old one, smoke-tested via a manual runner against a test CHG, then activated by toggling active. No Update Set, no flow re-publish, no developer in the room when a fix lands. The contract between flow and script is the table shape — everything else is data.
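
The lookup-and-activate contract can be sketched in a few lines of Python. This models the table as an in-memory list rather than GlideRecord queries; the integer version type and the rule that exactly one row per script_id is active are assumptions, not details taken from the real table.

```python
from dataclasses import dataclass


@dataclass
class ScriptRow:
    script_id: str
    version: int      # assumption: versions are plain integers
    active: bool
    body: str


def get_active_script(rows, script_id):
    """Return the single active row for script_id, or fail loudly."""
    candidates = [r for r in rows if r.script_id == script_id and r.active]
    if len(candidates) != 1:
        raise LookupError(f"expected one active row for {script_id}, found {len(candidates)}")
    return candidates[0]


def activate(rows, script_id, version):
    """Toggle the active flag so that only the requested version is live."""
    if not any(r.script_id == script_id and r.version == version for r in rows):
        raise LookupError(f"{script_id} v{version} does not exist")
    for r in rows:
        if r.script_id == script_id:
            r.active = (r.version == version)


rows = [
    ScriptRow("disk_cleanup", 1, True, "# v1 body"),
    ScriptRow("disk_cleanup", 2, False, "# v2 body"),
]
activate(rows, "disk_cleanup", 2)
print(get_active_script(rows, "disk_cleanup").version)
```

Failing when zero or several rows are active is a deliberate choice in this sketch: silently picking one would hide exactly the kind of mistake the audit trail is meant to expose.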

Four layers of deduplication. A flapping disk can fire the trigger every minute. To make the pipeline safe against alert storms: (1) the Zabbix media type uses single-action policy with no recurring fires, (2) the .NET receiver filters out MessageType=0 (recovery) events, (3) the .NET receiver maintains an in-memory cache keyed on company|host|drive with a 15-minute TTL — duplicates return 202 { "Message": "Duplicate suppressed" } without forwarding, (4) the ServiceNow REST endpoint runs an idempotency check against change_request records in active states with the same hostname + drive in the same 15-minute window, returning the existing CHG instead of creating a new one. Layer 3 catches the same alert from the same Zabbix server within the cache lifetime; layer 4 catches the case where the .NET receiver was restarted (cache cleared) but the original CHG is still active.
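
Layer 3 can be sketched as a small TTL cache. This is a Python illustration rather than the receiver's C#; the case-insensitive key, the choice not to extend the TTL on a duplicate, and the injectable clock are assumptions made to keep the example testable.

```python
import time


class DedupCache:
    """In-memory duplicate filter keyed on company|host|drive."""

    def __init__(self, ttl_seconds=15 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry_by_key = {}

    def is_duplicate(self, company, host, drive):
        now = self._clock()
        # Drop expired entries so the cache cannot grow without bound.
        self._expiry_by_key = {
            key: expiry for key, expiry in self._expiry_by_key.items() if expiry > now
        }
        key = f"{company}|{host}|{drive}".lower()
        if key in self._expiry_by_key:
            return True  # caller answers 202 "Duplicate suppressed" without forwarding
        self._expiry_by_key[key] = now + self._ttl
        return False


cache = DedupCache()
print(cache.is_duplicate("acme", "srv01", "C:"))
print(cache.is_duplicate("acme", "srv01", "C:"))
```

Because the cache lives in process memory, a service restart empties it, which is the gap the layer-4 check in ServiceNow is described as covering.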

Auto-approval Standard change with a real audit trail. The CHG is created with type=standard, low risk/impact/urgency, and a populated implementation_plan, backout_plan, and risk_impact_analysis. ServiceNow’s Standard-change auto-approval policy advances it through review without human gating. The point isn’t to slow the automation down — it’s to put it in the same audit pipeline as every manual change, so reporting, search, and CAB visibility work the same way. Worknotes are written at start, after the BigFix POST, after polling completes, and at closure.

Two services on the MID host. The Java MID Server polls the ECC Queue for outbound BigFix REST work. The .NET webhook receiver listens for inbound Zabbix HTTP. Same Windows host (close to the customer’s monitoring + BigFix infrastructure for low latency), separate processes (different lifecycles, different security surfaces, different deployment cadence). The MID Server is upgraded by ServiceNow’s release schedule; the receiver is upgraded by Install-WebhookService.ps1 whenever the .NET binary changes.

Credential Resolver + Azure Key Vault for BigFix auth — no stored credential. The BigFix Basic Auth token is not a ServiceNow credential record any more. It is resolved through the ServiceNow Credential Resolver (com.snc.discovery.CredentialResolver, the WCM mechanism the newer automations standardised on), with the secret held in Azure Key Vault. A flow step (Get-BigFix-Basic-Auth-Token-From-KV, parameterised by a kv_secret_name flow input) pulls the secret, composes base64(user:pass), and returns it GlideEncrypter-encrypted so it travels between steps as an opaque string — nothing readable lands in the flow run log or the exported snapshot. Each BigFix action decrypts it in-memory at the moment of the HTTPS call via a global BigFixCredHelper script include, then nulls it. The helper has to live in global scope because GlideEncrypter (and gs.sleep, used by the poll loop) are only callable there, while the BigFix actions run in the scoped app — the helper is the cross-scope bridge. This is a deliberate separation from the older AD User Automation flow, which authenticates with a session login.

A reusable Poll-for-Result subflow. The earlier AD work polled BigFix inline — resolve Computer ID, submit Client Query, freshness-check — with the steps copy-pasted into each flow. Disk Cleanup factors that into a single shared subflow, BigFix – Poll for Result (AKV), which resolves the computer ID, polls, and returns the raw result; the parent flow keeps only the business logic (what to do with the result). The same subflow now backs Compliance Audit and the Veeam check, so a fix to the polling loop lands once. Its poll loop uses helper.sleep() rather than gs.sleep() specifically because gs.sleep is fenced inside scoped apps.

Challenges and Trade-offs

Polling latency vs operational simplicity. BigFix does not push Client Query results, so the flow polls /api/clientqueryresults/{id} every 2 seconds for up to 3 attempts after the action is submitted. If the cleanup takes longer than ~6 seconds (which it often does, since disk scans aren’t fast), the loop falls through with data_is_fresh=false and the next BigFix gather cycle picks it up. The trade-off is that the change can stay in state 0 (Implementing) for a few minutes before closing. A proper push channel from BigFix would close that gap; the operational simplicity of polling won out.
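
The bounded poll described above amounts to a short loop. The sketch below is Python, not the Flow Designer subflow; the fetch callable stands in for the call to /api/clientqueryresults/{id}, and returning None for "no fresh result yet" is an assumption about how staleness is signalled.

```python
import time


def poll_for_result(fetch, attempts=3, interval_seconds=2.0, sleep=time.sleep):
    """Poll `fetch` until it yields a result or the attempt budget runs out."""
    for attempt in range(1, attempts + 1):
        sleep(interval_seconds)          # wait first: the action has only just been submitted
        result = fetch()
        if result is not None:
            return {"data_is_fresh": True, "result": result, "attempts": attempt}
    # Budget exhausted (3 x 2 s = ~6 s): fall through instead of blocking the flow.
    return {"data_is_fresh": False, "result": None, "attempts": attempts}


# Demo with a fake endpoint that answers on the third poll; no real sleeping.
responses = iter([None, None, '{"Success": true}'])
outcome = poll_for_result(lambda: next(responses), sleep=lambda seconds: None)
print(outcome["data_is_fresh"], outcome["attempts"])
```

Passing the sleep function in as a parameter mirrors the helper.sleep() indirection mentioned earlier and makes the loop testable without waiting.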

Receiver-to-ServiceNow auth. The .NET receiver authenticates to ServiceNow with Basic Auth via a dedicated service account scoped to a single role. OAuth 2.0 client credentials would tighten tenant isolation further, but Basic Auth on a locked-down service account is defensible for the current boundary. The cost is that password rotation is manual: edit appsettings.json on the MID host, restart the Windows Service, validate via /health. The credential lives in a Windows Service private store.

Two languages, two services, two failure modes. Splitting into a .NET receiver and a ServiceNow scoped app means two deployment artefacts, two log streams, two restart policies. The receiver’s Serilog logs and the ServiceNow sys_journal_field worknotes have to be correlated by the request ID + change number. Worth it because the receiver does pre-ServiceNow work (validation, dedup, payload transformation) that doesn’t belong inside ServiceNow scripts, and because failing fast at the receiver layer keeps ServiceNow clean.

The 1,024-character ceiling shaped the result format. Without that cap, the script could write one rich JSON file and the flow could parse it. With the cap, the script has to know what to truncate, in what order, to stay under the limit while preserving the most valuable signal. The current truncation strategy (drop middle messages, shorten diagnostic summary, drop secondary errors) is empirically tuned to that goal.
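
An ordered truncation pass of this kind might look like the Python sketch below. It is not the production PowerShell: the Messages and Errors field names and the 120-character summary cut are hypothetical, and only Success, SpaceFreedGB and DiagnosticSummary come from the result format described earlier.

```python
import json

LIMIT = 1024  # Client Query property-value cap, in characters


def json_size(result: dict) -> int:
    """Length of the compact JSON serialisation."""
    return len(json.dumps(result, separators=(",", ":")))


def condense(result: dict, limit: int = LIMIT) -> dict:
    """Apply the three truncation steps in order, stopping once the result fits."""
    out = dict(result)
    out["Messages"] = list(result.get("Messages", []))
    out["Errors"] = list(result.get("Errors", []))
    # 1. Drop middle messages, always keeping the first and the last.
    while json_size(out) > limit and len(out["Messages"]) > 2:
        del out["Messages"][len(out["Messages"]) // 2]
    # 2. Shorten the diagnostic summary (120 characters is an arbitrary choice).
    if json_size(out) > limit:
        out["DiagnosticSummary"] = out.get("DiagnosticSummary", "")[:120]
    # 3. Drop secondary errors, keeping only the first.
    if json_size(out) > limit:
        out["Errors"] = out["Errors"][:1]
    return out


full = {
    "Success": True,
    "SpaceFreedGB": 12.4,
    "DiagnosticSummary": "x" * 300,
    "Messages": [f"phase {i}: " + "m" * 30 for i in range(50)],
    "Errors": ["e1", "e2", "e3"],
}
print(json_size(full), json_size(condense(full)))
```

The sketch does not guarantee a fit for pathological inputs (for example a single enormous error string), so a real implementation would still need a final hard cut or a failure path.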

Standard change auto-approval is a policy decision dressed as a config. Auto-approval works because the CHG is genuinely low risk — non-destructive cleanup, idempotent, well-tested. The day someone adds a riskier action to the same automation, the auto-approval becomes a problem. The mitigation is procedural: any change to the script body that touches Remove-Item or service stops triggers a manual review of the CHG template’s risk classification before activation.

Encoding landmine: windows-1252 sanitisation. The PowerShell is Base64-encoded and certutil -decoded on the endpoint, which writes the file as ANSI / windows-1252 with no BOM. Any em-dash, en-dash, smart quote, ellipsis, non-breaking space or stray high-codepoint character that survived into the script body comes back as mojibake after decode — and a mangled script fails in confusing ways at runtime, not at deploy time. The POST action runs an explicit sanitizeForWindows1252() pass over the script before encoding (normalising those characters and dropping anything above 0xFF). It’s the kind of defect that only shows up once a Scandinavian comment or a copy-pasted quote sneaks in, so it’s enforced in code rather than left to discipline.
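
A sanitisation pass along these lines can be sketched in Python. This is an analogue of sanitizeForWindows1252(), not its actual source; the specific ASCII replacements chosen for each character are assumptions.

```python
# Typographic characters that commonly sneak in via copy-paste, mapped to ASCII.
# The replacement choices are assumptions for illustration.
REPLACEMENTS = {
    "\u2014": "-",    # em dash
    "\u2013": "-",    # en dash
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
}


def sanitize_for_windows_1252(text: str) -> str:
    """Normalise known troublemakers, then drop anything above 0xFF."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    return "".join(ch for ch in text if ord(ch) <= 0xFF)


dirty = "Write-Host \u201cdone\u201d \u2014 see log\u2026"
print(sanitize_for_windows_1252(dirty))
```

Running the pass before Base64 encoding means the endpoint never sees the problem characters, so a failure shows up as a visible diff at submission time instead of a confusing runtime error.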

The Key Vault migration surfaced a snapshot-reconciliation gotcha. The Credential Resolver / Key Vault auth is in place, but recreating the final BigFix POST action in the UI during the migration meant the exported flow snapshot pointed at the old action sys_id — so the action body had to be re-pasted, re-published, and snapshot-reconciled to keep the export coherent with what runs. It is the kind of trap that bites anyone who edits a published flow action in the UI rather than through the export pipeline, and it is worth calling out as a standing discipline for this stack.

Outcome

The pipeline runs end-to-end across all five layers:

  • .NET webhook receiver: exercised against real Zabbix payloads, logging the full transformation chain — payload validation, dedup, and PascalCase-to-snake_case mapping — before it forwards to ServiceNow.
  • ServiceNow scoped app + REST endpoint: working end-to-end. It resolves the customer, looks up the target CI, runs the idempotency check, creates the CHG, triggers the subflow inBackground(), and returns 202 with the change number.
  • Subflow + BigFix submission + PowerShell execution + Client Query result fetch: working end-to-end against a target endpoint. The CHG advances Draft → Implementing → Closed and the worknote chain is written at every phase. The scripted force-close routes around a “can change state?” business rule that blocks the standard Update-Record close.
  • Credential Resolver / Azure Key Vault auth: the BigFix token is Key Vault-sourced and decrypted in-action, so no stored credential lives in ServiceNow.
  • Zabbix template: the YAML template carries the disk trigger and the MID Webhook media type, parameterised per customer instance.

What’s there in concrete terms:

  • One scoped app (x_msp_disk_cleanup) with a Scripted REST endpoint, a 17-step subflow, a custom script-versioning table, and a populated CHG template
  • One .NET 8 Windows Service with Polly retry, Serilog rolling logs, IP allowlist + X-API-Key, recovery-event filtering and in-memory deduplication (layers 2 and 3 of the four-layer scheme), dry-run mode, and /health + /metrics endpoints
  • One PowerShell script (~1,170 lines) covering 11 cleanup phases plus a read-only diagnostic scan that emits actionable recommendations (pagefile sizing, hibernation file, SQL on system drive, VSS shadow storage, stale profiles, vendor logs)
  • BigFix property + analysis definitions for result fetching, scoped per customer site

The deliverable that matters most is the audit trail: every cleanup now lands as a governed change_request with a full worknote chain instead of an engineer logging in and quietly running a script, which is exactly the gap the prior pattern failed on. The /metrics endpoint on the receiver surfaces request volume, dedup-suppressed count, retry attempts, and forward latency for operational visibility.

Tech Stack

  • Webhook receiver: C# .NET 8, ASP.NET Core (Kestrel, self-hosted), Polly (retry + circuit breaker), Serilog (rolling file + EventLog), Microsoft.Extensions.Hosting.WindowsServices
  • ServiceNow: Scoped app x_msp_disk_cleanup, Scripted REST API (sys_ws_operation), a 17-step Flow Designer flow + the shared BigFix – Poll for Result (AKV) subflow, GlideRecord for CMDB and CHG operations, ServiceNow Credential Resolver (com.snc.discovery.CredentialResolver) backed by Azure Key Vault for BigFix auth (GlideEncrypter via a global BigFixCredHelper), sn_fd.FlowAPI.getRunner() for background subflow trigger
  • Custom tables: x_msp_disk_cleanup_scripts (versioned PowerShell store)
  • BigFix: /api/actions for action submission, /api/query for Computer ID resolution, /api/clientquery + /api/clientqueryresults/{id} for result polling, custom property + analysis for the result file path
  • Endpoint script: PowerShell 5.1 (~1,170 lines), 11 cleanup phases (Windows Temp, user profile temp/caches, WU cache, Recycle Bin, IIS logs, crash dumps, CBS/DISM, WER, Installer, Font cache, ETL traces), read-only diagnostic scan (top folders, SQL files on C:, VSS shadow storage, large files, stale profiles, vendor logs)
  • Trigger: Zabbix 6+ template, media type MID Webhook, trigger last(/Host/vfs.fs.size[C:\,pused]) > 85
  • Change governance: Standard change_request with auto-approval, populated implementation/backout/risk plans, full worknote chain
  • Repo: github.com/sofus999/Disk-Cleanup