A Self-Diagnosing Operations System That Replaced a Full-Time Infrastructure Role
A hierarchical agent system that monitors 60+ live sites, diagnoses incidents on its own, proposes fixes, and executes them behind a human approval gate, taking over the job of a full-time technical operations hire.
Where they started.
The client runs a network of more than 60 live properties spread across many hosting accounts and several providers, plus individual servers. Keeping all of that healthy (uptime, bandwidth limits, DNS, email deliverability, WordPress issues, expiring domains) had been the job of a dedicated technical operations person. That role was expensive and, for a network this size, a single point of failure. The brief was to rebuild that entire function as an AI system that never sleeps, never forgets a site, and never touches anything destructive without a human saying yes.
What we did.
The design principle was strict separation between thinking and doing, and between diagnosis and destruction. Nothing runs permanently except a thin nervous system that catches events. The intelligence is spun up fresh for each incident, does exactly one job, reports, and exits. A single coordinator is the only thing a human talks to. It holds no site context of its own; it simply routes each incident to the one specialist built to handle it.
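The routing idea can be sketched in a few lines. This is a minimal illustration and not the production coordinator: the `Incident` fields, the fault-type keys, and the specialist functions are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Incident:
    site: str
    fault_type: str   # hypothetical keys: "vps", "panel", "dns", "platform"
    detail: str


# Hypothetical specialists. Each one is called fresh per incident,
# does one job, returns a report, and keeps no state afterwards.
def vps_specialist(incident: Incident) -> str:
    return f"[vps] {incident.site}: {incident.detail}"


def panel_specialist(incident: Incident) -> str:
    return f"[panel] {incident.site}: {incident.detail}"


def dns_specialist(incident: Incident) -> str:
    return f"[dns] {incident.site}: {incident.detail}"


def platform_specialist(incident: Incident) -> str:
    return f"[platform] {incident.site}: {incident.detail}"


SPECIALISTS: Dict[str, Callable[[Incident], str]] = {
    "vps": vps_specialist,
    "panel": panel_specialist,
    "dns": dns_specialist,
    "platform": platform_specialist,
}


def route(incident: Incident) -> str:
    """The coordinator holds no site context; it only picks a specialist."""
    handler = SPECIALISTS.get(incident.fault_type)
    if handler is None:
        # Unknown fault types are handed to a human rather than guessed at.
        return f"[coordinator] no specialist for '{incident.fault_type}'; escalating"
    return handler(incident)
```

Keeping the coordinator as a pure lookup means a new class of fault is handled by adding a specialist, with no change to the routing logic.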
Critically, every specialist diagnoses in read-only mode first. It looks, it works out what is actually wrong, and it proposes a fix. Anything that would change or delete something is posted as an explicit approval request before it can run. Cheap checks come before expensive ones: a domain lookup that needs no credentials instantly tells the system whether a fault sits at the registrar, the DNS, or the server, so no time or access is spent looking in the wrong place.
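Both habits, cheapest check first and approval before any change, can be sketched as follows. This is an illustration under assumptions: `Check`, `Proposal`, `ApprovalGate`, and the numeric `cost` field are hypothetical names, and the probes are stand-ins for real read-only lookups.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Check:
    name: str
    cost: int                            # relative cost; lower runs first
    probe: Callable[[], Optional[str]]   # read-only; returns a finding or None


@dataclass
class Proposal:
    description: str
    destructive: bool                    # True if it would change or delete anything
    apply: Callable[[], str]


def diagnose(checks: List[Check]) -> Optional[str]:
    """Run read-only checks cheapest first and stop at the first finding."""
    for check in sorted(checks, key=lambda c: c.cost):
        finding = check.probe()
        if finding is not None:
            return f"{check.name}: {finding}"
    return None


@dataclass
class ApprovalGate:
    pending: List[Proposal] = field(default_factory=list)

    def submit(self, proposal: Proposal) -> Optional[str]:
        """Run harmless actions now; hold destructive ones until approved."""
        if proposal.destructive:
            self.pending.append(proposal)
            return None
        return proposal.apply()

    def approve(self, index: int) -> str:
        """Called only after a human has said yes to the pending request."""
        return self.pending.pop(index).apply()
```

Because `diagnose` stops at the first finding, a fault found by the credential-free domain lookup means the more expensive server login is never attempted.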
The always-on layer is a lightweight event engine that catches alerts from uptime monitoring and from a dedicated monitoring inbox, enriches them (which site, which server, which type of fault), and hands them to the coordinator. The coordinator routes to a scoped specialist: one for virtual servers, one for hosting control panel operations, one for DNS and registrar issues, one for the site platform layer, and one for read-only review.
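The enrichment step might look something like the sketch below. Everything here is assumed for illustration: the `INVENTORY` mapping, the server names, and the keyword lists are hypothetical, and a real system would draw them from its own site records and alert formats.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class SiteRecord:
    server: str
    provider: str


# Hypothetical inventory mapping each site to where it is hosted.
INVENTORY: Dict[str, SiteRecord] = {
    "shop.example.com": SiteRecord(server="vps-01", provider="provider-a"),
    "blog.example.com": SiteRecord(server="shared-07", provider="provider-b"),
}

# Hypothetical keyword rules, checked in order; the first match wins.
FAULT_KEYWORDS: Tuple[Tuple[str, Tuple[str, ...]], ...] = (
    ("dns", ("nxdomain", "nameserver", "dns")),
    ("bandwidth", ("bandwidth",)),
    ("email", ("bounce", "smtp", "spf")),
    ("uptime", ("down", "timeout")),
)


def classify(message: str) -> str:
    """Guess the fault type from the alert text."""
    lowered = message.lower()
    for fault_type, words in FAULT_KEYWORDS:
        if any(word in lowered for word in words):
            return fault_type
    return "unknown"


def enrich(site: str, message: str) -> Dict[str, Optional[str]]:
    """Attach site, server, and fault type before handing off to the coordinator."""
    record = INVENTORY.get(site)
    return {
        "site": site,
        "server": record.server if record else None,
        "provider": record.provider if record else None,
        "fault_type": classify(message),
        "raw": message,
    }
```

An alert for a site missing from the inventory still passes through with `server` set to `None`, so nothing is silently dropped.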
Human oversight lives in a chat control plane. The coordinator posts its diagnosis there, requests approval for any destructive action, and logs every step to a thread. Under that sits a hard safety layer: guard hooks that block dangerous commands outright regardless of what any agent decides, so a stray delete or an out-of-window restart is stopped before it can ever run. A neat operational trick solved the multi-account email problem: one domain holds a forwarder per site, all funnelling into a single monitoring inbox with a naming scheme that tells the system which site each alert came from.
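A guard hook of this kind can be sketched as a plain function that sees every command before it runs. The patterns and the 02:00 to 04:00 maintenance window below are assumptions made for the example, not the client's actual rules.

```python
import re
from datetime import time
from typing import Tuple

# Hypothetical deny-list. The rm pattern is deliberately over-broad:
# it matches any rm whose first flag group contains an "r".
BLOCKED_PATTERNS = (
    re.compile(r"\brm\s+-\S*r", re.IGNORECASE),
    re.compile(r"\bdrop\s+(database|table)\b", re.IGNORECASE),
    re.compile(r"\bmkfs\b", re.IGNORECASE),
)

RESTART_PATTERN = re.compile(r"\b(reboot|shutdown|systemctl\s+restart)\b", re.IGNORECASE)

# Hypothetical maintenance window during which restarts are permitted.
WINDOW_START = time(2, 0)
WINDOW_END = time(4, 0)


def check_command(command: str, now: time) -> Tuple[bool, str]:
    """Return (allowed, reason). Runs before execution, whatever the agent decided."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(command):
            return False, f"blocked: matches {pattern.pattern!r}"
    in_window = WINDOW_START <= now < WINDOW_END
    if RESTART_PATTERN.search(command) and not in_window:
        return False, "blocked: restart outside maintenance window"
    return True, "allowed"
```

The check sits outside the agents and takes no input from them, so an approval in chat cannot override it: a blocked command stays blocked.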
What changed.
The system took over the work of a full-time infrastructure hire. It watches every site continuously, catches incidents the moment they surface, works out the true cause before acting, and carries out fixes only after a human approves them in chat. Because diagnosis is always read-only and destruction is always gated, the risk profile is far lower than handing standing access to one person. The result is 24/7 coverage of a 60+ site network, with a full audit trail of every action, at a fraction of the cost of the role it replaced.
Built with Agent SDK, n8n, Slack, uptime monitoring and email alerting, cPanel UAPI, SSH, and registrar and DNS APIs.
AI Automation
The same AI Automation discipline keeps your infrastructure standing: systems that watch, diagnose, and resolve incidents before a customer ever notices. We can wire it into your stack so uptime stops depending on someone being awake.