your watchdog died. nobody noticed for 8 hours.
you set up monitoring because you wanted to sleep at night. the monitor crashed at 8:31 in the morning and didn't come back until someone happened to look at 5pm. eight hours of zero supervision on a system that handles real work. here's how to make sure your safety net has a safety net.
what happened
we run a background process that watches our ai agent's core services. it checks every five minutes: is the gateway up? is the message bridge running? are the automation daemons alive? if something dies, it restarts it automatically and logs the event. it's the thing that lets us walk away from the computer.
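for readers who want to picture it, here's a minimal sketch of one watchdog pass in python. it is not our actual code: the `run_checks` name, the probe callables, and the `restart` callback are all hypothetical stand-ins for whatever health checks and restart commands your own services need.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("watchdog")


def run_checks(
    services: Dict[str, Callable[[], bool]],
    restart: Callable[[str], None],
) -> List[str]:
    """Run one pass: probe each service and restart the ones that are down.

    `services` maps a service name to a probe that returns True when healthy.
    Returns the names of the services that were restarted.
    """
    restarted: List[str] = []
    for name, is_up in services.items():
        try:
            healthy = is_up()
        except Exception:
            # A probe that raises is treated the same as a failed check.
            healthy = False
        if not healthy:
            logger.warning("%s is down, restarting", name)
            restart(name)
            restarted.append(name)
    return restarted
```

a scheduler would call `run_checks` every five minutes. the catch, as the rest of this post shows, is that this only helps while the scheduler itself is still running.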
yesterday morning around 8:31 am, the watchdog itself crashed, and it stayed down: the scheduled task on windows that it runs inside had been quietly disabled sometime earlier. nobody got an alert because the thing that sends the alerts was the thing that died.
for the next eight hours, our agent's main service went unsupervised. it happened to stay up on its own. but if it had crashed at noon, nobody would have known until someone manually checked at five o'clock. that's the nightmare scenario: eight hours of thinking you're covered when you're not.
the problem with a single layer of monitoring
monitoring only works when it's running. this sounds obvious. but the whole point of monitoring is that you stop looking at the thing directly because you trust the monitor. the moment the monitor dies, your confidence becomes a liability — you're sure everything is fine, but you haven't actually checked.
the most common way monitors die is boring. it's not a dramatic crash. it's a windows update that disables a scheduled task. a cloud function that hits its invocation limit. a free-tier uptime check that expires. a zapier automation that gets paused because your plan downgraded. the monitor doesn't explode — it quietly stops running, and the silence looks like health.
the longer the gap, the worse the recovery. if your monitor was down for ten minutes, you probably missed nothing. if it was down for eight hours, you have no idea what happened during that window. did a customer hit an error page? did a form submission vanish? did an automation silently fail? you can't answer because nobody was watching.
the fix: a dead man's switch for your monitor
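the principle is simple: your monitor should prove it's alive on a schedule, for example by writing a timestamped heartbeat every time it runs. if the proof stops arriving, something separate, running outside the monitor, notices and raises the alarm through its own channel.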
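one way to sketch that principle in python: the watchdog writes a timestamp to a heartbeat file after each pass, and a separately scheduled checker treats a missing or old timestamp as an alarm. the function names and the file layout here are illustrative assumptions, not our production setup, and the optional `now` parameter exists only so the logic can be tested without waiting.

```python
import time
from pathlib import Path
from typing import Optional


def write_heartbeat(path: Path, now: Optional[float] = None) -> None:
    """Called by the watchdog at the end of every pass it completes."""
    timestamp = time.time() if now is None else now
    path.write_text(str(timestamp))


def heartbeat_is_stale(
    path: Path, max_age_seconds: float, now: Optional[float] = None
) -> bool:
    """Called by a separate scheduler; True means the watchdog has gone quiet."""
    current = time.time() if now is None else now
    try:
        last_beat = float(path.read_text().strip())
    except (OSError, ValueError):
        # A missing or unreadable heartbeat counts as stale:
        # silence must never be mistaken for health.
        return True
    return current - last_beat > max_age_seconds
```

the design choice that matters is where the checker runs. if it lives in the same scheduled task, on the same machine, or behind the same account as the watchdog, one failure can take out both. with a fifteen-minute threshold you would call `heartbeat_is_stale(path, 900)` from something that shares as little as possible with the thing it is checking.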
what we changed after this
we added a rate-limited fallback alert that fires through a completely separate path when the primary watchdog hasn't written a heartbeat in fifteen minutes. we also adopted a rule: the watchdog file is canonical. there's exactly one, with no copies floating around that might get out of sync with the real one. and we test-killed the watchdog to make sure the fallback actually fires.
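the rate limit matters because a checker that runs every few minutes would otherwise page you on every pass while the watchdog is down. here's a rough sketch of that idea in python. the `RateLimitedAlert` class, the `send` callback, and the one-hour interval in the usage note are hypothetical; the post above doesn't pin down what interval we use.

```python
import time
from typing import Callable, Optional


class RateLimitedAlert:
    """Wraps an alert sender so repeat alerts are suppressed for a while."""

    def __init__(self, send: Callable[[str], None], min_interval_seconds: float):
        self._send = send
        self._min_interval = min_interval_seconds
        self._last_sent: Optional[float] = None

    def fire(self, message: str, now: Optional[float] = None) -> bool:
        """Send the alert unless one went out recently. Returns True if sent."""
        current = time.monotonic() if now is None else now
        recently_sent = (
            self._last_sent is not None
            and current - self._last_sent < self._min_interval
        )
        if recently_sent:
            return False
        self._send(message)
        self._last_sent = current
        return True
```

something like `RateLimitedAlert(send_sms, 3600)` would allow at most one alert an hour, where `send_sms` is whatever backup channel you trust. note that this version keeps its state in memory, so it assumes a long-running checker; if the checker is launched fresh on each run, the last-sent time would need to be stored somewhere that survives between runs.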
the embarrassing part isn't that the watchdog died. things die. the embarrassing part is that we had monitoring and still went eight hours without knowing. the gap between "monitoring is set up" and "monitoring is actually working right now" is the exact gap where small businesses lose hours, customers, and sleep.
agent hq monitors the monitor.
our agents run a dead man's switch on every watchdog process — if supervision stops, a separate alert fires through a backup channel within fifteen minutes. no more silent gaps.