your watchdog died. nobody noticed for 8 hours.

you set up monitoring because you wanted to sleep at night. the monitor crashed at 8:31 in the morning and didn't come back until someone happened to look at 5 pm. eight hours of zero supervision on a system that handles real work. here's how to make sure your safety net has a safety net.

MONITOR → DEAD → 8h SILENCE

what happened

we run a background process that watches our ai agent's core services. it checks every five minutes: is the gateway up? is the message bridge running? are the automation daemons alive? if something dies, it restarts it automatically and logs the event. it's the thing that lets us walk away from the computer.
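
a minimal sketch of one pass of a watchdog like this, in python. the service names, the `run_checks` function, and the injected `restart` and `log` callables are hypothetical stand-ins for illustration, not our actual code. a scheduler is assumed to call it every five minutes.

```python
import time
from typing import Callable, Dict, List


def run_checks(
    checks: Dict[str, Callable[[], bool]],
    restart: Callable[[str], None],
    log: Callable[[str], None] = print,
) -> List[str]:
    """Run one watchdog pass and return the names of services that were restarted."""
    restarted = []
    for name, is_up in checks.items():
        try:
            healthy = is_up()
        except Exception:
            healthy = False  # a health check that raises is treated as "down"
        if not healthy:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log(f"{stamp} {name} is down, restarting")
            restart(name)
            restarted.append(name)
    return restarted
```

a caller would pass one health check per service, for example `run_checks({"gateway": gateway_is_up, "bridge": bridge_is_up}, restart=restart_service)`, where the three callables are whatever fits your stack. note what the sketch cannot do: if the scheduler stops calling it, nothing here notices.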

yesterday morning around 8:31 am, the watchdog itself crashed. the scheduled task on windows that it runs inside had been quietly disabled sometime earlier, so nothing started it again. nobody got an alert because the thing that sends the alerts was the thing that died.

for the next eight hours, our agent's main service went unsupervised. it happened to stay up on its own. but if it had crashed at noon, nobody would have known until someone manually checked at five o'clock. that's the nightmare scenario: eight hours of thinking you're covered when you're not.

why this matters for small businesses
if you run any kind of monitoring — uptime checks, zapier error alerts, hubspot workflow notifications, stripe webhook health pings — you are trusting a system to tell you when things break. but that system itself can break. and when it does, you don't get an alert. you get silence. silence that feels exactly like "everything is fine."

the problem with a single layer of monitoring

monitoring only works when it's running. this sounds obvious. but the whole point of monitoring is that you stop looking at the thing directly because you trust the monitor. the moment the monitor dies, your confidence becomes a liability — you're sure everything is fine, but you haven't actually checked.

the most common way monitors die is boring. it's not a dramatic crash. it's a windows update that disables a scheduled task. a cloud function that hits its invocation limit. a free-tier uptime check that expires. a zapier automation that gets paused because your plan downgraded. the monitor doesn't explode — it quietly stops running, and the silence looks like health.

the longer the gap, the worse the recovery. if your monitor was down for ten minutes, you probably missed nothing. if it was down for eight hours, you have no idea what happened during that window. did a customer hit an error page? did a form submission vanish? did an automation silently fail? you can't answer because nobody was watching.

the fix: a dead man's switch for your monitor

the principle is simple: your monitor should prove it's alive on a schedule. if the proof stops arriving, something separate notices.

01
make your monitor write a heartbeat. every time it runs successfully, it records a timestamp: in a file, in a database row, or by calling a webhook. the content doesn't matter. the timestamp does. if the timestamp is current, the monitor is alive. if it's stale, something is wrong.
02
set up a second check that only reads the heartbeat. this is the dead man's switch. it runs on a completely different system — a different server, a different service, a different provider. all it does is read the heartbeat timestamp and compare it to the current time. if the gap is too large (say, more than twice the expected interval), it fires an alert through a different channel than the monitor uses.
03
use a different alert channel for the dead man's switch. if your monitor sends alerts via email, the dead man's switch should text you. if the monitor posts to slack, the switch should email you. the whole point is that when one channel is dead, the backup reaches you through a completely separate path.
04
test it by killing the monitor on purpose. once a month, disable your watchdog or uptime check deliberately. wait for the dead man's switch to fire. if you get the alert, the system works. if you don't, your backup is broken and you're back to hoping — which is exactly where you started before you set up monitoring in the first place.
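
steps 01 and 02 can be sketched in a few lines of python. this is an illustration under assumptions, not a finished tool: the function names and the plain-text file format are made up for the example, and the heartbeat lives in a local file only for brevity. in a real setup it would need to live somewhere the second system can read it, such as a database row or a webhook receiver.

```python
import time
from pathlib import Path
from typing import Optional

EXPECTED_INTERVAL_S = 5 * 60         # the monitor runs every five minutes
MAX_AGE_S = 2 * EXPECTED_INTERVAL_S  # "more than twice the expected interval"


def write_heartbeat(path: Path, now: Optional[float] = None) -> None:
    """Step 01: the monitor calls this at the end of every successful run."""
    ts = time.time() if now is None else now
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(f"{ts:.0f}\n")
    tmp.replace(path)  # rename into place so a reader never sees a half-written file


def heartbeat_is_stale(
    path: Path, now: Optional[float] = None, max_age_s: float = MAX_AGE_S
) -> bool:
    """Step 02: the separate checker calls this. Missing or unreadable counts as stale."""
    current = time.time() if now is None else now
    try:
        last = float(path.read_text().strip())
    except (OSError, ValueError):
        return True
    return (current - last) > max_age_s
```

treating a missing or unreadable heartbeat as stale is deliberate: the failure you are guarding against is silence, so anything short of a fresh timestamp should raise the alarm. when `heartbeat_is_stale` returns true, the checker sends its alert through the second channel from step 03.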
the cheap version
if you don't want to build a second system, there's a simpler version: put a recurring calendar event on your phone (once a week, same time) that says "check if the monitor is running." open the dashboard, look at the last-run timestamp, confirm it's fresh. it's manual, and a dead monitor can still go unnoticed for up to a week, but that beats assuming silence means health.

what we changed after this

we added a rate-limited fallback alert that fires through a completely separate path when the primary watchdog hasn't written a heartbeat in fifteen minutes. we also adopted a rule: the watchdog file is canonical. there's exactly one. no copies floating around that might get out of sync with the real one. and we test-killed it to make sure the fallback actually fires.
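
a rough sketch of what a rate-limited fallback alert can look like, in python. the fifteen-minute threshold comes from the paragraph above. everything else is an assumption for illustration: the class and function names, the one-hour minimum gap between alerts, and the `send` callable that stands in for whatever backup channel you use.

```python
import time
from typing import Callable, Optional

STALE_AFTER_S = 15 * 60  # no heartbeat for fifteen minutes means trouble


class RateLimitedAlert:
    """Wrap a send function so repeat alerts are suppressed for min_gap_s seconds."""

    def __init__(self, send: Callable[[str], None], min_gap_s: float = 3600.0):
        self._send = send
        self._min_gap_s = min_gap_s  # assumed value; tune to taste
        self._last_sent: Optional[float] = None

    def fire(self, message: str, now: Optional[float] = None) -> bool:
        current = time.time() if now is None else now
        if self._last_sent is not None and current - self._last_sent < self._min_gap_s:
            return False  # an alert went out recently, stay quiet
        self._send(message)
        self._last_sent = current
        return True


def fallback_check(
    last_heartbeat_ts: float, alert: RateLimitedAlert, now: Optional[float] = None
) -> bool:
    """Return True if an alert was actually sent on this check."""
    current = time.time() if now is None else now
    age = current - last_heartbeat_ts
    if age > STALE_AFTER_S:
        return alert.fire(f"watchdog heartbeat is {age / 60:.0f} min old", now=current)
    return False
```

the rate limit matters because the fallback keeps running while the watchdog is down. without it, a check every few minutes would send the same alert over and over until someone fixes the problem, and a flood of identical messages is easy to mute and forget.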

the embarrassing part isn't that the watchdog died. things die. the embarrassing part is that we had monitoring and still went eight hours without knowing. the gap between "monitoring is set up" and "monitoring is actually working right now" is the exact gap where small businesses lose hours, customers, and sleep.

bottom line
monitoring only protects you while it's running. the moment it stops, silence feels like health. add a dead man's switch — a second system that proves your monitor is alive. test it by breaking the monitor on purpose. if the alert doesn't come, you're not monitored. you're just hoping.

agent hq monitors the monitor.

our agents run a dead man's switch on every watchdog process — if supervision stops, a separate alert fires through a backup channel within fifteen minutes. no more silent gaps.
