MisLinux: How I Built an External Watchdog So My Cluster Is Monitored From the Outside Too

By Vicente Arteaga Gomez

MisLinux · Last updated: April 20, 2026

This article is part 12 of my MisLinux series on Kubernetes on Hetzner. It is based on a real operational setup I use. I am not affiliated with Synology, Grafana Labs, Hetzner, or any vendor mentioned here.

When people say "the cluster is monitored," they often mean "the cluster is monitoring itself." That is useful, but it is not the same thing.

An in-cluster monitoring stack can tell you a great deal about node pressure, pod restarts, internal health endpoints, and application metrics. What it cannot prove by itself is that the cluster is still reachable from the outside world under the same conditions your users see.

That gap matters more than it sounds.

Why internal monitoring is not enough on its own

If all of your monitoring, alerting, and dashboards live inside the system being monitored, you inherit its blind spots:

a networking break can isolate the cluster and the monitoring at the same time
DNS or ingress issues can make public services fail while internal health checks still pass
a bad change can break external reachability without breaking pod-level liveness

In other words, the cluster can be healthy in its own opinion while being unavailable in the way that actually matters.

That is why I added a small external watchdog running outside the cluster.

What I wanted the watchdog to do

I did not want a second full observability stack. I wanted a narrow, opinionated external view:

does the public hostname resolve?
does TLS still work?
does the service answer quickly enough?
if it fails, can I get a notification even if the cluster itself is confused?

That is a very different job from Prometheus inside the cluster. The external watchdog is not for rich diagnosis. It is for independent truth.

The design I ended up with

I run a lightweight watchdog process outside the cluster on a small always-on system. Its responsibilities are intentionally boring:

poll the critical public endpoints
enforce explicit connection and response timeouts
send a Telegram message when a check fails
emit a periodic heartbeat back toward the cluster so the cluster can notice if the watchdog itself disappears

That last step matters. An external watchdog that silently dies is not much of a watchdog.

So the design becomes two-way:

Direction	What it proves
External watchdog -> public services	The internet-facing path still works from outside the cluster
Cluster monitoring -> watchdog heartbeat	The watchdog itself is still alive and checking in

Together, those signals are much more trustworthy than either one alone.
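The cluster-side half of that table can be as small as a staleness test. The following is a minimal sketch, assuming the time of the last received heartbeat is stored somewhere the cluster's alerting can read it. The function name, the five-minute interval in the usage note, and the tolerance of two missed beats are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_stale(
    last_seen: datetime,
    now: datetime,
    interval: timedelta,
    missed_allowed: int = 2,  # hypothetical tolerance before alerting
) -> bool:
    """Return True once more than `missed_allowed` intervals have passed
    since the watchdog last checked in."""
    return now - last_seen > interval * missed_allowed
```

With a five-minute interval, a heartbeat last seen nine minutes ago is still considered fresh, while one last seen eleven minutes ago is stale. Tolerating a missed beat or two keeps a single dropped request from paging anyone.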

What the watchdog checks

I keep the checks simple and explicit:

public service health endpoints or equivalent minimal URLs
DNS resolution for important hostnames
TLS validity and impending certificate expiry
response-time budget, not just up/down state

This is deliberately narrower than full application monitoring. If I want latency histograms or per-route error rates, Prometheus is the right place. If I want to know whether the service is still reachable from outside the cluster without trusting the cluster to tell me, the watchdog is the right place.
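The checks listed above can be sketched with nothing but the Python standard library. This is an illustration under assumptions, not a definitive implementation: the function names, the failure labels, and the default timeout and budget values are hypothetical. Note also that `urllib` takes a single timeout that applies to each blocking socket operation, so it only approximates separate connection and response timeouts.

```python
import http.client
import socket
import ssl
import time
import urllib.error
import urllib.request


def classify_error(exc: BaseException) -> str:
    """Map an exception raised by urlopen to a coarse failure kind."""
    if isinstance(exc, urllib.error.HTTPError):  # raised for 4xx/5xx responses
        return "bad_status"
    # URLError wraps the underlying socket or ssl error in .reason
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(reason, socket.gaierror):
        return "dns"
    if isinstance(reason, ssl.SSLError):
        return "tls"
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout"
    return "connection"


def check_endpoint(url: str, timeout_s: float = 5.0, budget_s: float = 2.0) -> tuple[str, float]:
    """Fetch `url` once and return (result, elapsed_seconds).

    The result is "ok", "slow" (answered, but over the response-time
    budget), or one of the failure kinds from classify_error.
    """
    start = time.monotonic()
    try:
        # The timeout covers each blocking socket operation (connect, read);
        # it is not a deadline for the whole request.
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()
    except (OSError, http.client.HTTPException) as exc:
        return classify_error(exc), time.monotonic() - start
    elapsed = time.monotonic() - start
    return ("slow" if elapsed > budget_s else "ok"), elapsed


def days_left(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter string."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def cert_days_left(host: str, port: int = 443, timeout_s: float = 5.0) -> float:
    """Connect to host:port, verify the certificate, and return days until expiry."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return days_left(not_after, time.time())
```

Returning "slow" as its own result keeps the response-time budget separate from hard failures, so an endpoint that answers late is reported differently from one that does not answer at all.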

Why I like Telegram for this role

I wanted something easy to receive on a phone, easy to automate, and not dependent on setting up yet another mail or paging system. Telegram is a good fit for that:

straightforward bot API
fast delivery
readable messages on mobile
low friction for a small operational setup

The important part is not the specific chat platform. The important part is that the alert message explains itself clearly enough that I do not have to mentally reconstruct the system at the moment I receive it.
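Sending the message itself is a single HTTPS request to the Bot API's `sendMessage` method. Below is a minimal sketch using only the standard library. The helper names and the ten-second timeout are assumptions, and the bot token and chat ID are placeholders you would supply yourself.

```python
import urllib.parse
import urllib.request


def build_telegram_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a form-encoded POST for the Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    body = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def send_telegram(token: str, chat_id: str, text: str, timeout_s: float = 10.0) -> bool:
    """Send one message; return True on HTTP 200, False on any network or HTTP error."""
    req = build_telegram_request(token, chat_id, text)
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # URLError and HTTPError are OSError subclasses
        return False
```

Keeping the request builder separate from the network call makes the message format easy to test without contacting Telegram. The alert path gets its own explicit timeout as well, so a slow notification cannot stall the polling loop.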

What makes the alerts useful instead of noisy

The external watchdog helped me reinforce a monitoring principle I already believed: alerts should explain what the failure probably means and what to check first.

A useful alert says something like:

which public service failed
how it failed (timeout, DNS failure, bad status, TLS issue)
what the first operator check should be

That is much better than a generic "service down" message, especially when the point of the watchdog is to be the first signal you receive.
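One way to produce a message with those three parts is a small lookup from failure kind to first check. The sketch below is illustrative only: the failure labels, the hint texts, and the message layout are assumptions, and the hints in particular should be replaced with whatever is true for your own setup.

```python
# Hypothetical hints; adapt each one to your own environment.
FIRST_CHECK = {
    "dns": "look up the hostname from an outside resolver and compare with the DNS records",
    "tls": "inspect the served certificate and its expiry date",
    "timeout": "test whether the ingress answers from outside at all",
    "bad_status": "check the backing pods and the most recent deploy",
}
DEFAULT_FIRST_CHECK = "confirm the failure from a second external network"


def format_alert(service: str, url: str, kind: str, detail: str = "") -> str:
    """Render an alert that names the service, the failure mode, and a first check."""
    lines = [f"[WATCHDOG] {service} failed: {kind}", f"URL: {url}"]
    if detail:
        lines.append(f"Detail: {detail}")
    lines.append(f"First check: {FIRST_CHECK.get(kind, DEFAULT_FIRST_CHECK)}")
    return "\n".join(lines)
```

Because the hint is chosen at alert time, the message already carries the first operator step when it arrives on the phone.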

Why I did not replace the in-cluster monitoring

The external watchdog is not a substitute for Prometheus, Grafana, node metrics, or per-service health instrumentation. It is a complement.

Prometheus answers questions like:

which node is filling up?
which pod restarted?
did the request rate collapse gradually or instantly?
is this a localized workload issue or a broader cluster issue?

The watchdog answers a much simpler question:

if I were outside the cluster right now, would the service look alive?

That division of labor is exactly why the pairing works.

The real value during incidents

The biggest benefit is not prettier monitoring architecture. It is incident classification.

When the external watchdog reports a failure and the internal cluster metrics are also sick, I know I am dealing with something broad.

When the external watchdog reports a failure but the internal metrics still look normal, I immediately suspect:

DNS
ingress
certificate
external routing
origin reachability from outside the cluster

That narrows the response path quickly.

Likewise, if the cluster starts alerting that the external watchdog heartbeat has stopped, I know the problem may be on the watchdog side rather than the service side, and that my outside view is blind until it is fixed.
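The triage logic in this section amounts to a small decision table. Here it is as a sketch, with hypothetical label names, assuming each of the three signals has already been reduced to a boolean:

```python
def classify_incident(external_ok: bool, internal_ok: bool, heartbeat_fresh: bool) -> str:
    """Combine the outside view, the inside view, and the watchdog heartbeat."""
    if not heartbeat_fresh:
        # The external result cannot be trusted if the watchdog stopped checking in.
        return "watchdog_suspect"
    if external_ok and internal_ok:
        return "healthy"
    if not external_ok and not internal_ok:
        return "broad_failure"
    if not external_ok:
        # Internal metrics look normal: suspect DNS, ingress, certificate, or routing.
        return "edge_failure"
    # Reachable from outside, but internal metrics are unhealthy.
    return "internal_only"
```

The heartbeat is evaluated first because a stale watchdog makes the external signal meaningless, whichever way it last reported.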

What I would recommend to small-cluster operators

If you self-manage a small production cluster, my recommendation is not "build a second observability platform." It is "give yourself one independent external truth source."

The minimum useful version can be very small:

a handful of critical URL checks
explicit timeouts
one alert channel
one heartbeat back into the cluster

That is enough to cover a whole class of failures that purely internal monitoring does not see clearly.

Final thought

The reason I like an external watchdog is not that it is sophisticated. It is that it is independent.

Operationally, independence is what gives the signal value. A cluster that says "I am healthy" is useful information. Independent confirmation that it is reachable from the outside is better information. When both views agree, I trust them more. When they disagree, I know where to investigate next.

That is the real job of the watchdog: not to replace the internal monitoring stack, but to stop it from being the only voice in the room.


Monday, April 27, 2026
