By Vicente Arteaga Gomez
MisLinux · Last updated: April 20, 2026
This article is part 12 of my MisLinux series on Kubernetes on Hetzner. It is based on a real operational setup I use. I am not affiliated with Synology, Grafana Labs, Hetzner, or any vendor mentioned here.
When people say "the cluster is monitored," they often mean "the cluster is monitoring itself." That is useful, but it is not the same thing.
An in-cluster monitoring stack can tell you a great deal about node pressure, pod restarts, internal health endpoints, and application metrics. What it cannot prove by itself is that the cluster is still reachable from the outside world under the same conditions your users see.
That gap matters more than it sounds.
Why internal monitoring is not enough on its own
If all of your monitoring, alerting, and dashboards live inside the system being monitored, you inherit its blind spots:
- a networking break can isolate the cluster and the monitoring at the same time
- DNS or ingress issues can make public services fail while internal health checks still pass
- a bad change can break external reachability without breaking pod-level liveness
In other words, the cluster can be healthy in its own opinion while being unavailable in the way that actually matters.
That is why I added a small external watchdog running outside the cluster.
What I wanted the watchdog to do
I did not want a second full observability stack. I wanted a narrow, opinionated external view:
- can the public URL resolve?
- does TLS still work?
- does the service answer quickly enough?
- if it fails, can I get a notification even if the cluster itself is confused?
That is a very different job from Prometheus inside the cluster. The external watchdog is not for rich diagnosis. It is for independent truth.
The design I ended up with
I run a lightweight watchdog process outside the cluster on a small always-on system. Its responsibilities are intentionally boring:
- poll the critical public endpoints
- enforce explicit connection and response timeouts
- send a Telegram message when a check fails
- emit a periodic heartbeat back toward the cluster so the cluster can notice if the watchdog itself disappears
That last step matters. An external watchdog that silently dies is not much of a watchdog.
So the design becomes two-way:
| Direction | What it proves |
|---|---|
| External watchdog -> public services | The internet-facing path still works from outside the cluster |
| Cluster monitoring -> watchdog heartbeat | The watchdog itself is still alive and checking in |
Together, those signals are much more trustworthy than either one alone.
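The heartbeat half of that table can be sketched in a few lines of Python. The article does not say how the heartbeat is transported, so everything here is an assumption for illustration: a hypothetical HTTP endpoint inside the cluster that accepts a small JSON body, and a staleness rule that tolerates a few missed intervals before raising an alarm.

```python
import json
import urllib.request


def build_heartbeat(url: str, source: str, now: float) -> urllib.request.Request:
    """Build the POST a watchdog could send after each round of checks.

    The endpoint URL and the JSON field names are hypothetical.
    """
    body = json.dumps({"source": source, "timestamp": now}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def heartbeat_is_stale(
    last_seen: float, now: float, interval: float, allowed_misses: int = 3
) -> bool:
    """True once more than `allowed_misses` intervals passed without a heartbeat."""
    return now - last_seen > interval * allowed_misses
```

Tolerating a few missed intervals is a design choice: a single lost heartbeat should not page anyone, but a watchdog that has been quiet for several intervals should.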
What the watchdog checks
I keep the checks simple and explicit:
- public service health endpoints or equivalent minimal URLs
- DNS resolution for important hostnames
- TLS validity and impending certificate expiry
- response-time budget, not just up/down state
This is deliberately narrower than full application monitoring. If I want latency histograms or per-route error rates, Prometheus is the right place. If I want to know whether the service is still reachable from outside the cluster without trusting the cluster to tell me, the watchdog is the right place.
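A minimal sketch of checks like these, using only the Python standard library. The function names, the result labels ("dns", "tls", "timeout", and so on), the 5-second timeout, and the 2-second response budget are illustrative assumptions, not values taken from the setup described here.

```python
import socket
import ssl
import time
import urllib.error
import urllib.request

TIMEOUTS = (socket.timeout, TimeoutError)  # the same class on Python 3.10+


def classify(status: int, elapsed: float, budget: float) -> str:
    """Map an HTTP status and elapsed seconds to a result label."""
    if not 200 <= status < 400:
        return "bad_status"
    if elapsed > budget:
        return "slow"  # reachable, but over the response-time budget
    return "ok"


def check_url(url: str, timeout: float = 5.0, budget: float = 2.0) -> tuple[str, str]:
    """Return (label, detail) for one public URL.

    urlopen's timeout applies to each socket operation, not the whole request.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:  # the server answered with an error status
        status = exc.code
    except urllib.error.URLError as exc:  # failure before any HTTP response
        reason = exc.reason
        if isinstance(reason, socket.gaierror):
            return "dns", str(reason)
        if isinstance(reason, ssl.SSLError):
            return "tls", str(reason)
        if isinstance(reason, TIMEOUTS):
            return "timeout", str(reason)
        return "error", str(reason)
    except TIMEOUTS as exc:  # timeout while reading the response
        return "timeout", str(exc)
    except OSError as exc:
        return "error", str(exc)
    elapsed = time.monotonic() - start
    return classify(status, elapsed, budget), f"HTTP {status} in {elapsed:.2f}s"


def cert_days_left(not_after: str, now: float) -> float:
    """Days until expiry, given a notAfter string as returned by getpeercert()."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read the notAfter field of the certificate served at host:port."""
    ctx = ssl.create_default_context()  # verifies the chain and the hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Separating `classify` and `cert_days_left` from the network calls keeps the decision logic easy to test without touching the network.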
Why I like Telegram for this role
I wanted something easy to receive on a phone, easy to automate, and not dependent on setting up yet another mail or paging system. Telegram is a good fit for that:
- straightforward bot API
- fast delivery
- readable messages on mobile
- low friction for a small operational setup
The important part is not the specific chat platform. The important part is that the alert message explains itself clearly enough that I do not have to mentally reconstruct the system at the moment I receive it.
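For reference, sending a message through the Telegram Bot API needs little more than one HTTPS POST to the `sendMessage` method. The sketch below uses only the Python standard library; the helper names are hypothetical, and the token and chat ID are placeholders you would supply yourself.

```python
import urllib.parse
import urllib.request

TELEGRAM_API = "https://api.telegram.org"
MAX_TEXT = 4096  # Telegram's length limit for a single message


def build_telegram_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a sendMessage request without sending it."""
    url = f"{TELEGRAM_API}/bot{token}/sendMessage"
    form = {"chat_id": chat_id, "text": text[:MAX_TEXT]}
    data = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def send_telegram(token: str, chat_id: str, text: str, timeout: float = 10.0) -> bool:
    """Send the message; return False instead of raising if delivery fails."""
    request = build_telegram_request(token, chat_id, text)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False
```

Returning `False` rather than raising means a Telegram outage cannot crash the watchdog loop, although a real setup would probably want to log the failure as well.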
What makes the alerts useful instead of noisy
The external watchdog helped me reinforce a monitoring principle I already believed: alerts should explain what the failure probably means and what to check first.
A useful alert says something like:
- which public service failed
- how it failed (timeout, DNS failure, bad status, TLS issue)
- what the first operator check should be
That is much better than a generic "service down" message, especially when the point of the watchdog is to be the first signal you receive.
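One way to build such a message is a small lookup from failure type to first operator check. The failure labels and the hint texts below are hypothetical examples, not the wording used in the setup described here; the point is only the shape of the message.

```python
# Hypothetical first-check hints, keyed by failure type.
FIRST_CHECKS = {
    "dns": "Confirm the DNS records for this hostname at the DNS provider.",
    "tls": "Inspect the certificate served on the public hostname and its expiry.",
    "timeout": "Check whether the load balancer and ingress answer from outside.",
    "bad_status": "Look at the ingress controller and the pods behind this service.",
    "slow": "Compare with in-cluster latency to see where the time is spent.",
}
DEFAULT_CHECK = "Start with the ingress and the most recent change."


def format_alert(service: str, kind: str, detail: str) -> str:
    """Render an alert that says what failed, how, and where to look first."""
    hint = FIRST_CHECKS.get(kind, DEFAULT_CHECK)
    return (
        f"[watchdog] {service} failed its external check\n"
        f"How: {kind} ({detail})\n"
        f"First check: {hint}"
    )
```

Calling `format_alert("shop.example.invalid", "dns", "name not resolved")` yields a three-line message that can be read on a phone without opening a dashboard.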
Why I did not replace the in-cluster monitoring
The external watchdog is not a substitute for Prometheus, Grafana, node metrics, or per-service health instrumentation. It is a complement.
Prometheus answers questions like:
- which node is filling up?
- which pod restarted?
- did the request rate collapse gradually or instantly?
- is this a localized workload issue or a broader cluster issue?
The watchdog answers a much simpler question:
- if I were outside the cluster right now, would the service look alive?
That division of labor is exactly why the pairing works.
The real value during incidents
The biggest benefit is not prettier monitoring architecture. It is incident classification.
When the external watchdog reports a failure and the internal cluster metrics are also sick, I know I am dealing with something broad.
When the external watchdog reports a failure but the internal metrics still look normal, I immediately suspect:
- DNS
- ingress
- certificate
- external routing
- origin reachability from outside the cluster
That narrows the response path quickly.
Likewise, if the cluster starts alerting that the external watchdog heartbeat has stopped, I know the problem may be on the watchdog side rather than the service side, and that my external view is blind until the watchdog is back.
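The classification logic in this section is small enough to write down as a decision function. This is a sketch of the reasoning, not a tool from the setup described here; the three boolean inputs and the returned labels are assumptions chosen for illustration.

```python
def classify_incident(external_ok: bool, internal_ok: bool, heartbeat_ok: bool) -> str:
    """Suggest where to look first, given the three signals."""
    if not heartbeat_ok:
        # Without a heartbeat, the external result cannot be trusted either way.
        return "watchdog: heartbeat missing, check the watchdog host first"
    if external_ok and internal_ok:
        return "none: both views agree the service is healthy"
    if not external_ok and not internal_ok:
        return "broad: both views report trouble, start with the cluster itself"
    if not external_ok:
        return "edge: suspect DNS, ingress, certificate, or external routing"
    return "internal: still reachable from outside, follow the internal alert"
```

The heartbeat is evaluated first on purpose: a silent watchdog makes the external signal meaningless, so it has to be ruled out before the other two are compared.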
What I would recommend to small-cluster operators
If you self-manage a small production cluster, my recommendation is not "build a second observability platform." It is "give yourself one independent external truth source."
The minimum useful version can be very small:
- a handful of critical URL checks
- explicit timeouts
- one alert channel
- one heartbeat back into the cluster
That is enough to cover a whole class of failures that purely internal monitoring does not see clearly.
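Put together, one round of that minimum version could look like the sketch below. The check, alert, and heartbeat functions are passed in as parameters, so the loop makes no assumption about how each is implemented; the `run_once` name and the `(kind, detail)` result shape are hypothetical.

```python
from typing import Callable, Iterable, Tuple

Check = Callable[[str], Tuple[str, str]]  # url -> (kind, detail); kind "ok" means healthy


def run_once(
    urls: Iterable[str],
    check: Check,
    alert: Callable[[str], None],
    heartbeat: Callable[[], None],
) -> int:
    """Run every check once, alert on each failure, then send one heartbeat."""
    failures = 0
    for url in urls:
        try:
            kind, detail = check(url)
        except Exception as exc:  # a crashing check counts as a failure
            kind, detail = "error", repr(exc)
        if kind != "ok":
            failures += 1
            alert(f"{url}: {kind} ({detail})")
    # The heartbeat reports that the watchdog ran, not that the checks passed.
    heartbeat()
    return failures
```

Run from a scheduler such as cron or a systemd timer, or wrapped in a loop with a sleep, this covers the four items in the list above.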
Final thought
The reason I like an external watchdog is not that it is sophisticated. It is that it is independent.
Operationally, independence is what gives the signal value. A cluster that says "I am healthy" is useful information. Independent confirmation that the cluster is reachable from the outside is better information. When both views agree, I trust them more. When they disagree, I know where to investigate next.
That is the real job of the watchdog: not to replace the internal monitoring stack, but to stop it from being the only voice in the room.