MisLinux: Lean Monitoring for a Small Kubernetes Cluster on Hetzner


By Vicente Arteaga Gomez

MisLinux · Last updated: April 6, 2026

This article is part 8 of my MisLinux series on Kubernetes on Hetzner. All configurations described here are based on my own setup and real operational experience. I am not affiliated with or sponsored by Hetzner, Grafana Labs, or any other tool mentioned.

A cluster that runs without visibility is just a time bomb with a pleasant uptime. After getting the basics working on a small Hetzner k3s cluster, the next practical step is knowing whether the services running on it are actually healthy, and getting a notification before a problem becomes an incident.

This post covers how I built a lean monitoring stack that fits comfortably on modest virtual machines without consuming the CPU or disk budget that the actual workloads need.

What I wanted from monitoring

Before choosing tools I wrote down a short list of practical questions I needed answered without manual inspection:

Is each workload responding? Do the health checks pass?
Are pods crashing or restarting unexpectedly?
Is any node running low on disk space?
Are requests being served, or has traffic dropped to zero?
Did a new node join the cluster correctly?
Is the external watchdog alive and checking in?

I also wanted a notification channel I could use from a phone without installing anything complicated. Telegram works well for this.

The stack: Prometheus, Grafana, and node_exporter

For a small cluster, the standard combination of Prometheus (metrics collection), node_exporter (host-level metrics per node), and Grafana (dashboards and alerting) is hard to beat. It is widely documented, has reasonable resource consumption when tuned carefully, and integrates directly with Kubernetes through service discovery.

Resource discipline

On modest hardware, a few settings matter:

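Prometheus scrape interval: I use 60 seconds rather than the 15 seconds that Prometheus's example configuration and many tutorials set (the built-in default, when nothing is configured, is 1 minute). For a services-focused cluster this is more than adequate, and disk write load drops sharply compared with a 15-second interval.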

Retention: I keep 15 days of raw data. Older data adds little to the day-to-day questions I actually ask. Prometheus compacts data automatically; I just make sure the retention flag (--storage.tsdb.retention.time) is set explicitly rather than relying on the default.

node_exporter collectors: I run only the collectors I actually use and leave out ones such as bcache, drbd, and infiniband, which produce nothing relevant on a plain virtual machine. Passing --collector.disable-defaults and re-enabling collectors selectively keeps the cardinality low.

Grafana data retention: Grafana itself stores dashboards and alert state but not time-series data. The dashboards are kept in a ConfigMap so they survive pod restarts without a separate persistent volume.
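To get a rough feel for what the scrape interval and retention settings cost on disk, the Prometheus documentation offers a rule of thumb: needed disk space is roughly retention seconds × ingested samples per second × bytes per sample, at about 1 to 2 bytes per sample. The sketch below applies that formula. The series count of 20,000 is an illustrative assumption, not a measurement from this cluster.

```python
def estimate_tsdb_bytes(active_series: int,
                        scrape_interval_s: float,
                        retention_days: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough Prometheus disk estimate: retention * samples/s * bytes/sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    SERIES = 20_000  # hypothetical number of active series
    for interval in (15, 60):
        size_gib = estimate_tsdb_bytes(SERIES, interval, retention_days=15) / 2**30
        print(f"{interval:>2}s interval: ~{size_gib:.1f} GiB for 15 days")
```

Whatever the real series count is, going from 15 to 60 seconds cuts the number of stored samples to a quarter, which is where most of the saving comes from.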

What gets scraped

The standard target set for a k3s cluster:

node_exporter on each node (CPU, memory, disk, network)
kubelet and API server metrics, which k3s exposes on its standard ports
Any application that exposes a /metrics endpoint, added through a ServiceMonitor or a static scrape config
A custom Pushgateway for metrics that cannot be scraped directly (batch jobs, external health checks)

For CPU usage per node, node_cpu_seconds_total with rate() over 5 minutes gives a smooth enough signal to detect saturation without false alarms on short spikes.
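As a minimal sketch of that CPU signal, the snippet below builds a "busy percentage per node" query from node_cpu_seconds_total and reads the result through the Prometheus HTTP API (/api/v1/query). The exact PromQL expression and the base URL are my assumptions for illustration; the dashboard itself may use a different form.

```python
import json
import urllib.parse
import urllib.request

# PromQL: busy CPU percentage per node, averaged over all cores, 5-minute rate.
CPU_BUSY_QUERY = (
    '100 * (1 - avg by (instance) '
    '(rate(node_cpu_seconds_total{mode="idle"}[5m])))'
)


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_vector(payload: dict) -> dict[str, float]:
    """Map each 'instance' label to its value from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        sample["metric"].get("instance", "unknown"): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


def cpu_busy_percent(base_url: str) -> dict[str, float]:
    """Fetch per-node CPU busy percentage (needs a reachable Prometheus)."""
    url = build_query_url(base_url, CPU_BUSY_QUERY)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_vector(json.load(resp))
```

Calling cpu_busy_percent("http://prometheus:9090") (a hypothetical in-cluster address) would return one value per node, which is handy for a quick check from a script when Grafana is not at hand.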

Dashboards: one unified view

I keep one main dashboard with panels for each concern:

Node health: CPU, memory, disk per node side by side
Pod restarts: a table showing any pod with more than 0 restarts in the last hour
Ingress error rate: HTTP 4xx and 5xx rates from the nginx ingress controller
Health endpoint status: custom panels that poll each service's /health endpoint and show a green/red state
Request rate: the requests-per-second metric from the main production service, shown as a time-series so drops are immediately visible
Bandwidth: inbound and outbound bytes per node, useful for estimating how close you are to the included monthly traffic allocation

I also keep individual per-service dashboards linked from the main one. This keeps the top-level view clean while allowing a drill-down when something looks wrong.

Bandwidth and cost tracking

Hetzner includes a monthly traffic allowance with each server, and exceeding it costs extra. I track node_network_transmit_bytes_total on the public interface of each node, total it with Prometheus's increase() function, and display the monthly figure as a panel alongside a threshold line. One caveat: increase() only sees data that is still within retention, so with 15 days kept, a window longer than that under-counts.
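One way to turn a shorter window of transmit data into a monthly figure is plain linear extrapolation. The sketch below shows the arithmetic; the 20 TB allowance, the 7-day window, and the eth0 interface name in the comment are placeholders for illustration, so substitute the values that apply to your own servers.

```python
def project_monthly_bytes(bytes_in_window: float, window_days: float,
                          days_in_month: int = 30) -> float:
    """Linearly extrapolate traffic seen in a window to a full month."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return bytes_in_window / window_days * days_in_month


def allowance_used(projected_bytes: float, allowance_bytes: float) -> float:
    """Fraction of the monthly allowance the projection would consume."""
    return projected_bytes / allowance_bytes


if __name__ == "__main__":
    TB = 10**12
    ALLOWANCE = 20 * TB  # hypothetical allowance; check your own plan
    # e.g. the result of:
    #   increase(node_network_transmit_bytes_total{device="eth0"}[7d])
    seen = 1.4 * TB
    projected = project_monthly_bytes(seen, window_days=7)
    used = allowance_used(projected, ALLOWANCE)
    print(f"projected: {projected / TB:.1f} TB ({used:.0%} of allowance)")
```

A linear projection overreacts to a single busy day early in the window, so it works better as an early warning than as an exact bill estimate.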

For cost, I pull the daily Hetzner billing data through their API and push it via Pushgateway. This gives a simple bar chart of daily spend, which is useful for catching unexpected charges early.
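The push side of this can be sketched as follows. The Pushgateway accepts metrics in the Prometheus text exposition format at /metrics/job/<job>. The metric name hetzner_daily_cost_eur, the job name, and the gateway URL are made up for this example, and the cost value is passed in as a parameter because how it is fetched from the billing side is not shown here.

```python
import urllib.request
from typing import Optional


def render_gauge(name: str, help_text: str, value: float,
                 labels: Optional[dict] = None) -> str:
    """Render one gauge in the Prometheus text exposition format.

    Label values are inserted as-is; values containing quotes or
    backslashes would need escaping first.
    """
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )


def push(gateway_url: str, job: str, body: str) -> int:
    """PUT the body to /metrics/job/<job>, replacing that group's metrics."""
    req = urllib.request.Request(
        f"{gateway_url.rstrip('/')}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    body = render_gauge("hetzner_daily_cost_eur", "Daily spend in EUR.", 1.25)
    print(body, end="")
    # push("http://pushgateway:9091", "hetzner_billing", body)  # hypothetical URL
```

PUT replaces everything previously pushed for that job, which suits a value that is recomputed in full on every run.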

Alerting: Grafana rules with Telegram delivery

Grafana's built-in alerting can evaluate PromQL expressions and send notifications. I use it instead of the Prometheus Alertmanager to keep the stack simpler.

Telegram is a convenient delivery channel: no email server required, messages arrive on mobile immediately, and the Bot API is straightforward to configure. In Grafana: add a Contact Point of type Telegram, paste the bot token and chat ID, and notifications start flowing.
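Grafana handles delivery itself once the contact point exists. For scripts that need to send their own messages, the underlying call is the Bot API's sendMessage method, which takes a chat_id and a text. A minimal sketch using only the standard library follows; the token and chat ID shown in the usage note are placeholders.

```python
import json
import urllib.request


def build_telegram_request(bot_token: str, chat_id: str,
                           text: str) -> urllib.request.Request:
    """Build a sendMessage request for the Telegram Bot API."""
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"https://api.telegram.org/bot{bot_token}/sendMessage",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_telegram(bot_token: str, chat_id: str, text: str) -> bool:
    """Send a message; returns the 'ok' field of the API response."""
    req = build_telegram_request(bot_token, chat_id, text)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return bool(json.load(resp).get("ok", False))
```

For example, send_telegram("123456:PLACEHOLDER", "-1000000000", "test message") would post to the chat once real credentials are substituted. Keeping the request builder separate makes it possible to inspect the URL and payload without touching the network.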

What alerts I actually run

Keeping the alert list short and meaningful avoids alarm fatigue:

Disk above 80% on any node — gives time to act before a full disk causes silent failures
No requests in the last 30 minutes on any production service that should have traffic — this catches outages that would otherwise go unnoticed overnight
Pod restart count increasing — anything more than 2 restarts in an hour gets flagged
Health endpoint not returning 200 — a simple up/down alert for each public service
External watchdog not checking in — see the next section
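Two of these conditions can be written down concretely. The PromQL strings below are a sketch of how the disk and restart rules might look. They assume node_exporter's filesystem metrics and kube-state-metrics' restart counter, and kube-state-metrics is an assumption on my part since it is not listed in the stack above. The Python functions mirror the same logic in simplified form so the thresholds can be checked by hand.

```python
# PromQL counterparts (metric names assume node_exporter and kube-state-metrics):
DISK_RULE = (
    '100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} '
    '/ node_filesystem_size_bytes{mountpoint="/"}) > 80'
)
RESTART_RULE = "increase(kube_pod_container_status_restarts_total[1h]) > 2"


def disk_used_percent(avail_bytes: float, size_bytes: float) -> float:
    """Percentage of the filesystem that is no longer available."""
    return 100.0 * (1.0 - avail_bytes / size_bytes)


def restarts_in_window(samples: list[int]) -> int:
    """Total restarts across ordered counter samples, tolerating resets.

    Simplified version of increase(): no extrapolation to window edges.
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero.
        total += (cur - prev) if cur >= prev else cur
    return total


if __name__ == "__main__":
    print(disk_used_percent(avail_bytes=15e9, size_bytes=80e9) > 80)  # disk alert?
    print(restarts_in_window([3, 3, 4, 6]) > 2)                       # restart alert?
```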

Writing alert messages that explain themselves

The most useful discipline I adopted was writing alert messages as if the reader has never seen the system before. Each message includes:

Where it comes from (which cluster, which service)
What it means in plain terms (the service is down / latency is elevated / traffic fell unexpectedly)
What to check first (a specific command, a specific URL, or a specific Kubernetes namespace)

A message that just says "alert firing" is not useful at 2am. A message that says "Production VAST service on the Hetzner cluster is receiving zero requests — likely the nginx ingress pod has restarted. Check: kubectl get pods -n production" is actionable immediately.
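The three-part structure is easy to enforce in code when alerts are sent from a script. Below is a small sketch; the Alert class and its field names are my own invention for illustration, not part of Grafana or of any library.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    source: str       # which cluster and service the alert comes from
    meaning: str      # what it means in plain terms
    first_check: str  # command, URL, or namespace to look at first

    def render(self) -> str:
        """Format the alert so it explains itself to a first-time reader."""
        return f"[{self.source}] {self.meaning}\nCheck first: {self.first_check}"


if __name__ == "__main__":
    alert = Alert(
        source="Hetzner cluster / production",
        meaning="The service is receiving zero requests.",
        first_check="kubectl get pods -n production",
    )
    print(alert.render())
```

Making all three fields required means an alert without a "what to check first" hint cannot be constructed at all.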

The external watchdog: a check from outside the cluster

A monitoring stack that lives entirely inside the cluster it monitors has a structural blind spot: if the cluster loses connectivity, or if a misconfiguration blocks all traffic, the alerting fails silently too.

I run a small Docker container on a home NAS device as an external watchdog. Its job is simple: poll each critical service URL every minute, and send a Telegram message if any check fails or times out. If the NAS container itself stops sending its periodic heartbeat to the cluster, the cluster alerts that the external watchdog is gone.

This gives a two-way health check:

NAS → Cluster: the NAS polls the cluster from outside and alerts if service URLs stop responding
Cluster → NAS: the cluster checks whether the NAS watchdog has recently posted its heartbeat, and alerts if it has not

Neither direction alone is sufficient. Together they cover the main failure scenarios.

What the watchdog checks

Each public service URL (health endpoint, or a minimal response that confirms the service is alive)
TLS certificate validity (flagging within 14 days of expiry)
DNS resolution for production hostnames
Response time — a timeout on connection or response is treated as a failure

All checks include explicit timeouts: connection timeout separate from response timeout. A service that connects but hangs indefinitely is a different kind of problem than one that refuses the connection, and both need to be caught.
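A minimal sketch of a check with separate connection and response timeouts, using only the Python standard library, might look like this. The function name, the default timeout values, and the URL in the usage note are assumptions for illustration. The actual watchdog container may be built quite differently.

```python
import http.client
from urllib.parse import urlsplit


def check_url(url: str, connect_timeout: float = 3.0,
              read_timeout: float = 10.0) -> tuple[bool, str]:
    """Return (ok, reason); connect and response phases use separate timeouts."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn_cls = http.client.HTTPSConnection
    else:
        conn_cls = http.client.HTTPConnection
    conn = conn_cls(parts.hostname, parts.port, timeout=connect_timeout)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        try:
            # DNS lookup, TCP connect, and (for https) the TLS handshake.
            conn.connect()
        except OSError as exc:
            return False, f"connect failed: {exc}"
        # From here on the longer timeout applies to each socket operation.
        conn.sock.settimeout(read_timeout)
        try:
            conn.request("GET", target)
            status = conn.getresponse().status
        except (OSError, http.client.HTTPException) as exc:
            return False, f"no response: {exc}"
        return status == 200, f"HTTP {status}"
    finally:
        conn.close()
```

A call such as check_url("https://example.com/health") returns a pair like (True, "HTTP 200"), and the reason string distinguishes a refused or timed-out connection from a service that accepted the connection and then hung. That distinction can go straight into the alert text.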

What I learned building this

A few things that took more than one attempt to get right:

Alert thresholds need tuning based on real traffic patterns. A "no requests in 30 minutes" alert needs to account for legitimate low-traffic periods (nights, weekends). I adjusted the alert to trigger only when traffic drops significantly below the rolling average for the same time window the previous day. This eliminates false alarms without hiding real outages.
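One possible reading of that rule, sketched in Python with the PromQL shape in a comment: compare the current request rate with the rate for the same window one day earlier, and skip the comparison when the baseline itself is too quiet to be meaningful. The metric name http_requests_total, the 0.3 ratio, and the minimum-baseline floor are illustrative assumptions, not the values used on this cluster.

```python
# PromQL shape of the same idea (http_requests_total is a placeholder name):
#   sum(rate(http_requests_total[30m]))
#     < 0.3 * sum(rate(http_requests_total[30m] offset 1d))


def traffic_dropped(current_rps: float, baseline_rps: float,
                    ratio: float = 0.3,
                    min_baseline_rps: float = 0.05) -> bool:
    """True when traffic is well below the same window one day earlier."""
    if baseline_rps < min_baseline_rps:
        # Yesterday was quiet at this hour too; silence is not an anomaly.
        return False
    return current_rps < ratio * baseline_rps


if __name__ == "__main__":
    print(traffic_dropped(current_rps=0.0, baseline_rps=2.0))   # daytime outage
    print(traffic_dropped(current_rps=0.0, baseline_rps=0.01))  # quiet night
```

The floor on the baseline is what keeps nights and weekends from paging anyone, while a drop to zero during a normally busy hour still fires.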

Disk alerts should fire early. An 80% threshold sounds conservative but on a busy registry node with several large container images, it is easy to cross from 80% to 95% in hours. I also added a separate alert for fast-growing directories.

The Pushgateway is useful but needs housekeeping. Metrics pushed via Pushgateway persist until explicitly deleted. A job that fails silently and stops pushing will show stale metrics rather than absence. I treat any Pushgateway metric older than 10 minutes for a frequently-running job as a signal of failure, not as the last known good value.
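The Pushgateway records the time of the last successful push for each group in its own push_time_seconds metric, which makes the staleness rule straightforward to express. A sketch of the check follows, with the PromQL equivalent in a comment; the job name nightly_backup is a placeholder.

```python
import time
from typing import Optional

# PromQL equivalent (job name is a placeholder):
#   time() - push_time_seconds{job="nightly_backup"} > 600


def is_stale(push_time_s: float, now_s: Optional[float] = None,
             max_age_s: float = 600.0) -> bool:
    """True when the last push is older than max_age_s (default 10 minutes)."""
    if now_s is None:
        now_s = time.time()
    return (now_s - push_time_s) > max_age_s


if __name__ == "__main__":
    last_push = time.time() - 15 * 60  # pretend the job last pushed 15 minutes ago
    print("stale" if is_stale(last_push) else "fresh")
```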

Resource limits on Prometheus and Grafana pods matter. Without explicit limits, Prometheus will happily consume several gigabytes of memory on a small node if scrape interval and retention are not tuned. I set memory requests and limits explicitly and check them after any scrape configuration change.

The result

After tuning, the monitoring stack runs comfortably alongside production workloads on modest Hetzner virtual machines. Dashboards load quickly. Alerts arrive within 2 minutes of an actual failure. The external watchdog has already caught two incidents where the internal cluster monitoring would have been silent.

The key constraint I kept in mind throughout was that monitoring should cost less than what it protects. On a small cluster, that means being deliberate about scrape intervals, retention, and the number of metrics collected — not maximizing observability for its own sake.

For a small self-managed cluster, this is enough. It answers the questions that matter, and it does not add a maintenance burden that competes with the actual work.

---

*Next in the series: Handling multi-architecture images in a Kubernetes cluster where nodes run different CPU architectures.*
