By Vicente Arteaga Gomez
MisLinux · Last updated: May 5, 2026
This article is part 5 of my MisLinux series on Kubernetes on Hetzner. It is based on my own operating habits and failure-planning mindset, and I am not affiliated with or sponsored by Hetzner.
A Kubernetes cluster is easy to admire when it is new. The real test starts after the initial setup, when routine maintenance becomes the difference between a stable platform and a future outage.
A short incident timeline that keeps me honest
One real failure pattern that reshaped this article was disk pressure building up slowly enough that it felt ignorable until it was suddenly not.
| Stage | What looked true at the time | What was actually happening |
|---|---|---|
| Early drift | "The node still has plenty of space." | orphaned layers and unrotated logs were accumulating quietly |
| Warning phase | "A restart later will probably clean it up." | the write paths were still growing faster than normal cleanup |
| Incident phase | pods start failing / rescheduling gets risky | the node root filesystem is now part of the outage story |
That kind of timeline is why I treat boring weekly checks as operational work, not admin busywork.
Backups are a process, not a checkbox
The first rule of stateful production systems is simple: if restore is untested, backup is incomplete.
For a Hetzner-based cluster, I want a documented answer to these questions:
- what data is backed up
- how often it is backed up
- where those backups are stored
- how long they are retained
- how a restore is performed under pressure
This matters because clusters usually fail in inconvenient ways. A backup strategy that only works when the original node is healthy is not a real recovery plan.
Backup schedule I actually follow
The table below shows my backup strategy for the stateful pieces of a typical small Hetzner cluster, plus the node snapshot that sits alongside them. "Offsite" means written to a NAS or object storage outside the Hetzner project, not a volume snapshot inside the same account.
| What | Frequency | Retention | Tool | Offsite |
|---|---|---|---|---|
| MySQL database | Daily (cronjob) | 14 days | mysqldump + rclone to NAS | Yes |
| PersistentVolume data (media, uploads) | Daily | 30 days | rclone to Hetzner Storage Box | Yes |
| Kubernetes manifests / YAML config | On every change | Git history | Git + Gerrit/GitHub | Yes |
| Hetzner server snapshot (master node) | Weekly | 3 snapshots | Hetzner API automated snapshot | Hetzner internal |
| cert-manager secrets (TLS certs) | On rotation (90-day cycle) | Current + 1 prior | Kubernetes secret backup | Yes |
The most critical row is the database. A lost application can be redeployed from Git. A lost database means real data loss.
Hetzner's built-in server backup (enabled per node, at an extra 20% of the server's monthly price) covers the OS layer. It is not a substitute for application-level database exports: it captures the whole disk rather than the logical data, and restoring a full node is slower and more disruptive than restoring from a logical dump.
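The database row of the schedule could be scripted along the following lines and run from cron or a Kubernetes CronJob. This is a minimal sketch under stated assumptions, not a definitive job: the database name `appdb`, the local directory, and the rclone remote `nas:k8s-backups/mysql` are hypothetical placeholders, and MySQL credentials are assumed to come from an option file such as `~/.my.cnf`.

```shell
#!/bin/sh
# Sketch of a daily logical MySQL backup with an offsite copy and retention.
# Every name below is a hypothetical placeholder; adjust before use.
DB_NAME="${DB_NAME:-appdb}"
BACKUP_DIR="${BACKUP_DIR:-/var/backups/mysql}"
RCLONE_REMOTE="${RCLONE_REMOTE:-nas:k8s-backups/mysql}"
RETENTION_DAYS="${RETENTION_DAYS:-14}"

# Build a dated file name, e.g. appdb-2026-05-05.sql.gz
dump_name() {
  printf '%s-%s.sql.gz\n' "$1" "$2"
}

# Dump, compress, verify, copy offsite, then prune old copies.
# The body runs in a subshell, so a failed step aborts only this function.
run_backup() (
  stamp="$(date -u +%Y-%m-%d)"
  file="$BACKUP_DIR/$(dump_name "$DB_NAME" "$stamp")"
  mkdir -p "$BACKUP_DIR" || exit 1
  # --single-transaction gives a consistent InnoDB dump without locking tables
  mysqldump --single-transaction "$DB_NAME" > "$file.tmp" || exit 1
  gzip -c "$file.tmp" > "$file" || exit 1
  rm -f "$file.tmp"
  # Fail early if the archive is unreadable
  gzip -t "$file" || exit 1
  # Offsite copy
  rclone copy "$file" "$RCLONE_REMOTE" || exit 1
  # Prune remote and local copies older than the retention window
  rclone delete --min-age "${RETENTION_DAYS}d" "$RCLONE_REMOTE" || exit 1
  find "$BACKUP_DIR" -name "$DB_NAME-*.sql.gz" -mtime +"$RETENTION_DAYS" -delete
)
```

Calling `run_backup` once a day covers the dump, the offsite copy, and the 14-day retention in one place. Pruning happens only after a successful upload, so a failing job never deletes the last good copies.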
The restore test I actually do
At least once a quarter, I pick one backup type and restore it to a scratch environment. If the restore fails or produces unexpected output, that is a problem worth finding before an actual incident.
The minimum restore test for a MySQL backup:
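# On a scratch VM or pod with MySQL access:
# Confirm the dump file contains the expected tables
grep -c '^CREATE TABLE' latest_backup.sql
# Load the dump into an empty scratch database, then spot-check a table
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS scratch_db;"
mysql -u root -p scratch_db < latest_backup.sql
mysql -u root -p scratch_db -e "SELECT COUNT(*) FROM publishers;"
If the row count matches what the live database held when the dump was taken, the backup is usable.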
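Before loading a dump at all, a quick structural check catches truncated or empty files. The helpers below are a sketch that assumes an uncompressed dump written by mysqldump with its default comments enabled, so the file contains one `CREATE TABLE` statement per table and ends with a `-- Dump completed` line. The function names and the expected table count are illustrative.

```shell
#!/bin/sh
# Count CREATE TABLE statements in an uncompressed mysqldump file.
dump_table_count() {
  grep -c '^CREATE TABLE' "$1"
}

# Succeed only if the dump holds at least the expected number of tables
# and its last line is mysqldump's completion marker.
# Args: <dump file> <minimum table count>
dump_looks_complete() (
  count="$(dump_table_count "$1")" || count="${count:-0}"
  [ "$count" -ge "$2" ] && tail -n 1 "$1" | grep -q '^-- Dump completed'
)
```

A drill could call `dump_looks_complete latest_backup.sql 12` first and stop there if it fails, where 12 stands in for however many tables the schema really has.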
Monitoring should answer practical questions
I do not want dashboards for their own sake. I want monitoring that helps answer:
- are nodes healthy
- are pods restarting unexpectedly
- are ingress errors increasing
- is disk or memory pressure building up
- are certificate renewals succeeding
- are backups running and completing
Simple, reliable signals beat beautiful but confusing dashboards.
Monitoring checklist I review weekly
| Check | Signal | Tool | Alert threshold |
|---|---|---|---|
| Node CPU/memory | node_cpu_seconds_total, node_memory_MemAvailable_bytes | Prometheus + Grafana | >85% sustained 10 min |
| Node disk usage | node_filesystem_avail_bytes{mountpoint="/"} | node-exporter + Prometheus | >80% (warning), >90% (critical) |
| Pod restart count | kube_pod_container_status_restarts_total | kube-state-metrics | >5 restarts in 1h |
| Ingress error rate | Nginx access log 5xx rate | Prometheus nginx exporter | >1% of requests |
| Certificate expiry | certmanager_certificate_expiration_timestamp_seconds | cert-manager metrics | <21 days to expiry |
| Backup job last success | Custom Pushgateway metric | Cronjob + Pushgateway | >25h since last success |
| Registry image validity | HEAD /v2/image/manifests/tag | Custom checker | Any 404 or 5xx |
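The "Backup job last success" row can be implemented by pushing a timestamp when the job finishes and alerting when that timestamp goes stale. A minimal sketch, assuming a Pushgateway reachable at the hypothetical address `http://pushgateway:9091` and a made-up metric name, `backup_last_success_timestamp_seconds`:

```shell
#!/bin/sh
# Hypothetical Pushgateway address; override via the environment.
PUSHGATEWAY_URL="${PUSHGATEWAY_URL:-http://pushgateway:9091}"

# Format one sample in the Prometheus text exposition format.
success_metric() {
  printf 'backup_last_success_timestamp_seconds %s\n' "$1"
}

# Push the current time for a named job; call this as the last backup step.
push_backup_success() {
  success_metric "$(date +%s)" |
    curl -fsS --data-binary @- "$PUSHGATEWAY_URL/metrics/job/$1"
}

# Return 0 (stale) when the last success is older than max_age seconds.
# Args: <last success epoch> <now epoch> [max_age, default 90000 = 25h]
backup_is_stale() {
  [ $(( $2 - $1 )) -gt "${3:-90000}" ]
}
```

On the Prometheus side, the matching alert expression would be along the lines of `time() - backup_last_success_timestamp_seconds > 90000`. The 25-hour window leaves a daily job an hour of slack before it pages anyone.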
The disk usage check is the one I take most seriously, because a full disk is a silent killer: it stops new writes, causes pod evictions, and can corrupt state without an obvious error message. I learned this the hard way when a worker node accumulated 230GB of unrotated logs and orphaned container layers over 141 days.
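The disk thresholds from the checklist are simple enough to check from a shell as well, which is useful on a node where Prometheus is not reachable. The sketch below assumes the root filesystem is the one that matters; the function names are illustrative.

```shell
#!/bin/sh
# Print the used percentage of a mount point as a bare integer.
# -P forces the portable one-line-per-filesystem df layout.
disk_used_pct() {
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Map a used percentage onto the checklist thresholds:
# above 90 is critical, above 80 is a warning, anything else is ok.
disk_status() {
  if [ "$1" -gt 90 ]; then
    echo critical
  elif [ "$1" -gt 80 ]; then
    echo warning
  else
    echo ok
  fi
}
```

Called as `disk_status "$(disk_used_pct /)"`, it prints `ok`, `warning`, or `critical`, which is enough to drive a cron mail or a quick loop over nodes via SSH.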
The terminal trail I reach for during day-2 incidents
These are the commands I want to have muscle memory for before an incident happens:
kubectl get pods -A
kubectl top nodes
df -h
sudo du -sh /var/lib/rancher/k3s/agent/containerd/* 2>/dev/null | sort -h | tail
sudo journalctl -u k3s-agent -n 200 --no-pager
And when disk pressure is real, the useful clue usually looks uncomfortably plain:
Filesystem Size Used Avail Use%
/dev/sda1 300G 230G 70G 77%
These commands are not glamorous, but they usually tell me quickly whether I am dealing with a cluster problem, a node problem, or a storage/logging problem.
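During an incident it helps to capture that trail in one file instead of scrolling back through a terminal. The following is a sketch, not a finished tool: the output location and function names are assumptions, it is meant to be run as root on the node, and it leaves out the `du` line because that one depends on shell glob expansion.

```shell
#!/bin/sh
# Run a command if its binary exists, and label the output.
run_section() {
  echo "== $*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || echo "(exit status $?)"
  else
    echo "(skipped: $1 not installed)"
  fi
  echo
}

# Write the day-2 trail for this node into one timestamped file
# and print the file's path. Args: [output directory, default /tmp]
collect_node_trail() (
  out="${1:-/tmp}/node-trail-$(date -u +%Y%m%dT%H%M%SZ).txt"
  {
    run_section kubectl get pods -A
    run_section kubectl top nodes
    run_section df -h
    run_section journalctl -u k3s-agent -n 200 --no-pager
  } > "$out"
  echo "$out"
)
```

A failing or missing command is recorded rather than aborting the run, because a partial snapshot is still more useful than none when the node is already unhealthy.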
The routine checks I do not like skipping
There are a few checks that feel boring right up until the day they save time:
- confirm backups actually completed and are readable
- review pods with repeated restarts instead of normalizing them
- scan certificate expiry before a customer reports it
- verify the last upgrade notes still match current reality
These are not advanced operations, but they are exactly the kind of small discipline that keeps a cluster from drifting into fragile territory.
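The restart review in particular is easy to script. The filter below is a small sketch that reads the output of `kubectl get pods -A --no-headers` on stdin; the function name and the default threshold of 5 are assumptions. It looks at the cumulative restart count since the pod was created, not a per-hour rate.

```shell
#!/bin/sh
# Read `kubectl get pods -A --no-headers` output on stdin and print
# namespace/pod pairs whose restart count exceeds the threshold.
# Columns: NAMESPACE NAME READY STATUS RESTARTS AGE
# Args: [threshold, default 5]
pods_over_restarts() {
  awk -v max="${1:-5}" '$5 + 0 > max + 0 { print $1 "/" $2 " restarts=" $5 }'
}
```

Typical use is `kubectl get pods -A --no-headers | pods_over_restarts 5`. An empty result is the boring outcome I am hoping for.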
Upgrades deserve a playbook
Every cluster will eventually need Kubernetes upgrades, node replacement, package updates, and configuration changes. If those operations are handled ad hoc, risk accumulates silently.
I prefer a written upgrade playbook with a clear order:
- review release notes and dependencies
- back up anything stateful
- drain and upgrade the least critical worker nodes first
- verify workload health after each change
- keep rollback options visible
The k3s upgrade sequence that has not surprised me
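For k3s-based clusters, the supported upgrade order is server first, agents second: the k3s upgrade documentation says to upgrade server nodes before agent nodes, and the Kubernetes version skew policy does not allow a kubelet to be newer than the API server. With a single master, the sequence is:
# 1. Upgrade the master (k3s server) first
ssh master-node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.29.x sh -s -
# 2. Confirm the control plane is healthy and reports the new version
kubectl get nodes
# 3. Drain the worker node gracefully
kubectl drain worker-node-0 --ignore-daemonsets --delete-emptydir-data
# 4. SSH into the worker and upgrade k3s-agent
#    (re-run the installer with the same K3S_URL, K3S_TOKEN, and flags used at install time)
ssh worker-node-0
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.29.x sh -s - agent
# 5. Verify the worker rejoined, then allow workloads back onto it
kubectl get nodes
kubectl uncordon worker-node-0
Upgrading the master first does mean a short control-plane restart, but pods already running on the workers keep serving during it. Upgrading workers first to a newer minor version would leave their kubelets ahead of the API server, which is the unsupported combination. Once the master is healthy, workers are drained and upgraded one at a time so capacity never drops by more than one node.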
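The "verify the node rejoined" step is worth scripting so that it is a check rather than a glance. The sketch below reads the output of `kubectl get nodes --no-headers` on stdin; the function name is hypothetical, and the version string in the usage example is only a sample value.

```shell
#!/bin/sh
# Read `kubectl get nodes --no-headers` output on stdin.
# Columns: NAME STATUS ROLES AGE VERSION
# Succeeds only if the named node is Ready (a cordoned node shows
# "Ready,SchedulingDisabled" and still counts) and reports the expected version.
# Args: <node name> <expected version>
node_upgraded() {
  awk -v node="$1" -v want="$2" '
    $1 == node {
      found = 1
      split($2, state, ",")
      ok = (state[1] == "Ready" && $5 == want)
    }
    END { exit !(found && ok) }
  '
}
```

For example, `kubectl get nodes --no-headers | node_upgraded worker-node-0 v1.29.4+k3s1 && kubectl uncordon worker-node-0` only uncordons the node once it is back and on the intended version.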
The failure case I plan around
The day-2 mistake I fear most is not a dramatic total outage. It is the quieter one where the cluster has been degrading for days:
- backups are still running but nobody has restored one recently
- a pod has been restarting often enough to normalize the warning away
- disk usage is growing, but not fast enough to feel urgent
- the last upgrade notes no longer match the real runtime
That is why I like routine checks that answer "what changed?" before they answer "what failed?"
Node replacement should feel routine
One of the healthiest signs in a cluster is that replacing a worker node feels boring. If node recreation is stressful, too much of the platform is trapped in undocumented manual state.
That is why reproducible node provisioning matters so much. When a worker is rebuilt, the result should be predictable: same configuration shape, same security posture, same cluster expectations.
I provision all nodes from cloud-init YAML. The cloud-init template configures the OS, installs k3s, registers with the cluster, and sets the correct node labels and k3s arguments. A replacement node that provisions correctly in under 10 minutes is a good sign. One that requires manual post-setup steps is a warning that something needs to be captured in the template.
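One way to keep the k3s arguments and node labels out of someone's head is to have the provisioning step write them into the k3s config file before the installer runs. The helper below is a sketch of a script a cloud-init template could call, not the template itself. It assumes k3s reads `/etc/rancher/k3s/config.yaml` with keys that mirror its CLI flags; the server URL, token source, and labels shown are hypothetical.

```shell
#!/bin/sh
# Write a k3s agent config file whose keys mirror the CLI flags
# --server, --token, and --node-label.
# Args: <path> <server-url> <token> <label> [<label>...]
write_k3s_agent_config() (
  path="$1"
  server="$2"
  token="$3"
  shift 3
  mkdir -p "$(dirname "$path")" || exit 1
  {
    printf 'server: "%s"\n' "$server"
    printf 'token: "%s"\n' "$token"
    echo 'node-label:'
    for label in "$@"; do
      printf '  - "%s"\n' "$label"
    done
  } > "$path" || exit 1
  # The join token is a secret, so keep the file private
  chmod 600 "$path"
)
```

A provisioning run might call `write_k3s_agent_config /etc/rancher/k3s/config.yaml https://master-node:6443 "$K3S_TOKEN" role=worker` and then run the installer in agent mode. Because the settings live in a file, rebuilding or upgrading the node does not depend on anyone remembering the original command line.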
What I'd do differently now
If I were writing the first day-2 playbook again, I would elevate two things earlier:
- a restore drill calendar instead of a vague "we should test backups soon"
- a short node-forensics checklist kept next to the upgrade notes, not in my head
Those two habits have saved me more time than many fancier monitoring improvements.
What I want written down before an incident happens
Before the cluster has a bad day, I want a short operational record that answers:
- who can restore critical data
- where the last successful backups live
- how to rotate a node safely
- what to check first when ingress or DNS behaves oddly
- which changes happened recently enough to be suspicious
These notes are not glamorous, but they stop an incident from becoming a memory test.
Operational discipline compounds
What I like about small clusters is that good habits are easier to build early. The same is true of bad habits. If backup checks, upgrade notes, and routine verification are skipped in a five-node cluster, the cost of that neglect only grows later.
Day-2 operations are not glamorous, but they are where trust in the platform is earned.
Series note
This is part 5 of the series, and it is the article where the cluster stops being a deployment target and becomes an operational commitment. The next article steps back from implementation detail and covers cost, tradeoffs, and the situations where I would choose something other than Kubernetes on Hetzner.