
Wednesday, March 18, 2026

Day-2 Operations: Backups, Monitoring, and Upgrades on Hetzner


By Vicente Arteaga Gomez

MisLinux · Last updated: May 5, 2026

This article is part 5 of my MisLinux series on Kubernetes on Hetzner. It is based on my own operating habits and failure-planning mindset, and I am not affiliated with or sponsored by Hetzner.

A Kubernetes cluster is easy to admire when it is new. The real test starts after the initial setup, when routine maintenance becomes the difference between a stable platform and a future outage.

A short incident timeline that keeps me honest

One real failure pattern that reshaped this article was disk pressure building up slowly enough that it felt ignorable until it was suddenly not.

| Stage | What looked true at the time | What was actually happening |
|---|---|---|
| Early drift | "The node still has plenty of space." | orphaned layers and unrotated logs were accumulating quietly |
| Warning phase | "A restart later will probably clean it up." | the write paths were still growing faster than normal cleanup |
| Incident phase | pods start failing / rescheduling gets risky | the node root filesystem is now part of the outage story |

That kind of timeline is why I treat boring weekly checks as operational work, not admin busywork.

Backups are a process, not a checkbox

The first rule of stateful production systems is simple: if restore is untested, backup is incomplete.

For a Hetzner-based cluster, I want a documented answer to these questions:

  • what data is backed up
  • how often it is backed up
  • where those backups are stored
  • how long they are retained
  • how a restore is performed under pressure

This matters because clusters usually fail in inconvenient ways. A backup strategy that only works when the original node is healthy is not a real recovery plan.

Backup schedule I actually follow

The table below shows the backup strategy for the stateful data in a typical small Hetzner cluster, along with the manifests and node snapshot needed to rebuild it. "Offsite" means written to a NAS or object storage that is outside the Hetzner project, not a volume snapshot inside the same account.

| What | Frequency | Retention | Tool | Offsite |
|---|---|---|---|---|
| MySQL database | Daily (cronjob) | 14 days | mysqldump + rclone to NAS | Yes |
| PersistentVolume data (media, uploads) | Daily | 30 days | rclone to Hetzner StorageBox | Yes |
| Kubernetes manifests / YAML config | On every change | Git history | Git + Gerrit/GitHub | Yes |
| Hetzner server snapshot (master node) | Weekly | 3 snapshots | Hetzner API automated snapshot | Hetzner internal |
| cert-manager secrets (TLS certs) | On rotation (90-day cycle) | Current + 1 prior | Kubernetes secret backup | Yes |

The most critical row is the database. A lost application can be redeployed from Git. A lost database means real data loss.
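The database row above could be implemented roughly as follows. This is a minimal sketch under stated assumptions, not my exact job: the database name `appdb`, the staging directory, and the rclone remote `nas:k8s-backups/mysql` are hypothetical placeholders, MySQL credentials are assumed to come from a `~/.my.cnf` file, and `rclone` is assumed to be configured already.

```shell
#!/bin/sh
# Sketch of a daily MySQL backup job. Every name below is a placeholder.
DB_NAME="${DB_NAME:-appdb}"                      # hypothetical database
BACKUP_DIR="${BACKUP_DIR:-/var/backups/mysql}"   # hypothetical local staging dir
REMOTE="${REMOTE:-nas:k8s-backups/mysql}"        # hypothetical rclone remote:path
RETENTION_DAYS="${RETENTION_DAYS:-14}"           # the 14-day retention from the table

# Build a dated file name, e.g. appdb-2026-03-18.sql.gz
backup_file_name() {
  printf '%s-%s.sql.gz\n' "$1" "$2"
}

run_backup() {
  out="$BACKUP_DIR/$(backup_file_name "$DB_NAME" "$(date +%F)")"
  mkdir -p "$BACKUP_DIR" || return 1
  # Dump to a plain file first so a mysqldump failure is not hidden by a pipe.
  # --single-transaction takes a consistent snapshot of InnoDB tables.
  mysqldump --single-transaction "$DB_NAME" > "${out%.gz}" || return 1
  gzip -f "${out%.gz}" || return 1
  # Refuse to ship a corrupt archive
  gzip -t "$out" || return 1
  # Copy offsite, then apply the retention window on both sides
  rclone copy "$out" "$REMOTE" || return 1
  find "$BACKUP_DIR" -name "$DB_NAME-*.sql.gz" -mtime +"$RETENTION_DAYS" -delete
  rclone delete --min-age "${RETENTION_DAYS}d" "$REMOTE"
}
```

Calling `run_backup` as the last line of the cron job's script, and alerting on a non-zero exit status, is one way to make a failed dump visible instead of silently shipping an empty file.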

Hetzner's built-in server backup (enabled per node, at an extra 20% of the server's monthly price) covers the OS layer. It is not a substitute for application-level database exports: it captures the whole node disk rather than the logical data, and restoring a full node is slower and more disruptive than restoring from a logical dump.
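The "Weekly, 3 snapshots" row in the table implies a pruning step. Here is a minimal sketch of only the retention logic, assuming you can already list snapshots as `<created-timestamp> <id>` pairs. How that list is obtained (Hetzner API or a CLI) is deliberately left out, and the function name is hypothetical.

```shell
#!/bin/sh
# stdin:  one snapshot per line as "<ISO-8601 created time> <snapshot id>"
# stdout: ids of the snapshots that fall outside the newest KEEP entries
snapshots_to_prune() {
  keep="$1"
  # ISO-8601 timestamps in the same timezone sort correctly as plain text
  LC_ALL=C sort -r | awk -v keep="$keep" 'NR > keep { print $2 }'
}
```

With four weekly snapshots on stdin and `snapshots_to_prune 3`, only the id of the oldest one is printed, and that id is what the deletion call would receive.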

The restore test I actually do

At least once a quarter, I pick one backup type and restore it to a scratch environment. If the restore fails or produces unexpected output, that is a problem worth finding before an actual incident.

The minimum restore test for a MySQL backup:

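# On a scratch VM or pod with MySQL access:
# 1. Confirm the dump file contains the expected tables
grep '^CREATE TABLE' latest_backup.sql
# 2. Restore the dump into an empty scratch database
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS scratch_db"
mysql -u root -p scratch_db < latest_backup.sql
# 3. Spot-check a table whose row count you know
mysql -u root -p scratch_db -e "SELECT COUNT(*) FROM publishers;"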

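If the row count matches what the live database reported around the time of the dump, the backup loads cleanly and is at least complete for that table. That is the minimum bar for calling it usable.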
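The first step of that test, confirming that the dump contains the expected tables, can be scripted. This is a minimal sketch, assuming the default mysqldump output in which each table starts with a line like CREATE TABLE `name` (. The function names are hypothetical.

```shell
#!/bin/sh
# Print the table names defined in a mysqldump file, one per line.
dump_tables() {
  sed -n 's/^CREATE TABLE `\([^`]*\)`.*/\1/p' "$1"
}

# Usage: missing_tables DUMP_FILE table1 table2 ...
# Prints every expected table that the dump does not define.
missing_tables() {
  dump="$1"
  shift
  for t in "$@"; do
    dump_tables "$dump" | grep -Fqx "$t" || echo "$t"
  done
}
```

An empty result from `missing_tables latest_backup.sql publishers` means the table is present in the dump. Any output is a list of tables to investigate before trusting the file.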

Monitoring should answer practical questions

I do not want dashboards for their own sake. I want monitoring that helps answer:

  • are nodes healthy
  • are pods restarting unexpectedly
  • are ingress errors increasing
  • is disk or memory pressure building up
  • are certificate renewals succeeding
  • are backups running and completing

Simple, reliable signals beat beautiful but confusing dashboards.

Monitoring checklist I review weekly

| Check | Signal | Tool | Alert threshold |
|---|---|---|---|
| Node CPU/memory | node_cpu_seconds_total, node_memory_MemAvailable_bytes | Prometheus + Grafana | >85% sustained 10 min |
| Node disk usage | node_filesystem_avail_bytes{mountpoint="/"} | node-exporter + Prometheus | >80% (warning), >90% (critical) |
| Pod restart count | kube_pod_container_status_restarts_total | kube-state-metrics | >5 restarts in 1h |
| Ingress error rate | Nginx access log 5xx rate | Prometheus nginx exporter | >1% of requests |
| Certificate expiry | certmanager_certificate_expiration_timestamp_seconds | cert-manager metrics | <21 days to expiry |
| Backup job last success | Custom Pushgateway metric | Cronjob + Pushgateway | >25h since last success |
| Registry image validity | HEAD /v2/image/manifests/tag | Custom checker | Any 404 or 5xx |
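The "Backup job last success" row depends on the backup job reporting its own completion. This is a minimal sketch, assuming a Pushgateway reachable at a URL you supply. The metric name `backup_last_success_timestamp_seconds` and the job label are my placeholders here, not an established convention.

```shell
#!/bin/sh
# Push the current time as the "last success" marker for a backup job.
# Usage: push_backup_success http://pushgateway.example:9091 mysql_backup
push_backup_success() {
  gateway="$1"
  job="$2"
  printf 'backup_last_success_timestamp_seconds %s\n' "$(date +%s)" |
    curl -fsS --data-binary @- "$gateway/metrics/job/$job"
}

# Succeeds (exit 0) when the last success is older than MAX_AGE seconds.
# Usage: backup_is_stale LAST_SUCCESS_EPOCH NOW_EPOCH MAX_AGE
backup_is_stale() {
  [ "$(( $2 - $1 ))" -gt "$3" ]
}
```

The 25-hour threshold is 90000 seconds. It leaves a daily job one hour of slack before the alert fires, so an occasional slow run does not page anyone.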

The disk usage check is the one I take most seriously, because a full disk is a silent killer: it stops new writes, causes pod evictions, and can corrupt state without an obvious error message. I learned this the hard way when a worker node accumulated 230GB of unrotated logs and orphaned container layers over 141 days.
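For a node that is not yet scraped by Prometheus, the same thresholds can be checked from a shell. This is a minimal sketch using the 80% and 90% limits from the checklist. The function names are hypothetical.

```shell
#!/bin/sh
# Percentage of the filesystem used at a mount point (default: /), as an integer.
usage_pct() {
  # -P forces the portable one-line-per-filesystem output format
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Map a usage percentage to a severity level.
disk_level() {
  if [ "$1" -gt 90 ]; then
    echo critical
  elif [ "$1" -gt 80 ]; then
    echo warning
  else
    echo ok
  fi
}
```

Running `disk_level "$(usage_pct /)"` prints one of `ok`, `warning`, or `critical`, which is easy to wire into a cron mail or a chat webhook.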

The terminal trail I reach for during day-2 incidents

These are the commands I want to have muscle memory for before an incident happens:

kubectl get pods -A
kubectl top nodes
df -h
sudo sh -c 'du -sh /var/lib/rancher/k3s/agent/containerd/* 2>/dev/null' | sort -h | tail
sudo journalctl -u k3s-agent -n 200 --no-pager

And when disk pressure is real, the useful clue usually looks uncomfortably plain:

Filesystem      Size  Used Avail Use%
/dev/sda1       300G  230G   70G  77%

These commands are not glamorous, but they usually tell me quickly whether I am dealing with a cluster problem, a node problem, or a storage/logging problem.

The routine checks I do not like skipping

There are a few checks that feel boring right up until the day they save time:

  • confirm backups actually completed and are readable
  • review pods with repeated restarts instead of normalizing them
  • scan certificate expiry before a customer reports it
  • verify the last upgrade notes still match current reality

These are not advanced operations, but they are exactly the kind of small discipline that keeps a cluster from drifting into fragile territory.
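The certificate scan in that list can be done from outside the cluster with plain `openssl`, independent of cert-manager's own metrics. This is a minimal sketch, assuming the site answers TLS on port 443. The function names and the host in the usage line are hypothetical.

```shell
#!/bin/sh
days_to_seconds() {
  echo "$(( $1 * 86400 ))"
}

# Succeeds (exit 0) when the certificate served by HOST expires within DAYS,
# or when no certificate could be fetched at all (fail closed).
# Usage: cert_expires_within HOST DAYS
cert_expires_within() {
  host="$1"
  window="$(days_to_seconds "$2")"
  if openssl s_client -connect "$host:443" -servername "$host" </dev/null 2>/dev/null |
       openssl x509 -noout -checkend "$window" >/dev/null 2>&1; then
    return 1   # certificate is still valid beyond the window
  fi
  return 0
}
```

A usage line such as `cert_expires_within www.example.com 21 && echo "renewal needs attention"` mirrors the 21-day threshold from the monitoring checklist. Treating an unreachable host as a failure is a deliberate choice, since a check that cannot fetch the certificate has not proven anything.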

Upgrades deserve a playbook

Every cluster will eventually need Kubernetes upgrades, node replacement, package updates, and configuration changes. If those operations are handled ad hoc, risk accumulates silently.

I prefer a written upgrade playbook with a clear order:

  1. review release notes and dependencies
  2. back up anything stateful
  3. drain and upgrade non-critical nodes first
  4. verify workload health after each change
  5. keep rollback options visible

The k3s upgrade sequence that has not surprised me

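For k3s-based clusters, the safest upgrade order is the server (master) node first, then the agents one at a time, starting with the least critical worker. The sequence:

# 1. Upgrade the server (master) first
# Re-run the installer with the same options used at install time;
# v1.29.x stands in for a full release tag
ssh master-node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.29.x sh -s -

# 2. Verify the control plane is back and all nodes are Ready
kubectl get nodes

# 3. Drain the worker node gracefully
kubectl drain worker-node-0 --ignore-daemonsets --delete-emptydir-data

# 4. SSH into the worker and upgrade k3s-agent
ssh worker-node-0
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.29.x sh -s - agent

# 5. Verify the worker rejoined, then let workloads schedule onto it again
kubectl get nodes
kubectl uncordon worker-node-0

Upgrading the server first does mean a short control-plane restart, but pods that are already running keep serving traffic while the API is briefly unavailable. It also keeps the cluster inside the Kubernetes version skew policy, which does not allow a kubelet to be newer than the API server. Upgrading agents first breaks that rule on any minor-version upgrade, and the k3s documentation likewise recommends upgrading servers before agents.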
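One way to make the verification steps less dependent on eyeballing is to filter the node list for anything that is not Ready or not on the expected version. This is a minimal sketch that reads the output of `kubectl get nodes --no-headers`. The function name is hypothetical, and the version you pass in is whatever full release tag you installed.

```shell
#!/bin/sh
# stdin:  output of `kubectl get nodes --no-headers`
#         (columns: NAME STATUS ROLES AGE VERSION)
# stdout: nodes that are not plainly Ready or not on the wanted version
nodes_needing_attention() {
  awk -v want="$1" '$2 != "Ready" || $5 != want { print $1, $2, $5 }'
}
```

Usage would look like `kubectl get nodes --no-headers | nodes_needing_attention <version>`. A cordoned node reports `Ready,SchedulingDisabled`, so a forgotten `kubectl uncordon` shows up here as well. Empty output means every node is Ready and on the expected version.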

The failure case I plan around

The day-2 mistake I fear most is not a dramatic total outage. It is the quieter one where the cluster has been degrading for days:

  • backups are still running but nobody has restored one recently
  • a pod has been restarting often enough to normalize the warning away
  • disk usage is growing, but not fast enough to feel urgent
  • the last upgrade notes no longer match the real runtime

That is why I like routine checks that answer "what changed?" before they answer "what failed?"
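One way to keep restarts from being normalized away is to list them explicitly. This is a minimal sketch that filters the output of `kubectl get pods -A --no-headers`. The RESTARTS column counts restarts since the pod was created, not per hour, so this is a coarser signal than a rate-based alert. The function name and threshold are hypothetical.

```shell
#!/bin/sh
# stdin:  output of `kubectl get pods -A --no-headers`
#         (columns: NAMESPACE NAME READY STATUS RESTARTS AGE)
# stdout: "namespace/pod restarts" for pods above the threshold
pods_over_restarts() {
  awk -v max="$1" '$5 + 0 > max + 0 { print $1 "/" $2, $5 }'
}
```

Running `kubectl get pods -A --no-headers | pods_over_restarts 5` during the weekly review gives a short list to compare against last week's, which answers "what changed?" more directly than a dashboard glance.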

Node replacement should feel routine

One of the healthiest signs in a cluster is that replacing a worker node feels boring. If node recreation is stressful, too much of the platform is trapped in undocumented manual state.

That is why reproducible node provisioning matters so much. When a worker is rebuilt, the result should be predictable: same configuration shape, same security posture, same cluster expectations.

I provision all nodes from cloud-init YAML. The cloud-init template configures the OS, installs k3s, registers with the cluster, and sets the correct node labels and k3s arguments. A replacement node that provisions correctly in under 10 minutes is a good sign. One that requires manual post-setup steps is a warning that something needs to be captured in the template.

What I'd do differently now

If I were writing the first day-2 playbook again, I would elevate two things earlier:

  1. a restore drill calendar instead of a vague "we should test backups soon"
  2. a short node-forensics checklist kept next to the upgrade notes, not in my head

Those two habits have saved me more time than many fancier monitoring improvements.

What I want written down before an incident happens

Before the cluster has a bad day, I want a short operational record that answers:

  • who can restore critical data
  • where the last successful backups live
  • how to rotate a node safely
  • what to check first when ingress or DNS behaves oddly
  • which changes happened recently enough to be suspicious

These notes are not glamorous, but they stop an incident from becoming a memory test.

Operational discipline compounds

What I like about small clusters is that good habits are easier to build early. The same is true of bad habits. If backup checks, upgrade notes, and routine verification are skipped in a five-node cluster, the cost of that neglect only grows later.

Day-2 operations are not glamorous, but they are where trust in the platform is earned.

Series note

This is part 5 of the series, and it is the article where the cluster stops being a deployment target and becomes an operational commitment. The next article steps back from implementation detail and covers cost, tradeoffs, and the situations where I would choose something other than Kubernetes on Hetzner.