By Vicente Arteaga Gomez
MisLinux · Last updated: April 20, 2026
This article is part 9 of my MisLinux series on Kubernetes on Hetzner. Everything here comes from a real cluster I operate. I am not affiliated with or sponsored by Hetzner, Prometheus, Grafana Labs, or any other vendor mentioned in this post.
When monitoring feels stable for a while, it becomes easy to think of it as solved infrastructure. That is exactly when it tends to surprise you.
I had a moment recently where Grafana suddenly stopped showing the historical Prometheus data I expected. The dashboards still loaded. Fresh data was still arriving. But the multi-day view that had been visible earlier was effectively gone. What made this more dangerous was that nothing had "failed" in the obvious sense: Prometheus was up, Grafana was up, and queries still returned current values.
That kind of failure is worse than a full outage. A full outage is noisy. Silent history loss makes you think the system is working while the evidence you need for trend analysis, incident review, or storage growth comparisons has already disappeared.
What I saw first
The symptom was simple: a Grafana dashboard that had shown several days of data earlier in the morning suddenly only had fresh points. It would have been easy to blame Grafana, because Grafana is where the problem became visible. But Grafana was just the messenger.
The first useful clue was storage. Disk usage on the Prometheus PVC had dropped from a level that looked normal for a populated time-series database to something dramatically smaller. That told me the data had not merely become hard to query. It had actually been removed from the TSDB.
The second clue was timing. This happened after a Prometheus recovery path that involved WAL replay and compaction pressure. In other words, this was not random data drift. It was a storage and retention interaction.
The actual cause
The root cause was the combination of two settings that each looked reasonable on its own:
- a small on-disk size cap for the Prometheus TSDB
- a retention target that implied more history than that size cap could safely hold after replay and compaction
In practice, Prometheus recovered, replayed WAL state, rebuilt blocks, and then immediately had to delete older blocks because the effective database size was now above the configured limit. From Prometheus's point of view, this was correct behavior. From an operator's point of view, it was a trap.
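A toy model makes this interaction easier to see. The sketch below is not Prometheus's implementation. It assumes, as I understand the Prometheus storage documentation, that the WAL counts toward the total size while only persisted blocks are deleted to get back under the cap. The `Block` class, the function name, and every number are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A persisted TSDB block in this toy model (hypothetical fields)."""
    day: int      # lower number = older data
    size_mb: int


def apply_size_retention(blocks, wal_mb, cap_mb):
    """Drop the oldest blocks until blocks + WAL fit under the cap.

    Returns (kept_blocks, deleted_blocks). The WAL is counted in the
    total but is never trimmed here; only persisted blocks are removed.
    """
    kept = sorted(blocks, key=lambda b: b.day)  # oldest first
    deleted = []
    while kept and sum(b.size_mb for b in kept) + wal_mb > cap_mb:
        deleted.append(kept.pop(0))
    return kept, deleted


# Illustrative numbers only: five daily blocks of 300 MB under a 2000 MB cap.
blocks = [Block(day=d, size_mb=300) for d in range(1, 6)]

# Steady state: a small WAL, so everything fits and nothing is deleted.
kept, deleted = apply_size_retention(blocks, wal_mb=200, cap_mb=2000)
print("steady state:", len(kept), "kept,", len(deleted), "deleted")

# After a recovery with a much larger WAL, the same blocks no longer fit.
kept, deleted = apply_size_retention(blocks, wal_mb=1200, cap_mb=2000)
print("after replay: kept days", [b.day for b in kept],
      "deleted days", [b.day for b in deleted])
```

The point of the model is that nothing about the blocks changed between the two calls. Only the space taken by recovery state changed, and that alone was enough to push the oldest history out.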
The lesson is that retention by time and retention by size are not interchangeable. When both limits are set, Prometheus enforces whichever one is reached first. If you say "keep N days" but the size budget cannot physically hold N days of your scrape volume, the size cap wins and the oldest blocks are deleted. That is especially easy to underestimate on a cluster where monitoring scope gradually expands: more exporters, more labels, more synthetic metrics, more per-node detail, more per-service probes.
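A rough way to check whether a time target and a size cap agree is the rule of thumb from the Prometheus storage documentation: needed disk space is approximately retention time in seconds, times ingested samples per second, times bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies that formula. The series count, scrape interval, 15-day target, and 2 GiB cap are made-up inputs, not values from my cluster, and the result ignores WAL and compaction headroom.

```python
SECONDS_PER_DAY = 86_400


def needed_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Approximate block storage needed to hold `retention_days` of data."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


def days_that_fit(cap_bytes, samples_per_second, bytes_per_sample=2.0):
    """Approximate days of history that fit under a size cap."""
    return cap_bytes / (SECONDS_PER_DAY * samples_per_second * bytes_per_sample)


# Hypothetical workload: 50,000 active series scraped every 30 seconds.
samples_per_second = 50_000 / 30

wanted_bytes = needed_disk_bytes(15, samples_per_second)
fits_days = days_that_fit(2 * 1024**3, samples_per_second)

print(f"15 days needs roughly {wanted_bytes / 1024**3:.1f} GiB")
print(f"a 2 GiB cap holds roughly {fits_days:.1f} days")
```

If the second number is smaller than the retention window you configured, the time setting is decorative and the size cap is the real policy.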
Why this is easy to miss on a small cluster
On a small Kubernetes cluster, it is tempting to cap Prometheus storage aggressively because you do not want the monitoring stack competing with production workloads. That instinct is correct. The mistake is assuming a low storage cap is safe just because the cluster itself is small.
Small clusters often collect a surprisingly wide mix of data:
- host metrics from every node
- Kubernetes object state
- ingress and application health checks
- Pushgateway-fed metrics from cron jobs and watchdogs
- custom operational counters that answer "did this thing actually happen?"
Individually, none of these feels large. Together, they create a TSDB shape that grows in bursts and compacts in ways that are hard to reason about from intuition alone.
What I changed afterward
I changed two things.
First, I stopped treating the storage limit as a rough guess. I treated it as a capacity contract. If I wanted multi-day history for the metrics that matter operationally, the size cap had to reflect real ingestion and compaction behavior, not a convenient round number.
Second, I separated the question of how much history I want from how much history I can afford. Those are related, but they are not the same design decision.
My post-incident rules now look like this:
- keep enough retention to support real incident comparison, not aspirational observability
- size the TSDB for the real metric set plus recovery headroom
- verify what happens after restart and replay, not only during steady-state uptime
- watch disk trends for Prometheus specifically instead of assuming Grafana will expose the problem early enough
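The last rule can be turned into a small check. The sketch below is one possible shape for it, not what I run verbatim. It assumes you have already fetched three values from Prometheus's own metrics, which I believe are exposed as `prometheus_tsdb_storage_blocks_bytes`, `prometheus_tsdb_retention_limit_bytes`, and the increase of `prometheus_tsdb_size_retentions_total` since the previous check. Verify those names against your own `/metrics` endpoint. The function name and the 80% threshold are hypothetical.

```python
def retention_warnings(blocks_bytes, limit_bytes, size_retentions_delta,
                       warn_ratio=0.8):
    """Return human-readable warnings about size-based retention pressure.

    blocks_bytes:          bytes currently used by persisted blocks
    limit_bytes:           configured size cap (0 means no cap is set)
    size_retentions_delta: size-triggered deletions since the last check
    warn_ratio:            fraction of the cap that counts as "too close"
    """
    warnings = []
    if limit_bytes > 0:
        ratio = blocks_bytes / limit_bytes
        if ratio >= warn_ratio:
            warnings.append(f"blocks use {ratio:.0%} of the size cap")
    if size_retentions_delta > 0:
        warnings.append(
            f"{size_retentions_delta} size-based deletions since last check"
        )
    return warnings


# Illustrative values only.
for message in retention_warnings(
    blocks_bytes=1_900_000_000,
    limit_bytes=2_000_000_000,
    size_retentions_delta=3,
):
    print("WARNING:", message)
```

Any size-based deletion is worth knowing about on its own, because it means the size cap, not the time window, decided what history survived.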
The operational mistake I would avoid next time
The mistake was not "using a small disk." The mistake was assuming a one-time retention choice would stay valid while the monitoring system kept evolving.
Prometheus configuration often grows in operationally healthy ways:
- you add a better exporter
- you add a new per-service alert
- you add a custom metric for a problem that used to be invisible
- you add a standby or external watchdog path
Each one makes the monitoring better. Each one also changes the storage profile. If you do not revisit the retention budget after those improvements, you eventually get a misleadingly "healthy" monitoring stack that cannot retain the history your dashboards imply.
What I now consider the minimum safe review loop
If I were setting this up again on a small cluster, I would review these four things together instead of separately:
| Check | Why it matters |
|---|---|
| TSDB size cap vs. actual usage | Prevents size-based retention from silently defeating the intended history window |
| WAL replay behavior after restart | Reveals whether recovery itself can trigger aggressive block deletion |
| Per-target scrape scope | Keeps metric growth tied to actual operational value |
| Dashboard time ranges people rely on | Prevents the UI from implying more retained history than Prometheus can really provide |
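The last row of the table is the easiest one to automate. As a minimal sketch, assuming you keep a list of the default time ranges your dashboards use and know how much history Prometheus currently retains, you can flag dashboards that promise more than the TSDB can deliver. The dashboard names, the range format, and the three-day retained window below are all hypothetical.

```python
import re

UNIT_SECONDS = {"m": 60, "h": 3_600, "d": 86_400, "w": 604_800}


def parse_range(text):
    """Convert a simple range such as '24h' or '7d' into seconds."""
    match = re.fullmatch(r"(\d+)([mhdw])", text)
    if not match:
        raise ValueError(f"unsupported range: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def overpromising_dashboards(dashboard_ranges, retained_seconds):
    """Return names of dashboards whose range exceeds retained history."""
    return sorted(
        name
        for name, time_range in dashboard_ranges.items()
        if parse_range(time_range) > retained_seconds
    )


# Hypothetical dashboards and their default time ranges.
dashboards = {
    "node-overview": "7d",
    "ingress-health": "24h",
    "storage-growth": "30d",
}

# Assume Prometheus currently retains about three days of data.
for name in overpromising_dashboards(dashboards, retained_seconds=3 * 86_400):
    print(f"{name}: default range is longer than retained history")
```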
This is not glamorous work, but it is the kind of detail that decides whether your monitoring helps during an incident or merely looks professional between incidents.
What this changed in how I think about monitoring
Before this, I already believed a small cluster should run lean monitoring. After this, I think lean monitoring has to include lean retention design, not only lean scrape configuration.
It is easy to spend time tuning exporters and alert rules while leaving storage behavior as a default plus one arbitrary disk number. That is not enough. The retention model is part of the monitoring design, not a background implementation detail.
That is also why I prefer practical monitoring over "collect everything." The more honest question is not "can I scrape this?" It is "will I still trust this system after restart, replay, compaction, and a week of real production behavior?"
Final thought
Losing Prometheus history was frustrating, but it was also clarifying. It forced me to stop thinking of monitoring data as something that passively accumulates and start treating it like any other production storage system with real capacity and lifecycle constraints.
If you run your own small cluster, my recommendation is simple: test the recovery path, inspect the size-based retention interaction, and do not wait for a missing graph to tell you your monitoring history was more fragile than you thought.
If you want the broader monitoring architecture around this story, read my earlier post on lean monitoring for a small Kubernetes cluster on Hetzner. This article is the more specific follow-up: what happened once that monitoring stack had to survive real operational pressure rather than just look correct on day one.