By Vicente Arteaga Gomez
MisLinux · Last updated: May 5, 2026
This post is part of my Kubernetes on Hetzner and operations series on MisLinux. It describes the follow-up system I now prefer for ongoing operational questions.
An incident is not really over just because the page stopped firing.
Some of the most important work happens after the immediate pressure is gone:
- verify whether the fix actually held
- confirm second-order metrics recovered
- watch for regressions a few days later
- record which questions are still open
If I do not keep those as live follow-up items, they turn into vague intentions that disappear under the next urgent task.
What I want a follow-up file to contain
| Section | Purpose |
|---|---|
| Original issue | why this item exists |
| Analysis | what the investigation actually found |
| Actions taken | what changed already |
| Future checks | what still needs to be watched |
| Artifact links | where the evidence lives |
That is enough to turn a lingering concern into an inspectable queue item.
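As a rough illustration, here is a minimal Python sketch that renders an empty follow-up file from the five sections in the table above. The helper name `render_followup`, the `## ` heading style, the `Opened:` date line and the `TODO` placeholders are my own assumptions for the example, not a fixed format.

```python
from datetime import date

# Section names taken from the table; the order and heading level are assumptions.
SECTIONS = [
    "Original issue",
    "Analysis",
    "Actions taken",
    "Future checks",
    "Artifact links",
]


def render_followup(title: str, opened: date) -> str:
    """Return a Markdown skeleton for one follow-up file (hypothetical helper)."""
    lines = [f"# {title}", "", f"Opened: {opened.isoformat()}", ""]
    for section in SECTIONS:
        # Each section gets a level-2 heading and a placeholder to fill in later.
        lines.append(f"## {section}")
        lines.append("")
        lines.append("TODO")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_followup("Example follow-up", date(2026, 5, 5)))
```

Writing the result to a new `.md` file the moment an issue needs watching is the whole point; the skeleton just removes the excuse of a blank page.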
Why one Markdown file is often enough
I do not need a huge tracker to get value here.
One file per follow-up works because it is:
- easy to diff
- easy to link from a registry page
- easy to archive later
- easy to extend when the story changes
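To make the "easy to link from a registry page" point concrete, here is a minimal sketch that builds a Markdown list of links from a folder of follow-up files. It assumes one `.md` file per follow-up, a `# Title` line near the top of each file, and a registry page (hypothetically named `index.md`) living in the same folder; `build_registry` is a hypothetical helper, not an existing tool.

```python
from pathlib import Path


def build_registry(folder: Path, registry_name: str = "index.md") -> str:
    """Return a Markdown bullet list linking every follow-up file in `folder`.

    Assumes the first line starting with '# ' is the title; files without
    one fall back to the file name. The registry page itself is skipped.
    """
    entries = []
    for path in sorted(folder.glob("*.md")):
        if path.name == registry_name:
            continue  # do not list the registry page inside itself
        title = path.stem
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.startswith("# "):
                title = line[2:].strip()
                break
        # Relative links work because the registry sits in the same folder.
        entries.append(f"- [{title}]({path.name})")
    return "\n".join(entries)
```

Regenerating the list from the files, instead of maintaining it by hand, keeps the registry from drifting away from what actually exists.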
What matters is that the follow-up is durable and evidence-backed, not that it looks like an enterprise ticket system.
The real advantage
The biggest benefit is not organization. It is time.
When the same topic resurfaces, I want to answer these questions quickly:
- what exactly happened last time?
- what did we change already?
- what metric were we supposed to keep watching?
- which artifact folder contains the supporting data?
That is much easier with a live follow-up file than with memory plus old shell output.
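For questions like "what metric were we supposed to keep watching?", a small helper that pulls one section out of a follow-up file is often enough. This is a sketch that assumes each section from the table earlier is written as a `## ` heading; `extract_section` is a hypothetical helper.

```python
def extract_section(markdown: str, heading: str) -> str:
    """Return the body under '## <heading>' up to the next '## ' heading.

    Returns an empty string if the heading is missing. Deeper headings
    such as '### ' stay inside the returned body.
    """
    collected = []
    inside = False
    for line in markdown.splitlines():
        if line.startswith("## "):
            if inside:
                break  # reached the next section, stop collecting
            inside = line[3:].strip() == heading
            continue
        if inside:
            collected.append(line)
    return "\n".join(collected).strip()
```

Calling `extract_section(text, "Future checks")` on an old file gives the watch list back in seconds, which is exactly the time saving described above.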
Failure case: the unresolved item that looks resolved
One of the most common mistakes in operational work is treating "the immediate symptom improved" as equivalent to "the issue is closed."
That hides important follow-up questions like:
- did total demand recover or only one slice?
- did the workaround create a new imbalance?
- has a monitoring threshold become misleading?
Those are exactly the questions I want to keep alive explicitly.
What I'd do differently now
I used to write big postmortems and almost no living follow-up notes. What I'd do differently now is create the follow-up file as soon as I know the issue needs days or weeks of observation, even if the initial analysis is not perfect yet.