MisLinux: My Cold-Standby Disaster Recovery Plan for a Single Critical Service

By Vicente Arteaga Gomez

MisLinux · Last updated: April 20, 2026

This article is part 13 of my MisLinux series on Kubernetes on Hetzner. It is based on a real production service and the recovery planning around it. I am not affiliated with Hetzner, Cloudflare, Netcup, or any vendor mentioned here.

If you run a small platform, "disaster recovery" can become a vague comfort word very quickly. It is easy to say you can rebuild from source, recreate the cluster, restore the database, and recover the service. It is much harder to define what you will actually do if the main environment is gone or unreachable and a critical public endpoint needs to come back fast.

That difference is why I ended up designing a very explicit cold-standby plan for one service instead of pretending the whole platform needed full hot-failover sophistication.

Why I focused on one service first

Not every service deserves the same recovery model.

Some services are annoying to lose. Some are expensive to lose. A smaller number are central enough that their absence immediately becomes a business problem.

For me, the right first DR target was a single critical serving path. Narrowing the scope that far let me ask concrete recovery questions and answer them honestly:

what exact artifact does this service need to start?
where can I store that artifact outside the cluster?
how would I run the same runtime elsewhere if Hetzner were unavailable?
what manual step am I willing to take for cutover?

Those are much healthier questions than "how do I make the whole cluster magically highly available?"

Why I chose cold standby instead of active-active

I did not want to build a complicated active-active or automatic failover design for a small environment unless the business case clearly justified it.

The cost of more advanced failover is not just extra infrastructure. It is also operational ambiguity:

which side is authoritative?
how do you keep config perfectly in sync?
what happens during split-brain conditions?
how do you test the failover path without risking production?

For one critical service, a cold standby with good artifacts and a documented cutover path was a better fit.

My design goal became:

> keep one externally runnable copy of the service ready enough that I can start it from a known-good snapshot and then cut DNS over manually if the primary environment is gone.

What the standby actually needs

Once I stripped away the vague language, the required ingredients were simpler than I expected:

the exact runtime image or equivalent runnable artifact
the last known-good serving configuration
a minimal host or VM where that runtime can start
a TLS front door
a manual DNS cutover path

That is the core of the plan. Everything else is support structure.
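
One way to keep those five ingredients honest is to write them down as data and check that none of them is still blank. The sketch below is illustrative only: the `StandbyManifest` class, its field names, and the placeholder values are assumptions made for this example, not a description of the real setup.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class StandbyManifest:
    """The five ingredients of the standby, each named explicitly."""

    runtime_artifact: str   # exact image or runnable artifact, stored off-cluster
    config_snapshot: str    # last known-good serving configuration
    standby_host: str       # minimal host or VM where the runtime can start
    tls_front_door: str     # whatever terminates TLS in front of the service
    dns_cutover_step: str   # the manual operator action that moves traffic


def missing_ingredients(manifest: StandbyManifest) -> list[str]:
    """Return the names of ingredients that are still blank."""
    return [
        f.name for f in fields(manifest)
        if not getattr(manifest, f.name).strip()
    ]


if __name__ == "__main__":
    # Placeholder values only; none of these are real paths or hosts.
    manifest = StandbyManifest(
        runtime_artifact="offsite/service-image.tar",
        config_snapshot="offsite/serving-config.snapshot",
        standby_host="standby.example.net",
        tls_front_door="",
        dns_cutover_step="runbook: point the public record at the standby",
    )
    print("still undefined:", missing_ingredients(manifest))
```

A blank field is a useful signal in itself: it marks the part of the plan that still exists only as an intention.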

Why the config snapshot matters so much

The most important insight was that disaster recovery for a service often depends more on the last known-good runtime state than on the source code.

Source code is necessary, but it is not the immediate recovery substrate in a crisis. If the live service depends on a generated configuration that reflects database-backed objects, rules, or mappings, then the practical recovery question is:

> Do I have the actual config the service was serving from?

If the answer is no, then "we can rebuild it from the database" only works if the whole control plane and upstream data path are still available during the incident. That is not a disaster assumption I want to build around.

So I prefer taking validated off-cluster snapshots of the real serving config instead.
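
A validated snapshot could look roughly like the following. This is a minimal sketch under stated assumptions: the real config format is not described here, so JSON parsing stands in for whatever validation the service actually needs, and the function names, the `.sha256` sidecar file, and the directory layout are all hypothetical.

```python
import hashlib
import json
import shutil
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large configs do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_json(path: Path) -> bool:
    """Stand-in validator (assumption): the config parses as JSON."""
    try:
        json.loads(path.read_text(encoding="utf-8"))
        return True
    except (OSError, ValueError):
        return False


def take_snapshot(
    live_config: Path,
    snapshot_dir: Path,
    label: str,
    validate: Callable[[Path], bool] = is_valid_json,
) -> Path:
    """Copy the live config to snapshot_dir, but only if it validates."""
    if not validate(live_config):
        raise ValueError(f"refusing to snapshot invalid config: {live_config}")
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    target = snapshot_dir / f"{label}{live_config.suffix}"
    shutil.copy2(live_config, target)
    # Store the checksum next to the copy so it can be re-verified later.
    sidecar = target.parent / (target.name + ".sha256")
    sidecar.write_text(sha256_of(target) + "\n", encoding="utf-8")
    return target


def verify_snapshot(snapshot: Path) -> bool:
    """Check that a stored snapshot still matches its recorded checksum."""
    sidecar = snapshot.parent / (snapshot.name + ".sha256")
    if not sidecar.exists():
        return False
    expected = sidecar.read_text(encoding="utf-8").strip()
    return sha256_of(snapshot) == expected
```

The point of validating before copying is that a broken config never overwrites the idea of "last known-good". The point of the checksum is that the standby can refuse to start from a snapshot that was truncated or altered in storage.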

Why the cutover stays manual

A small platform can talk itself into dangerous automation very easily.

Automatic failover sounds mature, but it only helps if the automation is more reliable than the operator judgment it replaces. For my environment, I do not think that is true yet.

That is why DNS cutover stays manual.

Manual cutover buys me two things:

I decide based on evidence, not on one possibly misleading health signal
I avoid accidental flapping between origins

In practice, that means a documented runbook is more valuable than a half-trusted automatic failover controller.
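
"Evidence, not one health signal" can be made concrete with a small helper that summarizes several independent observations and then stops. The sketch below is hypothetical: the `Signal` type, the source labels, and the two-signal threshold are assumptions for illustration. It deliberately only returns advice and never touches DNS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    source: str         # where the observation came from (hypothetical label)
    primary_down: bool  # True if this source sees the primary as unavailable


def cutover_advice(signals: list[Signal], min_agreeing: int = 2) -> str:
    """Summarize the evidence for the operator; the decision stays human."""
    down = [s.source for s in signals if s.primary_down]
    if len(down) >= min_agreeing:
        return "CONSIDER CUTOVER: primary reported down by " + ", ".join(down)
    if down:
        return "HOLD: only " + ", ".join(down) + " reports the primary down"
    return "HOLD: no signal reports the primary down"


if __name__ == "__main__":
    observations = [
        Signal("external HTTP probe", True),
        Signal("provider status page", False),
        Signal("probe from a second network", True),
    ]
    print(cutover_advice(observations))
```

Requiring agreement between sources is a cheap guard against acting on a single broken probe, and keeping the output as text rather than an action is what prevents flapping between origins.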

What the standby is meant to prove

The cold standby is not meant to be a second production platform in permanent parallel service. It is meant to prove four things:

Proof goal	Why it matters
The service can start elsewhere from the saved config	Prevents DR from depending on cluster-local state or wishful re-generation
The runtime image is available outside the primary pull path	Avoids turning registry or network dependency into the next blocker
TLS and reverse proxying work on the standby host	Makes cutover realistic instead of theoretical
DNS can be pointed at the standby origin intentionally	Turns the recovery path into an operator action, not a design diagram

That is enough for a credible first DR layer on a small platform.

What I do not want from DR

I do not want the disaster-recovery system to become a second fragile platform with its own hidden dependencies.

That means I try to avoid:

rebuilding from source during the incident
assuming private registries are reachable
depending on interactive browser-only recovery steps
distributing the failover state across too many places

The more moving parts the DR path has, the less likely it is to be trustworthy under actual pressure.
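
A rough back-of-the-envelope calculation shows why. Assume, purely for illustration, that every step in the recovery path must succeed and that the steps fail independently of each other. Then the chance that the whole path works is the product of the per-step chances. The 95% figure below is an invented number, not a measurement.

```python
import math


def path_reliability(step_reliabilities: list[float]) -> float:
    """Probability that every step works, assuming independent steps."""
    for p in step_reliabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return math.prod(step_reliabilities)


if __name__ == "__main__":
    # Hypothetical: each step works 95% of the time under pressure.
    for steps in (4, 10):
        chance = path_reliability([0.95] * steps)
        print(f"{steps} steps: {chance:.2f}")
```

Under those assumptions a four-step path works about 81% of the time, while a ten-step path drops to roughly 60%. Real failures are rarely independent, so the numbers are only a direction, not a prediction.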

The tradeoff I accept

Cold standby is not instant recovery. I accept that.

The real tradeoff is slower recovery in exchange for a simpler, more explicit, more testable path.

For a small team, that is often the right bargain. I would rather have a cold standby that I can explain and rehearse than an elaborate failover story I never fully trust.

Final thought

The most useful change in how I think about DR is this: disaster recovery starts becoming real only when you can name the exact artifact, the exact host, and the exact operator action that would bring the service back.

Until then, it is mostly optimism dressed up as architecture.

For me, the right first step was not platform-wide high availability. It was proving that one critical service could start elsewhere from a real snapshot and be put back behind its public hostname deliberately. That is not the final state. But it is a recovery plan I actually believe.


Wednesday, April 29, 2026
