Friday, April 24, 2026

The Problem With hostPort DaemonSets in Production, and How I Test Them Safely

By Vicente Arteaga Gomez

MisLinux · Last updated: April 20, 2026

This article is part 11 of my MisLinux series on Kubernetes on Hetzner. It reflects my own operational experience. I am not affiliated with Kubernetes, CNCF, or any vendor mentioned here.

[Cover image: hostPort DaemonSet rollout]

There are Kubernetes patterns that are perfectly reasonable until the day they have to be updated under pressure. A public-facing DaemonSet with hostPort is one of them.

I use this kind of pattern for a latency-sensitive service that I want reachable directly on each selected node. That is a valid design choice. The mistake would be pretending it rolls like an ordinary Deployment behind a Service. It does not.

Why hostPort changes the rollout story

When a pod binds a hostPort, it is competing for a real port on the node itself. That means a replacement pod cannot simply appear next to the old one if the old one is still holding the port.

That single detail collapses a lot of the usual Kubernetes comfort:

  • no meaningful same-node blue/green on the same port
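  • no maxSurge rescue: a surge pod cannot be scheduled onto a node whose port the old pod still holds, and if only one node serves that traffic path there is nowhere else for it to go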
  • no assumption that a new pod can come up fully before the old one disappears

If you only have one node serving that workload class, a bad rollout is not a degraded rollout. It is an outage.
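
The port competition described above can be reproduced with plain local sockets. This is a minimal Python sketch and only an analogy: in a cluster the conflict normally shows up earlier, with the scheduler leaving the replacement pod Pending rather than a process failing to bind. The names try_bind and rollout_on_one_port are hypothetical helpers invented for this illustration.

```python
import errno
import socket
from typing import Optional, Tuple


def try_bind(port: int, host: str = "127.0.0.1") -> Optional[socket.socket]:
    """Bind and listen on (host, port). Return the socket, or None if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None  # someone else already owns this port
        raise


def rollout_on_one_port() -> Tuple[bool, bool]:
    """Simulate old and new 'pods' competing for one port on one node.

    Returns (new_started_while_old_running, new_started_after_old_exited).
    """
    old_pod = try_bind(0)  # port 0 asks the OS for any free port
    assert old_pod is not None
    port = old_pod.getsockname()[1]

    # The replacement tries to come up next to the old instance.
    while_old_runs = try_bind(port)

    # Only once the old instance lets go can the replacement take over.
    old_pod.close()
    after_old_exits = try_bind(port)

    result = (while_old_runs is not None, after_old_exits is not None)
    for sock in (while_old_runs, after_old_exits):
        if sock is not None:
            sock.close()
    return result
```

On a typical Linux host rollout_on_one_port() returns (False, True): the replacement only gets the port after the old holder is gone, and that gap is exactly where the traffic drops.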

Why this is easy to underestimate

Kubernetes is so good at making ordinary workloads boring that it is easy to project the same confidence onto special cases.

With a normal Deployment behind a Service, the rollout safety story is familiar:

  1. new pods come up
  2. readiness gates traffic
  3. old pods drain away

With a hostPort DaemonSet on one eligible node, the real story is closer to this:

  1. old pod is removed or becomes unavailable
  2. new pod has to win the host port
  3. new pod still has to complete its own initialization
  4. if anything in that chain is wrong, public traffic drops immediately

That is not a reason to avoid the pattern entirely. It is a reason to stop lying to yourself about the rollout risk.

The specific failure mode I care about most

The worst version of this pattern is not "the pod crashes instantly." That is obvious. The more dangerous case is:

  • the YAML looks correct
  • the image tag looks correct
  • the process starts
  • but the service is not actually ready to answer the public request path yet

That can happen because of a slow startup sequence, a missing generated config, a probe that fires too early, or a runtime process that behaves differently after reload than after cold boot.

In other words, the main risk is not syntax. It is false confidence.

The test pattern I trust now

I no longer apply new behavior directly to the public DaemonSet if I can avoid it. I first prove the runtime path in a parallel test DaemonSet that stays away from the live ports.

The exact details depend on the workload, but the pattern is:

| Step | Purpose |
| --- | --- |
| Keep the public DaemonSet unchanged | Avoid accidental real traffic impact while validating the new image or command path |
| Run a parallel test DaemonSet or equivalent pod on alternate ports | Prove startup, health checks, generated config, and runtime behavior without binding the production port |
| Check the real readiness surface, not only container liveness | Make sure the service answers the same path the live system depends on |
| Promote only after the parallel proof is clean | Treat the public rollout as the final step, not the experiment |

This is less elegant than a normal Deployment rollout, but it is honest about the shape of the risk.
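
One way to keep the parallel test DaemonSet honest is to derive it mechanically from the production manifest instead of maintaining a second copy by hand. The following is a minimal Python sketch under stated assumptions: the manifest is already loaded as a dict, pods are selected through spec.selector.matchLabels, and the workload does not use hostNetwork. The function make_test_daemonset, the edge-proxy names, the image reference, and the 10000 port offset are all hypothetical.

```python
import copy


def make_test_daemonset(prod: dict, port_offset: int = 10000, suffix: str = "-test") -> dict:
    """Return a copy of a DaemonSet manifest, renamed and moved off the live hostPorts."""
    test = copy.deepcopy(prod)  # never mutate the production manifest
    pod_spec = test["spec"]["template"]["spec"]
    if pod_spec.get("hostNetwork"):
        # With hostNetwork the process binds node ports itself, so remapping
        # hostPort in the manifest would not be enough.
        raise ValueError("hostNetwork pods need their ports changed in the app config")

    test["metadata"]["name"] += suffix

    # Give the test pods their own label values so neither the production
    # selector nor anything else keyed on those labels picks them up.
    selector = test["spec"]["selector"]["matchLabels"]
    pod_labels = test["spec"]["template"]["metadata"]["labels"]
    for key in list(selector):
        selector[key] += suffix
        pod_labels[key] = selector[key]

    # Shift every hostPort; containerPort can stay, since each pod has its
    # own network namespace.
    for container in pod_spec["containers"]:
        for port in container.get("ports", []):
            if "hostPort" in port:
                shifted = port["hostPort"] + port_offset
                if not 1 <= shifted <= 65535:
                    raise ValueError(f"hostPort {shifted} is out of range")
                port["hostPort"] = shifted
    return test


# Hypothetical production manifest, reduced to the fields that matter here.
PROD = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "metadata": {"name": "edge-proxy"},
    "spec": {
        "selector": {"matchLabels": {"app": "edge-proxy"}},
        "template": {
            "metadata": {"labels": {"app": "edge-proxy"}},
            "spec": {"containers": [{
                "name": "proxy",
                "image": "registry.example/edge-proxy:1.2.3",
                "ports": [{"containerPort": 8443, "hostPort": 443}],
            }]},
        },
    },
}
```

Calling make_test_daemonset(PROD) yields a manifest named edge-proxy-test that listens on hostPort 10443 and leaves PROD untouched. The candidate image or command would then be set on the copy only, so the experiment never touches the live object.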

What I specifically verify before promotion

My minimum checklist for a hostPort DaemonSet change now includes:

  • image pulls correctly on the target architecture
  • startup script finishes the full initialization path
  • generated config files actually exist and are valid
  • readiness only turns green after the real serving path works
  • process supervision survives reloads, not only cold start
  • the test instance answers a real request, not just a dummy socket check

If I cannot prove those on the parallel test path, the production rollout is not ready.
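
The last item on that checklist is the one most often faked by accident, so here is a minimal Python sketch of the difference between a socket check and a real request. It assumes the test instance speaks plain HTTP on an alternate port and that a healthy response contains a known piece of text; both are assumptions for illustration, as are the function names port_is_open and answers_real_request.

```python
import http.client
import socket
import urllib.request

# Talk to the test instance directly, ignoring any proxy environment variables.
_DIRECT = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Dummy socket check: only proves that something accepted a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def answers_real_request(url: str, expected_text: str, timeout: float = 5.0) -> bool:
    """Real check: the service must return HTTP 200 and the expected body text."""
    try:
        with _DIRECT.open(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and expected_text in body
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib raises HTTPError, an OSError subclass, for 4xx and 5xx).
        return False
```

A service that is still generating its config can easily pass port_is_open while failing answers_real_request, which is the false confidence this article is about. A call might look like answers_real_request("http://127.0.0.1:10443/healthz", "ok"), where the URL and the expected text are placeholders for whatever path the live system really depends on.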

Why readiness probes matter even more here

I used to think of readiness primarily as a traffic-quality control. In this pattern, it is also a self-defense mechanism against human optimism.

If a service takes 90 to 180 seconds to become truly ready because it generates config from a database and only then launches the serving process, a 15-second readiness probe is not "aggressive." It is wrong.

That kind of mismatch is exactly how a Kubernetes manifest can look disciplined while still causing a live outage on rollout.

The broader point is that readiness has to model the real application timeline, not the timeline you wish the application had.
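
One way to read the mismatch above is as arithmetic on the probe's time budget. The sketch below is a rough Python model, assuming the budget is approximately initialDelaySeconds plus periodSeconds times failureThreshold, which is how the Kubernetes documentation describes the startup allowance of a startupProbe. It ignores timeoutSeconds and kubelet timing jitter, and the concrete numbers are illustrative rather than taken from a real manifest.

```python
import math


def probe_budget_seconds(period: int, failure_threshold: int, initial_delay: int = 0) -> int:
    """Approximate time a container gets before the probe gives up on it."""
    return initial_delay + period * failure_threshold


def min_failure_threshold(worst_case_startup: int, period: int, initial_delay: int = 0) -> int:
    """Smallest failureThreshold whose budget covers the worst-case startup time."""
    remaining = max(worst_case_startup - initial_delay, 0)
    return max(math.ceil(remaining / period), 1)


# A probe with a 5-second period and 3 allowed failures budgets 15 seconds.
tight_budget = probe_budget_seconds(period=5, failure_threshold=3)

# Covering a 180-second worst case with a 10-second period needs 18 failures.
needed = min_failure_threshold(worst_case_startup=180, period=10)
```

Here tight_budget is 15 and needed is 18, so the tight probe falls short of a 180-second startup by a factor of twelve. Sizing the budget from the measured worst case, not the typical case, is the point.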

When I still think this pattern is worth using

Even with the rollout risk, I still think hostPort DaemonSets can be the right answer when:

  • each selected node is meant to serve traffic directly
  • you want to avoid an extra hop for the workload
  • node-local identity actually matters
  • the service is operationally narrow and worth the explicit handling

But the price of that pattern is operational ceremony. If you choose it, you are choosing:

  • stricter promotion discipline
  • stronger rollback expectations
  • more careful probe tuning
  • explicit non-production proof before live mutation

That is the actual cost, and it should be counted up front.

What I would not do again

I would not treat "the manifest change looks small" as a reason to skip the test path.

Small changes can break this kind of workload in large ways:

  • a slightly different startup command
  • a different image tag
  • a config generator timing change
  • a probe that turns green too early

The blast radius comes from the serving topology, not from how many lines changed.

Final thought

hostPort DaemonSets are not bad. They are just less forgiving than the default Kubernetes workload patterns people get used to.

If you use one for a public service, the safe mental model is not "Kubernetes will roll this safely for me." The safe mental model is "Kubernetes will do exactly what I asked, so I need to make sure I asked for a promotion flow that deserves production traffic."

That is why I now treat the test DaemonSet as part of the design, not a nice-to-have around the design.