
Friday, May 22, 2026

What Registry Garbage Collection Actually Costs in Operator Time

By Vicente Arteaga Gomez

MisLinux · Last updated: May 5, 2026

This is part of my Kubernetes-on-Hetzner-and-operations series on MisLinux. It is an operator-focused post about self-hosted registry maintenance, not a product review.


When people talk about registry garbage collection, they usually talk about storage.

I think the more painful cost is operator time.

Not because cleanup commands are long, but because safe cleanup turns into a proof exercise:

  • which tags are actually still referenced?
  • which ones only *look* referenced?
  • are the manifest children still readable?
  • which tags are part of failover or rollback paths?

The delete is fast. The confidence is expensive.

The real work around cleanup

This is the table I wish more registry discussions started with:

Phase          | Operator work
Before cleanup | classify runtime vs proof tags, inspect manifests, check active deployments
During cleanup | keep the destructive path narrow and explicit
After cleanup  | prove that kept tags still resolve, pull, and match the intended runtime

That is why I now think of registry GC as a maintenance workflow, not a command.

Why tag existence is not enough

One of the nastiest cases in self-hosted registries is the "present but broken" tag:

  • the tag still resolves
  • the manifest still exists
  • one child descriptor or blob is gone
  • the next pull fails where nobody expected it

That means post-cleanup verification has to be deeper than "the tag name still appears."
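
A minimal sketch of what "deeper" can look like, assuming a registry that speaks the standard /v2/ HTTP API without authentication and a host with curl and jq. The REGISTRY value and the function names list_children and check_ref are placeholders of mine, not part of any tool:

```shell
# Sketch: walk a tag down to its blobs over the registry HTTP API.
# Assumptions: REGISTRY is a placeholder URL, no auth is required,
# and curl and jq are installed.
REGISTRY="${REGISTRY:-https://registry.example.invalid}"
ACCEPT='Accept: application/vnd.oci.image.index.v1+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.docker.distribution.manifest.v2+json'

# Read a manifest or index on stdin; print "manifest <digest>" for each
# per-platform child and "blob <digest>" for the config and each layer.
list_children() {
  jq -r '
    ((.manifests // [])[] | "manifest \(.digest)"),
    (([.config] + (.layers // [])) | map(select(. != null)) | .[] | "blob \(.digest)")
  '
}

# Return non-zero and name the missing piece if anything under repo:ref is gone.
check_ref() {
  local repo ref manifest children kind digest rc
  repo="$1"; ref="$2"; rc=0
  manifest="$(curl -fsS -H "$ACCEPT" "$REGISTRY/v2/$repo/manifests/$ref" </dev/null)" || {
    echo "MISSING manifest $repo $ref"; return 1; }
  children="$(printf '%s' "$manifest" | list_children)"
  while read -r kind digest; do
    case "$kind" in
      manifest) check_ref "$repo" "$digest" || rc=1 ;;
      blob) curl -fsS -o /dev/null -I "$REGISTRY/v2/$repo/blobs/$digest" </dev/null || {
              echo "MISSING blob $repo $digest"; rc=1; } ;;
    esac
  done <<EOF
$children
EOF
  return "$rc"
}

# Usage (not run here): check_ref app runtime-tag
```

A HEAD request only shows that the registry still knows the blob; it does not verify content, so an actual pull remains the final proof.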

The diagram I keep in mind

[Figure: registry operator cost diagram]

The chart uses the cleanup workflow phase on the X axis and a relative operator-effort score on the Y axis. It is not a benchmark. It is a way to show where the real time goes: keep-set classification, manifest/blob proof, and post-cleanup validation.

The key lesson is that operator cost grows with uncertainty about reachability and rollback, not with the number of characters in the command line.

The command trail I actually trust

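# 1. List what production and failover still reference
# (pod specs live under .spec.template for deploy/ds and under
#  .spec.jobTemplate.spec.template for cronjobs)
kubectl get deploy,ds,cronjob -A -o json \
  | jq -r '.items[] | (.spec.template.spec // .spec.jobTemplate.spec.template.spec) | (.containers + (.initContainers // []))[].image' \
  | sort -u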

# 2. Inspect manifest structure, not only tag names
./inspect-registry-manifest.sh runtime-tag --check-blobs

# 3. Run the cleanup only after the keep-set is explicit
registry garbage-collect /etc/distribution/config.yml

# 4. Re-run manifest and pull proof on the kept tags
./inspect-registry-manifest.sh runtime-tag --check-blobs
docker pull registry.example.invalid/app:runtime-tag

If I skip step 4, I have not actually finished the maintenance.

Here is what those commands are actually doing:

  • kubectl get ... asks Kubernetes which images the live workloads reference right now, so the cleanup keep-set starts from reality instead of memory
  • inspect-registry-manifest.sh --check-blobs walks the manifest and child blobs so a tag that merely looks present does not get mistaken for a healthy runtime artifact
  • registry garbage-collect ... is the destructive phase itself
  • the final inspect-registry-manifest.sh plus docker pull are the proof that the kept runtime path still resolves from the registry all the way down to pullable content
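
One way to narrow the destructive phase further is to rehearse it first. The garbage-collect subcommand accepts --dry-run, and the sketch below compares a saved dry-run log against the keep-set. The helper name keepset_collisions and the file names are hypothetical, and the assumption that deletion candidates appear on lines containing "eligible for deletion" should be checked against the output of your registry version:

```shell
# Hypothetical helper: print every keep-set digest that the dry-run log
# marks for deletion. Exit status is 0 only when a collision is found.
#   $1 = keep-set file, one sha256 digest per line
#   $2 = saved output of the dry run
# Assumption: deletion candidates are logged on lines containing
# "eligible for deletion"; verify this for your registry version.
keepset_collisions() {
  grep 'eligible for deletion' "$2" \
    | grep -oE 'sha256:[0-9a-f]{64}' \
    | sort -u \
    | grep -xFf "$1"
}

# Usage (not run here):
#   registry garbage-collect --dry-run /etc/distribution/config.yml > gc-dryrun.log 2>&1
#   if keepset_collisions keep-digests.txt gc-dryrun.log; then
#     echo "STOP: the dry run would delete something in the keep-set"
#   fi
```

An empty result here is a precondition for step 3, not a replacement for step 4.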

One additional check matters a lot in practice: monitor the exact images and tags the Kubernetes workloads are using, and verify that those tags are still resolvable in the registry. That catches registry drift early, before the next restart turns a hidden integrity problem into a live outage.
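
That drift check can be sketched as a small loop over the images the pods are running. This is an illustration under assumptions: split_image_ref and report_unresolvable are names I made up, references are expected to carry an explicit registry host, digest-pinned references (name@sha256:...) are not handled, and the registry is assumed to be reachable over HTTPS without auth:

```shell
# Sketch: report workload image tags that no longer resolve in the registry.
ACCEPT='Accept: application/vnd.oci.image.index.v1+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.docker.distribution.manifest.v2+json'

# Print "host repo tag" for a reference such as host[:port]/repo/path:tag.
# A reference without a tag is treated as "latest".
split_image_ref() (
  ref="$1"
  host="${ref%%/*}"
  rest="${ref#*/}"
  case "$rest" in
    *:*) repo="${rest%:*}"; tag="${rest##*:}" ;;
    *)   repo="$rest"; tag="latest" ;;
  esac
  printf '%s %s %s\n' "$host" "$repo" "$tag"
)

# Read image references on stdin; print the ones whose manifest HEAD fails.
report_unresolvable() {
  while read -r image; do
    [ -n "$image" ] || continue
    set -- $(split_image_ref "$image")
    curl -fsS -o /dev/null -I -H "$ACCEPT" "https://$1/v2/$2/manifests/$3" \
      </dev/null 2>/dev/null || echo "UNRESOLVABLE $image"
  done
}

# Usage (not run here):
#   kubectl get pods -A -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' \
#     | sort -u | report_unresolvable
```

Run on a schedule, any UNRESOLVABLE line is worth an alert, because the affected pod keeps running only until its next restart or reschedule.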

Failure case: cleanup without path awareness

The most dangerous cleanup is not "too aggressive" in the abstract. It is cleanup that does not know:

  • which tags the live cluster uses
  • which tags the standby/failover path uses
  • which per-architecture descriptors need direct protection

That is how storage maintenance becomes runtime breakage.
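
A minimal sketch of path awareness, assuming the live and standby references have already been exported to plain text files with one image:tag per line. The function names build_keepset and deletable_tags and all file names are hypothetical:

```shell
# Merge any number of reference lists (live, standby, rollback, ...) into
# one de-duplicated keep-set, dropping blank lines.
build_keepset() {
  cat "$@" | sed '/^[[:space:]]*$/d' | sort -u
}

# Print the cleanup candidates ($1) that the keep-set ($2) does NOT protect.
# Refuses to answer when the keep-set is empty, because an empty keep-set
# usually means the export failed, not that nothing is in use.
deletable_tags() {
  [ -s "$2" ] || { echo "refusing: keep-set $2 is empty" >&2; return 2; }
  grep -vxFf "$2" "$1"
}

# Usage (not run here):
#   build_keepset live-images.txt standby-images.txt > keepset.txt
#   deletable_tags cleanup-candidates.txt keepset.txt
```

This only protects tags by name. Per-architecture descriptors are addressed by digest, so they need the same treatment with a digest list.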

Why operator time dominates

The expensive part is not CPU. It is:

  • reading the current runtime references
  • classifying proof tags versus active tags
  • checking manifest reachability
  • confirming rollback paths were not silently broken

That is real maintenance work even if the registry disk frees space quickly.

What I'd do differently now

I used to think of registry GC as a storage task with a verification step. What I'd do differently now is describe it the other way around: it is a verification-heavy operational workflow that happens to reclaim storage at the end.