By Vicente Arteaga Gomez
MisLinux · Last updated: May 5, 2026
This is part of my Kubernetes-on-Hetzner-and-operations series on MisLinux. It is an operator-focused post about self-hosted registry maintenance, not a product review.
When people talk about registry garbage collection, they usually talk about storage.
I think the more painful cost is operator time.
Not because cleanup commands are long, but because safe cleanup turns into a proof exercise:
- which tags are actually still referenced?
- which ones only *look* referenced?
- are the manifest children still readable?
- which tags are part of failover or rollback paths?
The delete is fast. The confidence is expensive.
The real work around cleanup
This is the table I wish more registry discussions started with:
| Phase | Operator work |
|---|---|
| Before cleanup | classify runtime vs proof tags, inspect manifests, check active deployments |
| During cleanup | keep the destructive path narrow and explicit |
| After cleanup | prove that kept tags still resolve, pull, and match the intended runtime |
That is why I now think of registry GC as a maintenance workflow, not a command.
Why tag existence is not enough
One of the nastiest cases in self-hosted registries is the "present but broken" tag:
- the tag still resolves
- the manifest still exists
- one child descriptor or blob is gone
- the next pull fails where nobody expected it
That means post-cleanup verification has to be deeper than "the tag name still appears."
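A minimal sketch of that deeper check, assuming the manifest has already been fetched as JSON. The names `child_digests` and `missing_children` are hypothetical, and `exists` is a placeholder for whatever lookup you trust (for example a HEAD request against the registry's blob or manifest endpoint):

```python
from typing import Callable


def child_digests(manifest: dict) -> list[str]:
    """Return every digest that this manifest points at directly."""
    # An image index / manifest list references per-platform manifests.
    if "manifests" in manifest:
        return [entry["digest"] for entry in manifest["manifests"]]
    # A single-image manifest references one config blob plus its layers.
    digests = []
    if "config" in manifest:
        digests.append(manifest["config"]["digest"])
    digests.extend(layer["digest"] for layer in manifest.get("layers", []))
    return digests


def missing_children(manifest: dict, exists: Callable[[str], bool]) -> list[str]:
    """List the child digests that the registry can no longer serve."""
    return [digest for digest in child_digests(manifest) if not exists(digest)]


# Hypothetical data: the tag resolves and the manifest exists, but one layer is gone.
manifest = {
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "config": {"digest": "sha256:cfg"},
    "layers": [{"digest": "sha256:l1"}, {"digest": "sha256:l2"}],
}
present = {"sha256:cfg", "sha256:l1"}
print(missing_children(manifest, present.__contains__))
```

An empty result is the only outcome that counts as "healthy" here. For a multi-architecture tag the same check would need to be repeated on each child manifest, not only on the index.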
The diagram I keep in mind
The chart uses the cleanup workflow phase on the X axis and a relative operator-effort score on the Y axis. It is not a benchmark. It is a way to show where the real time goes: keep-set classification, manifest/blob proof, and post-cleanup validation.
The key lesson there is that the operator cost grows with uncertainty about reachability and rollback, not with the number of characters in the command line.
The command trail I actually trust
```shell
# 1. List what production and failover still reference
kubectl get deploy,ds,cronjob -A -o json | jq '.items[].spec?'

# 2. Inspect manifest structure, not only tag names
./inspect-registry-manifest.sh runtime-tag --check-blobs

# 3. Run the cleanup only after the keep-set is explicit
registry garbage-collect /etc/distribution/config.yml

# 4. Re-run manifest and pull proof on the kept tags
./inspect-registry-manifest.sh runtime-tag --check-blobs
docker pull registry.example.invalid/app:runtime-tag
```
If I skip step 4, I have not actually finished the maintenance.
Here is what those commands are actually doing:
- `kubectl get ...` asks Kubernetes which images the live workloads reference right now, so the cleanup keep-set starts from reality instead of memory
- `inspect-registry-manifest.sh --check-blobs` walks the manifest and child blobs so a tag that merely looks present does not get mistaken for a healthy runtime artifact
- `registry garbage-collect ...` is the destructive phase itself
- the final `inspect-registry-manifest.sh` plus `docker pull` are the proof that the kept runtime path still resolves from the registry all the way down to pullable content
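The output of step 1 is whole workload specs, so the image references still have to be picked out of it. A rough sketch of that reduction, assuming the `kubectl ... -o json` output was saved to a file and loaded as a dict; `workload_images` is a hypothetical helper, not part of any tool named above:

```python
def workload_images(kube_list: dict) -> set[str]:
    """Collect every container image referenced by the listed workloads."""
    images = set()
    for item in kube_list.get("items", []):
        spec = item.get("spec", {})
        # Deployments and DaemonSets keep the pod template at spec.template;
        # CronJobs nest it under spec.jobTemplate.spec.template.
        template = spec.get("template") or (
            spec.get("jobTemplate", {}).get("spec", {}).get("template", {})
        )
        pod_spec = template.get("spec", {})
        for key in ("initContainers", "containers"):
            for container in pod_spec.get(key) or []:
                images.add(container["image"])
    return images


# Hypothetical input shaped like `kubectl get deploy,ds,cronjob -A -o json`.
sample = {
    "kind": "List",
    "items": [
        {"kind": "Deployment", "spec": {"template": {"spec": {
            "initContainers": [{"image": "registry.example.invalid/migrate:runtime-tag"}],
            "containers": [{"image": "registry.example.invalid/app:runtime-tag"}],
        }}}},
        {"kind": "CronJob", "spec": {"jobTemplate": {"spec": {"template": {"spec": {
            "containers": [{"image": "registry.example.invalid/report:runtime-tag"}],
        }}}}}},
    ],
}
for image in sorted(workload_images(sample)):
    print(image)
```

Init containers are included on purpose: they are pulled on every pod start, so a missing init image breaks a restart just as surely as a missing main image.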
One additional check matters a lot in practice: monitor the exact images and tags the Kubernetes workloads are using, and verify that those tags are still resolvable in the registry. That catches registry drift early, before the next restart turns a hidden integrity problem into a live outage.
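That drift check could look roughly like the sketch below. It assumes the image references were already collected from the cluster, and it takes a `resolves` callable as a stand-in for a real registry lookup. Both `split_image` and `find_drift` are hypothetical names, and the reference parsing is deliberately simple:

```python
from typing import Callable, Iterable


def split_image(ref: str) -> tuple[str, str]:
    """Split an image reference into (repository, tag-or-digest)."""
    name, _, digest = ref.partition("@")
    tag = ""
    # Only a colon in the last path segment is a tag separator;
    # a colon earlier in the name is a registry port.
    if ":" in name.rsplit("/", 1)[-1]:
        name, tag = name.rsplit(":", 1)
    return name, digest or tag or "latest"


def find_drift(
    images: Iterable[str], resolves: Callable[[str, str], bool]
) -> list[str]:
    """Return the workload images that the registry can no longer resolve."""
    return sorted(img for img in images if not resolves(*split_image(img)))


# Hypothetical registry state: only one of the two referenced tags still resolves.
known = {("registry.example.invalid/app", "runtime-tag")}
in_use = [
    "registry.example.invalid/app:runtime-tag",
    "registry.example.invalid/app:old",
]
print(find_drift(in_use, lambda repo, ref: (repo, ref) in known))
```

Run on a schedule, a non-empty result is the early warning: the workload is still running from a cached image, but the next restart would fail to pull.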
Failure case: cleanup without path awareness
The most dangerous cleanup is not "too aggressive" in the abstract. It is cleanup that does not know:
- which tags the live cluster uses
- which tags the standby/failover path uses
- which per-architecture descriptors need direct protection
That is how storage maintenance becomes runtime breakage.
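One way to make that path awareness explicit is to compute the keep-set as data before anything destructive runs. The sketch below is an illustration under assumptions: `build_keep_set` is a hypothetical helper, `tag_digests` maps tags to manifest digests, and `manifests` maps digests to already-fetched manifest JSON.

```python
def build_keep_set(live_tags, failover_tags, tag_digests, manifests):
    """Return every digest reachable from the live and failover tags."""
    wanted = set(live_tags) | set(failover_tags)
    unknown = sorted(wanted - set(tag_digests))
    if unknown:
        # A tag the cluster needs but the registry lacks is already an incident.
        raise ValueError(f"tags not found in registry: {unknown}")

    keep = set()
    stack = [tag_digests[tag] for tag in wanted]
    while stack:
        digest = stack.pop()
        if digest in keep:
            continue
        keep.add(digest)
        manifest = manifests.get(digest, {})  # blobs have no manifest entry
        # Per-architecture children of an index need protection too.
        stack.extend(child["digest"] for child in manifest.get("manifests", []))
        if "config" in manifest:
            stack.append(manifest["config"]["digest"])
        stack.extend(layer["digest"] for layer in manifest.get("layers", []))
    return keep


# Hypothetical registry contents: one multi-arch runtime tag, one standby tag.
manifests = {
    "sha256:index": {"manifests": [{"digest": "sha256:amd64"}, {"digest": "sha256:arm64"}]},
    "sha256:amd64": {"config": {"digest": "sha256:cfg-a"}, "layers": [{"digest": "sha256:base"}]},
    "sha256:arm64": {"config": {"digest": "sha256:cfg-b"}, "layers": [{"digest": "sha256:base"}]},
    "sha256:standby": {"config": {"digest": "sha256:cfg-s"}, "layers": []},
}
tag_digests = {
    "runtime-tag": "sha256:index",
    "standby-tag": "sha256:standby",
    "old": "sha256:old",
}
keep = build_keep_set(["runtime-tag"], ["standby-tag"], tag_digests, manifests)
print(sorted(keep))
```

Anything outside `keep` is a cleanup candidate. Anything inside it is what the post-cleanup proof has to re-verify, including the per-architecture manifests that no tag points at directly.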
Why operator time dominates
The expensive part is not CPU. It is:
- reading the current runtime references
- classifying proof tags versus active tags
- checking manifest reachability
- confirming rollback paths were not silently broken
That is real maintenance work even if the registry disk frees space quickly.
What I'd do differently now
I used to think of registry GC as a storage task with a verification step. Now I describe it the other way around: it is a verification-heavy operational workflow that happens to reclaim storage at the end.