By Vicente Arteaga Gomez
MisLinux
This is part of my Kubernetes-on-Hetzner-and-operations series on MisLinux. It describes how I actually *manage* AI coding agents day to day — not the tools themselves, but the operating model that keeps them useful in production work.
Most teams experiment with AI in one of two broken modes:
- Magic button — "fix it" with no context, hope for the best.
- Copilot as autocomplete — fast typing, no durable output.
The model that works for business and operations automation is a third one:
> Treat the agent like a new hire who is fast, literal, and forgetful — brilliant at execution when the brief is clear, dangerous when the brief is vague, and unable to remember yesterday unless you wrote it down.
The contract I use with every agent session
| New-hire parallel | What I ask the AI to do |
|---|---|
| Onboarding packet | Point it at AGENTS.md, relevant SKILL.md, and proof artifacts from the last run |
| Written task brief | State goal, constraints, what must not change, and how success is verified |
| Deliverable, not chat | Require a script/program/test/CIR entry — not "here is what you could run" |
| Code review | Read the diff myself; rerun dry-run/smoke tests before any production path |
| Manager notes | Append Context / Intent / Rationale when behavior or tooling changes |
If the session ends with only prose, I failed the brief.
Step 1: Route manual work through the agent
When I catch myself doing something twice — spreadsheet reconciliation, onboarding checks, config diff, report export — I stop and rephrase:
> "Do this for me, but implement it as a repeatable script with tests and a CIR entry. Dry-run first."
That single sentence forces three outputs:
- Automation: PHP/Python/shell the agent can rerun without the conversation
- Guardrails: PHPUnit or smoke tests on the invariant logic
- Memory: an AGENTS.md bullet explaining *why* the approach exists
The agent is the implementer. I remain the approver.
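As a rough illustration of "dry-run first", here is a minimal Python sketch of a script that only plans changes unless it is explicitly told to apply them. The `--apply` flag, the `run` and `plan_changes` names, and the row fields are assumptions made for this example, not the actual scripts from my repos.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical reconciliation chore.")
    # Dry-run is the default; mutation needs an explicit opt-in flag.
    parser.add_argument("--apply", action="store_true",
                        help="perform the changes instead of only planning them")
    return parser


def plan_changes(rows: list[dict]) -> list[dict]:
    """Pure function: list the rows whose actual value differs from the expected one."""
    return [
        {"id": row["id"], "expected": row["expected"], "actual": row["actual"]}
        for row in rows
        if row["expected"] != row["actual"]
    ]


def run(argv: list[str], rows: list[dict]) -> dict:
    args = build_parser().parse_args(argv)
    changes = plan_changes(rows)
    result = {"mode": "apply" if args.apply else "dry-run",
              "planned": changes, "applied": 0}
    if args.apply:
        # A real script would write to the target system here; this sketch only counts.
        result["applied"] = len(changes)
    return result


SAMPLE_ROWS = [
    {"id": 1, "expected": "active", "actual": "active"},
    {"id": 2, "expected": "active", "actual": "paused"},
]

# With no flags the script reports what it would do and changes nothing.
print(json.dumps(run([], SAMPLE_ROWS), indent=2))
```

Keeping the planning step as a pure function is what makes the guardrail cheap: the tests exercise `plan_changes` without needing any live system.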
Step 2: Write briefs the way you would for a junior engineer
Vague briefs produce confident wrong code. I include:
- Scope boundary — read-only vs production mutation; which namespace/network/sheet
- Inputs — file paths, env vars, credentials *location* (never the secret itself)
- Expected output — JSON shape, exit codes, artifact directory layout
- Verification — exact command I will run to prove it worked
- Discarded options — "do not patch production CronJob inline; fix the generator"
Example weak brief:
> Clean up the onboarding mess.
Example strong brief:
> Add onboard.php --plan read-only mode for YAML manifests. Must label EXTERNAL steps, emit JSON for monitoring, include PHPUnit tests for the diff engine, add a CIR entry in operations/adsystem/AGENTS.md, and save proof under history/<timestamp>/. No live DB writes in this slice.
The second brief is longer because ambiguity is expensive.
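One way to make the brief checklist mechanical is to treat the brief as a small record and refuse to start while any section is empty. The sketch below is an illustration only: the `TaskBrief` class, its field names, and the wording of the sample values are hypothetical and simply mirror the five bullets above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class TaskBrief:
    goal: str
    scope: str = ""                                   # read-only vs mutation, which namespace/sheet
    inputs: list[str] = field(default_factory=list)   # paths, env vars, credential locations
    expected_output: str = ""                         # JSON shape, exit codes, artifact layout
    verification: str = ""                            # the command the reviewer will run
    discarded_options: list[str] = field(default_factory=list)


def missing_sections(brief: TaskBrief) -> list[str]:
    """Return the names of brief sections that are still empty, in declaration order."""
    return [f.name for f in fields(brief) if not getattr(brief, f.name)]


weak = TaskBrief(goal="Clean up the onboarding mess.")

strong = TaskBrief(
    goal="Add onboard.php --plan read-only mode for YAML manifests.",
    scope="Read-only; no live DB writes in this slice.",
    inputs=["YAML manifests"],
    expected_output="JSON for monitoring; proof under history/<timestamp>/",
    verification="Run onboard.php --plan and inspect the JSON (hypothetical wording)",
    discarded_options=["Do not patch production CronJob inline; fix the generator."],
)

print(missing_sections(weak))    # every section except the goal is missing
print(missing_sections(strong))  # nothing missing
```

The check does not judge whether a section is good, only whether it exists, which is usually enough to catch the one-line brief before it reaches the agent.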
Step 3: Review output like a manager, not a spectator
I assume the first pass is wrong somewhere. My review checklist:
- [ ] Did it touch only the files the brief allowed?
- [ ] Does dry-run default to safe?
- [ ] Are tests asserting behavior, not implementation trivia?
- [ ] Does failure output say expected vs actual?
- [ ] Is there a CIR entry with Rationale (not just Context)?
- [ ] Would another operator know how to rerun this in six months?
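The first checklist item can be partly automated. This is a minimal sketch, assuming the list of changed paths comes from something like `git diff --name-only` and that the brief expressed its scope as glob patterns; the function name and the sample paths are made up for illustration.

```python
from fnmatch import fnmatchcase


def files_outside_scope(changed: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return changed paths that match none of the allowed glob patterns."""
    return [
        path for path in changed
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical diff: one file inside the allowed directory, one outside it.
changed_files = ["operations/adsystem/onboard.php", "deploy/cronjob.yaml"]
allowed = ["operations/adsystem/*"]

print(files_outside_scope(changed_files, allowed))
```

Note that `fnmatchcase` lets `*` match across `/`, so `operations/adsystem/*` also covers nested files. The remaining checklist items still need a human read of the diff.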
When something is off, I nudge rather than restart:
> "Keep the planner, but EXTERNAL must not count as WRONG. Add a test. Update CIR with why."
That preserves context already loaded in the session.
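To show what that nudge asks for in code terms, here is a small Python sketch of a planner rule where EXTERNAL steps are reported but never counted as WRONG, plus a test that pins the rule. The step fields (`name`, `expected`, `external`) and the report shape are assumptions for this example, not the real planner.

```python
import json


def classify(step: dict, observed: dict) -> dict:
    """Label one manifest step; the field names here are hypothetical."""
    actual = observed.get(step["name"])
    if step.get("external"):
        status = "EXTERNAL"  # owned by someone else: reported, never a failure
    elif actual == step["expected"]:
        status = "OK"
    else:
        status = "WRONG"
    # Always carry expected and actual so failure output explains itself.
    return {"step": step["name"], "status": status,
            "expected": step["expected"], "actual": actual}


def plan(steps: list[dict], observed: dict) -> dict:
    results = [classify(step, observed) for step in steps]
    wrong = [r for r in results if r["status"] == "WRONG"]
    return {"results": results, "wrong": len(wrong), "exit_code": 1 if wrong else 0}


def test_external_is_not_wrong() -> None:
    steps = [
        {"name": "dns_record", "expected": "present", "external": True},
        {"name": "site_row", "expected": "active"},
    ]
    report = plan(steps, {"site_row": "active"})
    statuses = {r["step"]: r["status"] for r in report["results"]}
    assert statuses == {"dns_record": "EXTERNAL", "site_row": "OK"}
    assert report["exit_code"] == 0


test_external_is_not_wrong()
print(json.dumps(plan([{"name": "site_row", "expected": "active"}], {}), indent=2))
```

The test asserts the behavior the nudge describes (an unmet EXTERNAL step leaves the exit code at zero) rather than how `classify` is written internally.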
Step 4: Demand CIR annotations for every non-obvious choice
Models forget. Repositories should not.
Good CIR captures why, not what:
- Context: onboarding checks were duplicated in chat and cron.
- Intent: one manifest drives plan, apply, and readiness monitoring.
- Rationale: without a single contract, AI regenerated slightly different checks each session; manifest + `--plan` makes drift visible before apply.
I explicitly ask:
> "Add a CIR entry to the nearest AGENTS.md explaining what you tried, what failed, and what would re-break if reverted."
Without that, the next agent (or me in three weeks) "cleans up" the guardrail that prevented an outage.
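A minimal sketch of how "add a CIR entry to the nearest AGENTS.md" could be scripted, assuming "nearest" means the first AGENTS.md found while walking up from the working directory. The function names and the entry layout are illustrative choices, not a fixed format.

```python
from pathlib import Path
from typing import Optional


def nearest_agents_md(start: Path) -> Optional[Path]:
    """Walk from start up to the filesystem root and return the first AGENTS.md found."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / "AGENTS.md"
        if candidate.is_file():
            return candidate
    return None


def format_cir(context: str, intent: str, rationale: str) -> str:
    """Build one entry; refuse entries that skip any of the three parts."""
    for label, value in (("Context", context), ("Intent", intent), ("Rationale", rationale)):
        if not value.strip():
            raise ValueError(f"CIR entry is missing its {label}")
    return f"- Context: {context}\n  Intent: {intent}\n  Rationale: {rationale}\n"


def append_cir(agents_md: Path, entry: str) -> None:
    # Append only, so earlier notes are never rewritten by a later session.
    with agents_md.open("a", encoding="utf-8") as handle:
        handle.write(entry)


entry = format_cir(
    "onboarding checks were duplicated in chat and cron.",
    "one manifest drives plan, apply, and readiness monitoring.",
    "manifest + --plan makes drift visible before apply.",
)
print(entry)
```

Rejecting an empty Rationale is the important part: Context alone records what happened, not why the guardrail must stay.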
Step 5: Reduce manual touchpoints every iteration
Each manual step is a source of randomness:
- Wrong network selected
- Stale token
- Spreadsheet column renamed overnight
- "I thought we already ran that"
My metric: count human decisions per run. If a process still needs ten, the next session removes two — with tests.
Automation does not mean zero humans. It means humans approve gates, not re-type data.
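The metric itself is simple enough to compute from a run description. The sketch below assumes each step of a run is recorded with a hypothetical `human` flag; the step names are invented for illustration.

```python
def human_decisions(steps: list[dict]) -> int:
    """Count the steps in one run that still need a person."""
    return sum(1 for step in steps if step.get("human"))


def removed_since(previous: list[dict], current: list[dict]) -> int:
    """How many human decisions the latest iteration removed (negative if it added some)."""
    return human_decisions(previous) - human_decisions(current)


# Hypothetical run descriptions before and after one iteration.
last_run = [
    {"name": "pick network", "human": True},
    {"name": "paste token", "human": True},
    {"name": "export report", "human": True},
    {"name": "approve apply", "human": True},
]
this_run = [
    {"name": "pick network", "human": False},   # now read from the manifest
    {"name": "paste token", "human": False},    # now read from the credentials location
    {"name": "export report", "human": True},
    {"name": "approve apply", "human": True},   # approval gate stays human
]

print(human_decisions(this_run), removed_since(last_run, this_run))
```

The approval gate is deliberately left in the count: the goal is fewer decisions per run, not a run with no one accountable.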
What this looks like in practice
Recent classes of work where the new-hire model helped:
| Chore | Agent built | I approved |
|---|---|---|
| Publisher onboarding diff | YAML manifest + --plan library | Read-only proof JSON before any apply |
| Control panel ad-request filter | Second-pass PHP + PHPUnit | One-off kubectl job logs |
| Blogger publish recovery | CDP publish helper fixes + crawl gate | Public HTML verification |
| Finance month fill | Mapping libraries + failure email contract | Dry-run Job in cluster |
In every case, the valuable artifact was not the chat. It was the script + test + CIR triad.
Anti-patterns I stop early
| Anti-pattern | Why it fails |
|---|---|
| "Just do it live" | No replay, no audit trail |
| Accepting code without tests | Regresses silently on next model pass |
| Letting AI edit production YAML by hand | Drifts from generator source of truth |
| Skipping CIR "because it is obvious" | Obvious fades in two weeks |
| Treating refusal as failure | A good agent should block unsafe mutations |
Sibling posts on this blog
- How I Use CIR Notes to Make AI Coding Agents More Useful
- Business Automation Through AI: Scripts, Tests, and AGENTS.md
- How I Fixed Cursor Getting Slower Every Message
Together they describe one strategy: delegate execution to the agent, keep judgment and memory in the repo.
---
Independence note: AI tools mentioned (Copilot, Claude, Codex, Cursor) reflect my stack. No vendor sponsorship implied.