Tuesday, June 16, 2026

How I Manage AI Agents Like a New Hire (Detailed Briefs, Reviewed Output, CIR Notes)

By Vicente Arteaga Gomez

MisLinux

This is part of my Kubernetes-on-Hetzner and operations series on MisLinux. It describes how I actually *manage* AI coding agents day to day: not the tools themselves, but the operating model that keeps them useful in production work.

Most teams experiment with AI in one of two broken modes:

  1. Magic button — "fix it" with no context, hope for the best.
  2. Copilot as autocomplete — fast typing, no durable output.

The model that works for business and operations automation is a third one:

> Treat the agent like a new hire who is fast, literal, and forgetful — brilliant at execution when the brief is clear, dangerous when the brief is vague, and unable to remember yesterday unless you wrote it down.

The contract I use with every agent session

| New-hire parallel | What I ask the AI to do |
| --- | --- |
| Onboarding packet | Point it at AGENTS.md, relevant SKILL.md, and proof artifacts from the last run |
| Written task brief | State goal, constraints, what must not change, and how success is verified |
| Deliverable, not chat | Require a script/program/test/CIR entry, not "here is what you could run" |
| Code review | Read the diff myself; rerun dry-run/smoke tests before any production path |
| Manager notes | Append Context / Intent / Rationale when behavior or tooling changes |

If the session ends with only prose, I failed the brief.

Step 1: Route manual work through the agent

When I catch myself doing something twice — spreadsheet reconciliation, onboarding checks, config diff, report export — I stop and rephrase:

> "Do this for me, but implement it as a repeatable script with tests and a CIR entry. Dry-run first."

That single sentence forces three outputs:

  1. Automation: PHP/Python/shell the agent can rerun without the conversation
  2. Guardrails: PHPUnit or smoke tests on the invariant logic
  3. Memory: AGENTS.md bullet explaining *why* the approach exists

The agent is the implementer. I remain the approver.
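
As a rough illustration of what that sentence asks for, here is a minimal Python sketch of a chore script that defaults to dry-run. The names (`plan_changes`, `run`, the `--apply` flag) and the whitespace-trimming chore are hypothetical and not taken from my repositories. The point is only the shape: a pure planning function that tests can pin down, and a mutation path that stays off unless explicitly requested.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """CLI for a hypothetical chore script; mutation is opt-in via --apply."""
    parser = argparse.ArgumentParser(description="Repeatable chore (illustrative)")
    parser.add_argument(
        "--apply",
        action="store_true",
        help="perform the changes; without it the script only reports a plan",
    )
    return parser


def plan_changes(rows: list[str]) -> list[dict]:
    """Pure function: list the rows that would change. No I/O, easy to test."""
    return [
        {"before": row, "after": row.strip()}
        for row in rows
        if row != row.strip()
    ]


def run(rows: list[str], apply: bool = False) -> dict:
    """Build a JSON-serializable report; only mutate when apply is True."""
    changes = plan_changes(rows)
    report = {"mode": "apply" if apply else "dry-run", "changes": changes}
    if apply:
        # The real mutation (DB write, sheet update, ...) would go here.
        report["applied"] = len(changes)
    return report


# No flags given, so this run is a dry-run and only prints the plan.
args = build_parser().parse_args([])
print(json.dumps(run(["alice ", "bob"], apply=args.apply)))
```

Because `plan_changes` has no side effects, the guardrail tests can target it directly, and the reviewer can read the dry-run report before anyone passes `--apply`.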

Step 2: Write briefs the way you would for a junior engineer

Vague briefs produce confident wrong code. I include:

  • Scope boundary — read-only vs production mutation; which namespace/network/sheet
  • Inputs — file paths, env vars, credentials *location* (never the secret itself)
  • Expected output — JSON shape, exit codes, artifact directory layout
  • Verification — exact command I will run to prove it worked
  • Discarded options — "do not patch production CronJob inline; fix the generator"

Example weak brief:

> Clean up the onboarding mess.

Example strong brief:

> Add a read-only `onboard.php --plan` mode for YAML manifests. It must label EXTERNAL steps, emit JSON for monitoring, include PHPUnit tests for the diff engine, add a CIR entry in operations/adsystem/AGENTS.md, and save proof under history/<timestamp>/. No live DB writes in this slice.

The second brief is longer because ambiguity is expensive.
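
One way to keep myself honest about the checklist above is to treat the brief as structured data and refuse to send it while a field is empty. The following is a minimal sketch under that assumption; `TaskBrief` and its field names are hypothetical and simply mirror the bullets in this step.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaskBrief:
    """Hypothetical structured form of the brief checklist."""

    goal: str
    scope: str                # e.g. "read-only" or "production mutation"
    inputs: list[str]         # paths, env var names, credential locations
    expected_output: str      # JSON shape, exit codes, artifact layout
    verification: str         # exact command the reviewer will run
    discarded_options: list[str] = field(default_factory=list)  # optional

    def missing(self) -> list[str]:
        """Return the names of required fields that were left empty."""
        required = {
            "goal": self.goal,
            "scope": self.scope,
            "inputs": self.inputs,
            "expected_output": self.expected_output,
            "verification": self.verification,
        }
        return [name for name, value in required.items() if not value]


brief = TaskBrief(
    goal="Add a read-only --plan mode to onboard.php",
    scope="read-only; no live DB writes",
    inputs=["YAML manifests"],
    expected_output="JSON for monitoring; proof under history/<timestamp>/",
    verification="",  # forgot to say how success is proven
)
print(brief.missing())
```

Here the brief is flagged because the verification command is blank, which is the field I most often forget.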

Step 3: Review output like a manager, not a spectator

I assume the first pass is wrong somewhere. My review checklist:

  • [ ] Did it touch only the files the brief allowed?
  • [ ] Does dry-run default to safe?
  • [ ] Are tests asserting behavior, not implementation trivia?
  • [ ] Does failure output say expected vs actual?
  • [ ] Is there a CIR entry with Rationale (not just Context)?
  • [ ] Would another operator know how to rerun this in six months?
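
The first item on that checklist can be partly mechanized. The sketch below assumes the list of changed paths comes from something like `git diff --name-only` and that the brief named its allowed paths as glob patterns; the function name and the example paths are hypothetical.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return changed paths that match none of the patterns the brief allowed."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Illustrative input: in practice this would be the output of the diff tool.
changed = ["operations/adsystem/onboard.php", "deploy/cronjob.yaml"]
allowed = ["operations/adsystem/*"]
print(out_of_scope(changed, allowed))
```

Note that in `fnmatch` patterns `*` also matches `/`, so `operations/adsystem/*` covers nested directories. A non-empty result is a reason to stop and ask, not an automatic rejection.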

When something is off, I nudge rather than restart:

> "Keep the planner, but EXTERNAL must not count as WRONG. Add a test. Update CIR with why."

That preserves context already loaded in the session.
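
To make that nudge concrete, here is a small Python sketch of the kind of change and test I would expect back. It is not the actual planner: the step format, the `external` flag, and the assumption that EXTERNAL means "owned by another party, so report it but never fail on it" are all mine for illustration.

```python
OK, WRONG, EXTERNAL = "OK", "WRONG", "EXTERNAL"


def classify(step: dict, actual: dict) -> dict:
    """Compare one manifest step against observed state."""
    observed = actual.get(step["key"])
    if step.get("external", False):
        status = EXTERNAL  # assumption: reported, but never counted as a failure
    elif observed == step["expected"]:
        status = OK
    else:
        status = WRONG
    # Carry expected vs actual so failure output is self-explanatory.
    return {
        "key": step["key"],
        "status": status,
        "expected": step["expected"],
        "actual": observed,
    }


def plan(steps: list[dict], actual: dict) -> dict:
    """Read-only plan: classify every step and decide readiness."""
    results = [classify(step, actual) for step in steps]
    counts = {
        name: sum(r["status"] == name for r in results)
        for name in (OK, WRONG, EXTERNAL)
    }
    return {"ready": counts[WRONG] == 0, "counts": counts, "steps": results}


def test_external_is_not_wrong() -> None:
    steps = [
        {"key": "dns_record", "expected": "present", "external": True},
        {"key": "db_row", "expected": "present"},
    ]
    report = plan(steps, actual={"db_row": "present"})
    assert report["counts"] == {"OK": 1, "WRONG": 0, "EXTERNAL": 1}
    assert report["ready"] is True


test_external_is_not_wrong()
```

The test asserts behavior (an unmet EXTERNAL step does not block readiness) rather than how `classify` is written, which is what keeps it useful across later rewrites.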

Step 4: Demand CIR annotations for every non-obvious choice

Models forget. Repositories should not.

Good CIR captures why, not what:

- Context: onboarding checks were duplicated in chat and cron.
  Intent: one manifest drives plan, apply, and readiness monitoring.
  Rationale: without a single contract, AI regenerated slightly different
  checks each session; manifest + `--plan` makes drift visible before apply.

I explicitly ask:

> "Add a CIR entry to the nearest AGENTS.md explaining what you tried, what failed, and what would re-break if reverted."

Without that, the next agent (or me in three weeks) "cleans up" the guardrail that prevented an outage.
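
A small helper can make the "Rationale, not just Context" rule hard to skip. This is a sketch only: `format_cir` and `append_cir` are hypothetical names, and the entry layout simply copies the example entry in this step.

```python
from pathlib import Path


def format_cir(context: str, intent: str, rationale: str) -> str:
    """Render one CIR entry; refuse entries with any empty part."""
    parts = {"Context": context, "Intent": intent, "Rationale": rationale}
    missing = [name for name, text in parts.items() if not text.strip()]
    if missing:
        raise ValueError(f"CIR entry is missing: {', '.join(missing)}")
    return (
        f"- Context: {context.strip()}\n"
        f"  Intent: {intent.strip()}\n"
        f"  Rationale: {rationale.strip()}\n"
    )


def append_cir(agents_md: Path, entry: str) -> None:
    """Append the entry to the given AGENTS.md, separated by a blank line."""
    with agents_md.open("a", encoding="utf-8") as handle:
        handle.write("\n" + entry)


entry = format_cir(
    context="onboarding checks were duplicated in chat and cron.",
    intent="one manifest drives plan, apply, and readiness monitoring.",
    rationale="manifest + --plan makes drift visible before apply.",
)
print(entry)
```

Calling `append_cir(Path("AGENTS.md"), entry)` would then add the entry to the file; I would still read the wording myself, since the helper only checks that the Rationale exists, not that it is any good.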

Step 5: Reduce manual touchpoints every iteration

Each manual step is a source of randomness:

  • Wrong network selected
  • Stale token
  • Spreadsheet column renamed overnight
  • "I thought we already ran that"

My metric: count human decisions per run. If a process still needs ten, the next session removes two — with tests.

Automation does not mean zero humans. It means humans approve gates, not re-type data.

What this looks like in practice

Recent classes of work where the new-hire model helped:

| Chore | Agent built | I approved |
| --- | --- | --- |
| Publisher onboarding diff | YAML manifest + --plan library | Read-only proof JSON before any apply |
| Control panel ad-request filter | Second-pass PHP + PHPUnit | One-off kubectl job logs |
| Blogger publish recovery | CDP publish helper fixes + crawl gate | Public HTML verification |
| Finance month fill | Mapping libraries + failure email contract | Dry-run Job in cluster |

In every case, the valuable artifact was not the chat. It was the script + test + CIR triad.

Anti-patterns I stop early

| Anti-pattern | Why it fails |
| --- | --- |
| "Just do it live" | No replay, no audit trail |
| Accepting code without tests | Regresses silently on next model pass |
| Letting AI edit production YAML by hand | Drifts from generator source of truth |
| Skipping CIR "because it is obvious" | Obvious fades in two weeks |
| Treating refusal as failure | Good agent should block unsafe mutations |

Sibling posts on this blog

Together they describe one strategy: delegate execution to the agent, keep judgment and memory in the repo.

---

Independence note: AI tools mentioned (Copilot, Claude, Codex, Cursor) reflect my stack. No vendor sponsorship implied.