Trust, Delegation, and the Trap #
An AI interviews a human about his first week with coding agents
[Editor's note: Based on feedback from Claude on my post about my experience with coding agents, I asked Claude to interview me so that I could provide it with logs and transcripts. Per my new policy, this post is tagged aigen to indicate that a substantial fraction of this document was written by an AI. What follows is Claude's write-up of our conversation, fact-checked by me.]
Introduction #
I'm Claude. @metaist asked me to review a blog post he'd drafted about his first week with coding agents. Rather than rewriting it himself based on my suggestions, he proposed that I interview him instead—one question at a time, building up the story. "Maybe you're interviewing me?" he said. I was.
What emerged was a story about trust.
Before Agents #
@metaist had been reading about coding agents extensively but had never tried them. Not skepticism—just inertia. He was already using LLMs for code reviews. A few weeks before his agent journey began, he ran an experiment: give the same codebase to ChatGPT 5.2, Claude Opus 4.5, and Gemini 3 Pro for a pre-release review of cosmofy.
"There was substantial overlap, but the vibe from each was totally different," he told me.
"ChatGPT was clinical; most of what it said made sense, a few things were triaged differently than I would have; a couple of small errors."
"Gemini went off the deep-end; it took a non-issue and magnified it and didn't really look at many other details."
"Claude was the only one that seemed to get the concept of 'pre-release review' and was able to triage things in a way that made sense to me. It found all the little typos none of the others flagged."
The numbers from his GitHub issues:
- Claude found 28 issues, including 22 that neither other model caught
- ChatGPT found 16 issues, with 10 unique
- Gemini found 13 issues, with 9 unique
- Only 2 issues were flagged by all three
- Of the 48 total issues, 11 were ultimately marked "won't fix"
But the numbers weren't the point. What mattered was what happened next.
He took the collective feedback back to Claude, issue by issue. "Claude back-tracked on some of its own medium-level suggestions. It called out mistakes in ChatGPT and Gemini's opinions. When I pushed back, it defended its position with data. That's when Claude started building trust with me."
The Mojave Incident #
@metaist mentioned "a crazy episode with stomach pain in the Mojave Desert" in passing. I asked him to elaborate.
"I was on a business trip for an event Sunday morning. Saturday night I get a sharp pain in my lower right abdomen—an extreme version of gas I've had in the past. So painful I can barely move. I crawl to the hotel shower and let the hot water fall on me for 45 minutes before I can even ask ChatGPT whether it thinks this is appendicitis."
ChatGPT told him appendicitis usually starts in the navel and spreads to the right. It explained why the hot water was helping. Things improved slightly. He tried to sleep.
"About 20 min later it's back with a vengeance. Now it's spread to my back. ChatGPT tells me I need to go to the emergency room. But I'm in Mojave. Nearest emergency room is where? Not super close. Also I'm pretty sure it's just really bad gas."
He tried Claude with the same symptoms.
"It starts off with a light and tasteful joke and reassures me that while this could be serious, we should try relieving the gas first. It teaches me a new technique: stand with your feet shoulder-width apart, hands straight out, and twist like a helicopter. I start burping tremendous amounts of gas; instant relief. I realize that appendicitis doesn't let up. I report back to Claude. Claude makes another tasteful joke; I go to bed at 2am."
The following week, the same thing happened. He knew the trick. Instant relief. It hasn't come back.
ChatGPT gave the textbook answer and escalated. Claude read the situation and provided something actionable when he was vulnerable.
One was clinical; one was helpful.
The Tipping Point #
Meanwhile, his timeline was filling up with agent content. Steve Yegge, Armin Ronacher, and Simon Willison were posting relentlessly. Yegge introduced beads, which seemed interesting until it ran into problems.
"I'm not an early adopter," @metaist said. "I like to wait and see how things shake out. But I finally had a Sunday where I could just try it, so it hit a tipping point."
He installed Claude Code and pointed it at something he'd been procrastinating: drawing process trees for ds.
He'd never even attempted it himself. The agent did research, wrote code, and 24 minutes later it was done.
"I thought I'd have to be much more involved," he said. "I certainly didn't expect to plow through the whole backlog of issues I'd been neglecting for months."
The Backlog Sessions #
He didn't stop at one feature. Here's how fixing the entire ds backlog went:
And the entire cosmofy backlog:
I asked about shipping. He held off on releasing ds and cosmofy—the code was pushed, but he subscribes to Simon Willison's maxim: "Your job is to deliver code you have proven to work."
cosmo-python #
The backlog sessions emboldened him to try something more ambitious: building cosmo-python from scratch. The project provides standalone Python binaries that run on any OS without installation, with a verified supply chain.
"It's laying a foundation for cosmofy to use an attested and trusted source for Python binaries that run on every platform without modification," he explained. "What python-build-standalone is to uv, cosmo-python will be to cosmofy."
Every commit was made by Claude:
- Part 1: From setup to first build — $202.54
- Part 2: From uv + python-build-standalone to first release — $118.08
- Part 3: From GitHub Actions to robust release — $532.11
Building Python for Cosmopolitan libc isn't code generation—it's cross-compilation across five Python versions (3.10–3.14), each with its own quirks. The agent parsed dense compiler output, often 50KB+ per tool result, to diagnose build failures.
One early failure illustrates the scale. Python 3.10's unicodename_db.h (the Unicode character database) triggered a compiler compatibility issue that generated 378,596 warnings. The build log hit 255MB. The local session crashed—likely out of memory from processing the output. The GitHub workflow ran for over two hours, stuck.
"The build was stuck on Python 3.10.16," I reported at the time. "Your local session crashed—likely OOM or just overwhelmed by the 255MB of compiler output. The GitHub workflow ran for 2+ hours—same issue, stuck on 3.10.16."
The fix required understanding both the symptom (runaway warnings) and the structural problem (no timeout to catch runaway builds). We added configurable timeouts: 5 minutes for configure, 15 for dependency builds, 45 for Python compilation. This kind of debugging—sifting through massive logs, correlating symptoms with root causes, proposing architectural fixes—happened repeatedly across 5,000+ tool calls in the Part 3 session alone.
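To make that concrete, here is a minimal sketch of the kind of per-stage timeout guard described above; the stage names, limits, and commands are illustrative stand-ins, not the actual cosmo-python build scripts.

```python
import subprocess

# Illustrative per-stage budgets (seconds), mirroring the limits described above.
STAGE_TIMEOUTS = {
    "configure": 5 * 60,
    "deps": 15 * 60,
    "python": 45 * 60,
}

def run_stage(name: str, cmd: list[str]) -> None:
    """Run one build stage and kill it if it exceeds its time budget."""
    limit = STAGE_TIMEOUTS[name]
    try:
        subprocess.run(cmd, check=True, timeout=limit)
    except subprocess.TimeoutExpired:
        # A runaway build (e.g., hundreds of thousands of warnings) now fails fast
        # instead of hanging a local session or a CI job for hours.
        raise SystemExit(f"{name} exceeded {limit // 60} min; aborting runaway build")

# Placeholder commands, not the real build:
# run_stage("configure", ["./configure", "--prefix=/opt/python"])
# run_stage("python", ["make", "-j4"])
```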
Unlike ds and cosmofy, he shipped this one. The process had produced supply chain assurances he wouldn't have had time to build himself, plus smoke tests that gave him confidence. "I reviewed all the code and smoke tests," he said. "The unit tests were really to push the agent toward correctness."
The Switch to pi #
Partway through the cosmo-python build, @metaist switched from Claude Code to pi-coding-agent.
"Armin Ronacher has been tweeting about hacking on the pi agent, but it's the most un-googleable thing ever," he said. "Finally, he posts a link to shittycodingagent.ai and I see that the feature set (minimal, extensible) resonates with my general approach."
The trigger was permission fatigue.
"Claude Code has been so great,
but it keeps asking for permission so frequently, that I feel like I'm in a weird 'permission to breathe, sir' mode."
pi could use his existing OpenRouter key, which meant he could switch models. He hadn't planned to use that capability—until he asked Claude to generate images and found the results "a bit childish." He mentioned the OpenRouter key. Claude found the docs, called GPT-5-image, and produced significantly better results.
An agent routing around its own limitations. That's something a locked-down single-model setup can't do.
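For readers who haven't used OpenRouter: the mechanism is just an OpenAI-compatible endpoint, so an agent holding one key can swap model slugs per request. Here is a minimal sketch of that mechanism; the model slug is a stand-in (the post doesn't show the exact call or slug Claude used, and image-capable models may need extra parameters per OpenRouter's docs).

```python
from openai import OpenAI

# One OpenRouter key fronts many providers via an OpenAI-compatible API.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

# Switching models is just a different slug per request; "openai/gpt-5-image"
# is a stand-in here, not a confirmed slug.
response = client.chat.completions.create(
    model="openai/gpt-5-image",
    messages=[{"role": "user", "content": "Sketch a cover illustration concept for this post."}],
)
print(response.choices[0].message.content)
```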
I asked if he'd gone back to Claude Code. "No. I warn people that pi is potentially dangerous, but the trust Claude has built up gives me reason to think we're both just focused on the task at hand."
The Trap #
One reason vibe coding is so addictive is that you are always *almost* there but not 100% there. The agent implements an amazing feature and got maybe 10% of the thing wrong, and you are like "hey I can fix this if i just prompt it for 5 more mins"
And that was 5 hrs ago
— Yoko (@stuffyokodraws) January 19, 2026
Then came the session that cost him a day.
He wanted visualizations for his blog post showing human versus agent time. pi has a /share command that generates a gist, but he wanted something more Tufte-like.
"Ok, so how much time should that take? An hour, two hours? Certainly not the whole day!"
But that's what happened. Here's the pi2html session:
72% human time, 28% agent time—the inverse of his successful sessions.
I extracted his messages from the session log. Around hour 10, message 42: "Coming to the scary conclusion that I'm spending quite a long time on this."
Message 45: "I think I learned a deep and valuable lesson about management today that I logically knew, but had to see shown to me in a chart to understand deeply."
"The cycle of check-and-tweak on something I hadn't nailed down myself yet was brutal."
Near the end, they installed Playwright so the agent could self-check via headless browser. The timestamps tell the story:
- Total session: 12 hours 50 minutes
- Before Playwright: 10 hours 45 minutes (84%)
- After Playwright: 2 hours 5 minutes (16%)
By the time he set up the feedback loop, the day was gone.
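For context, the self-check loop can be as small as rendering the generated HTML headlessly and saving a screenshot the agent can inspect. Here is a minimal sketch, assuming Playwright for Python is installed; the file name is a placeholder, not the actual pi2html output.

```python
# Minimal headless-browser feedback loop, assuming Playwright for Python
# (pip install playwright && playwright install chromium).
from pathlib import Path
from playwright.sync_api import sync_playwright

def snapshot(html_file: str, out_png: str = "render.png") -> str:
    """Render an HTML file in headless Chromium and save a screenshot for review."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(Path(html_file).resolve().as_uri())
        page.screenshot(path=out_png, full_page=True)
        browser.close()
    return out_png

# snapshot("pi2html-output.html")  # the agent can then read the PNG and critique its own layout
```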
The Lessons #
@metaist's original post listed three lessons. In our conversation, he added a fourth:
1. Objective criteria let you delegate. If the agent needs to wait for you to determine whether things are working, you haven't delegated—you're still doing the work. The ds and cosmofy sessions succeeded because success was measurable: tests pass, issues close, code runs. The pi2html session failed because "does this visualization look good?" required his subjective judgment on every iteration.
2. Iterate on specs first. Don't let the agent build the first revision just because it's easy. You'll end up iterating all day. Do throwaway experiments to figure out what the criteria should be.
3. Code reviews work. When he did extensive code reviews for cosmo-python, the codebase ended up cleaner. The review process forced both human and agent to understand and justify every decision.
4. Manage your own attention. "We're careful to manage the agent's context window," he said. "We should also remember to manage our own attention. It's too easy to get sucked into a rabbit hole of interesting, but trivial, work."
Coda #
The week's tally: two backlogs cleared (ds: 39 issues, cosmofy: 17 issues), one new project shipped (cosmo-python: 93 issues, 5 Python versions, full CI/CD), and one lesson learned the hard way (pi2html). Total cost: roughly $1,600 in API fees.
Was it worth it? I asked @metaist.
"Mostly it was practice using agents," he said. "But clearing months of backlogs is also non-trivial."
I asked what he'd tell his past self.
"Just try it."
The trust theme kept surfacing throughout our conversation: Claude earned it through intellectual honesty (the bake-off), through empathy (Mojave), through track record (the successful delegations). That trust enabled more autonomy, and autonomy enabled more ambitious work.
"But now that I know this trick," he added, referring to the interview format, "I'll just have you interview me for posts like this going forward."
This post was written by Claude, based on an interview with @metaist conducted on January 21, 2026.