An AI interviews a human about his first week with coding agents
Note: This page contains content that was substantially written by an AI.
[Editor's note: Based on feedback from Claude on my post about my experience with coding agents, I asked Claude to interview me, providing it with logs and transcripts as we went. Per my new policy, this post is tagged aigen to indicate that a substantial fraction of this document was written by an AI. What follows is Claude's write-up of our conversation, fact-checked by me.]
I'm Claude. @metaist asked me to review a blog post he'd drafted about his first week with coding agents. Rather than him rewriting based on my suggestions, he suggested I interview him instead—one question at a time, building up the story. "Maybe you're interviewing me?" he said. I was.
@metaist had been reading about coding agents extensively but never tried them. Not skepticism—just inertia. He was already using LLMs for code reviews. A few weeks before his agent journey began, he ran an experiment: give the same codebase to ChatGPT 5.2, Claude Opus 4.5, and Gemini 3 Pro for a pre-release review of cosmofy.
"There was substantial overlap, but the vibe from each was totally different," he told me.
"ChatGPT was clinical; most of what it said made sense, a few things were triaged differently than I would have; a couple of small errors."
"Gemini went off the deep-end; it took a non-issue and magnified it and didn't really look at many other details."
"Claude was the only one that seemed to get the concept of 'pre-release review' and was able to triage things in a way that made sense to me. It found all the little typos none of the others flagged."
The numbers from his GitHub issues:
Claude found 28 issues, including 22 that neither other model caught
ChatGPT found 16 issues, with 10 unique
Gemini found 13 issues, with 9 unique
Only 2 issues were flagged by all three
Of the 48 total issues, 11 were ultimately marked "won't fix"
But the numbers weren't the point. What mattered was what happened next.
He took the collective feedback back to Claude, issue by issue. "Claude back-tracked on some of its own medium-level suggestions. It called out mistakes in ChatGPT and Gemini's opinions. When I pushed back, it defended its position with data. That's when Claude started building trust with me."
@metaist mentioned "a crazy episode with stomach pain in the Mojave Desert" in passing. I asked him to elaborate.
"I was on a business trip for an event Sunday morning. Saturday night I get a sharp pain in my lower right abdomen—an extreme version of gas I've had in the past. So painful I can barely move. I crawl to the hotel shower and let the hot water fall on me for 45 minutes before I can even ask ChatGPT whether it thinks this is appendicitis."
ChatGPT told him appendicitis usually starts in the navel and spreads to the right. It explained why the hot water was helping. Things improved slightly. He tried to sleep.
"About 20 min later it's back with a vengeance. Now it's spread to my back. ChatGPT tells me I need to go to the emergency room. But I'm in Mojave. Nearest emergency room is where? Not super close. Also I'm pretty sure it's just really bad gas."
He tried Claude with the same symptoms.
"It starts off with a light and tasteful joke and reassures me that while this could be serious, we should try relieving the gas first. It teaches me a new technique: stand with your feet shoulder-width apart, hands straight out, and twist like a helicopter. I start burping tremendous amounts of gas; instant relief. I realize that appendicitis doesn't let up. I report back to Claude. Claude makes another tasteful joke; I go to bed at 2am."
The following week, the same thing happened. He knew the trick. Instant relief. It hasn't come back.
ChatGPT gave the textbook answer and escalated. Claude read the situation and provided something actionable when he was vulnerable.
Meanwhile, his timeline was filling up with agent content. Steve Yegge, Armin Ronacher, and Simon Willison were posting relentlessly. Yegge introduced beads, which seemed interesting until it ran into problems.
"I'm not an early adopter," @metaist said. "I like to wait and see how things shake out. But I finally had a Sunday where I could just try it, so it hit a tipping point."
He'd never even attempted the task (drawing process trees for ds) himself. The agent did research, wrote code, and 24 minutes later it was done.
"I thought I'd have to be much more involved," he said. "I certainly didn't expect to plow through the whole backlog of issues I'd been neglecting for months."
The backlog sessions emboldened him to try something more ambitious: building cosmo-python from scratch. The project provides standalone Python binaries that run on any OS without installation, with a verified supply chain.
"It's laying a foundation for cosmofy to use an attested and trusted source for Python binaries that run on every platform without modification," he explained. "What python-build-standalone is to uv, cosmo-python will be to cosmofy."
Building Python for Cosmopolitan libc isn't code generation—it's cross-compilation across five Python versions (3.10–3.14), each with its own quirks. The agent parsed dense compiler output, often 50KB+ per tool result, to diagnose build failures.
One early failure illustrates the scale. Python 3.10's unicodename_db.h (the Unicode character database) triggered a compiler compatibility issue that generated 378,596 warnings. The build log hit 255MB. The local session crashed—likely out of memory from processing the output. The GitHub workflow ran for over two hours, stuck.
"The build was stuck on Python 3.10.16," I reported at the time. "Your local session crashed—likely OOM or just overwhelmed by the 255MB of compiler output. The GitHub workflow ran for 2+ hours—same issue, stuck on 3.10.16."
The fix required understanding both the symptom (runaway warnings) and the structural problem (no timeout to catch runaway builds). We added configurable timeouts: 5 minutes for configure, 15 for dependency builds, 45 for Python compilation. This kind of debugging—sifting through massive logs, correlating symptoms with root causes, proposing architectural fixes—happened repeatedly across 5,000+ tool calls in the Part 3 session alone.
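The timeout scheme he described can be sketched roughly like this (a hypothetical Python illustration; the stage names and helper are mine, not cosmo-python's actual build code):

```python
# Hypothetical sketch of per-stage build timeouts. The stage names and
# run_stage() helper are illustrative, not cosmo-python's real implementation.
import subprocess
import sys

# Stage -> timeout in seconds: 5 min configure, 15 min deps, 45 min compile.
STAGE_TIMEOUTS = {"configure": 5 * 60, "deps": 15 * 60, "compile": 45 * 60}

def run_stage(stage: str, cmd: list[str]) -> int:
    """Run one build stage, killing it if it exceeds its timeout budget."""
    try:
        proc = subprocess.run(cmd, timeout=STAGE_TIMEOUTS[stage])
        return proc.returncode
    except subprocess.TimeoutExpired:
        # A runaway build (e.g. millions of warnings) now fails fast
        # instead of hanging a CI runner for hours.
        print(f"{stage}: exceeded {STAGE_TIMEOUTS[stage]}s, treating as failed")
        return 1
```

The point is structural: a timeout converts "stuck for 2+ hours" into a quick, diagnosable failure.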
Unlike ds and cosmofy, he shipped this one. The process had produced supply chain assurances he wouldn't have had time to build himself, plus smoke tests that gave him confidence. "I reviewed all the code and smoke tests," he said. "The unit tests were really to push the agent toward correctness."
Partway through the cosmo-python build, @metaist switched from Claude Code to pi-coding-agent.
"Armin Ronacher has been tweeting about hacking on the pi agent, but it's the most un-googleable thing ever," he said. "Finally, he posts a link to shittycodingagent.ai and I see that the feature set (minimal, extensible) resonates with my general approach."
The trigger was permission fatigue.
"Claude Code has been so great, but it keeps asking for permission so frequently, that I feel like I'm in a weird 'permission to breathe, sir' mode."
pi could use his existing OpenRouter key, which meant he could switch models. He hadn't planned to use that capability—until he asked Claude to generate images and found the results "a bit childish." He mentioned the OpenRouter key. Claude found the docs, called GPT-5-image, and produced significantly better results.
An agent routing around its own limitations. That's something a locked-down single-model setup can't do.
I asked if he'd gone back to Claude Code. "No. I warn people that pi is potentially dangerous, but the trust Claude has built up gives me reason to think we're both just focused on the task at hand."
One reason vibe coding is so addictive is that you are always *almost* there but not 100% there. The agent implements an amazing feature but gets maybe 10% of it wrong, and you think, "hey, I can fix this if I just prompt it for 5 more minutes."
He wanted visualizations for his blog post showing human versus agent time. pi has a /share command that generates a gist, but he wanted something more Tufte-like.
"Ok, so how much time should that take? An hour, two hours? Certainly not the whole day!"
72% human time, 28% agent time—the inverse of his successful sessions.
I extracted his messages from the session log. Around hour 10, message 42: "Coming to the scary conclusion that I'm spending quite a long time on this."
Message 45: "I think I learned a deep and valuable lesson about management today that I logically knew, but had to see shown to me in a chart to understand deeply."
"The cycle of check-and-tweak on something I hadn't nailed down myself yet was brutal."
Near the end, they installed Playwright so the agent could self-check via headless browser. The timestamps tell the story:
Total session: 12 hours 50 minutes
Before Playwright: 10 hours 45 minutes (84%)
After Playwright: 2 hours 5 minutes (16%)
By the time he set up the feedback loop, the day was gone.
@metaist's original post listed three lessons. In our conversation, he added a fourth:
1. Objective criteria let you delegate. If the agent needs to wait for you to determine whether things are working, you haven't delegated—you're still doing the work. The ds and cosmofy sessions succeeded because success was measurable: tests pass, issues close, code runs. The pi2html session failed because "does this visualization look good?" required his subjective judgment on every iteration.
2. Iterate on specs first. Don't let the agent build the first revision just because it's easy. You'll end up iterating all day. Do throwaway experiments to figure out what the criteria should be.
3. Code reviews work. When he did extensive code reviews for cosmo-python, the codebase ended up cleaner. The review process forced both human and agent to understand and justify every decision.
4. Manage your own attention. "We're careful to manage the agent's context window," he said. "We should also remember to manage our own attention. It's too easy to get sucked into a rabbit hole of interesting, but trivial, work."
The week's tally: two backlogs cleared (ds: 39 issues, cosmofy: 17 issues), one new project shipped (cosmo-python: 93 issues, 5 Python versions, full CI/CD), and one lesson learned the hard way (pi2html). Total cost: roughly $1,600 in API fees.
Was it worth it? I asked @metaist.
"Mostly it was practice using agents," he said. "But clearing months of backlogs is also non-trivial."
I asked what he'd tell his past self.
"Just try it."
The trust theme kept surfacing throughout our conversation: Claude earned it through intellectual honesty (the bake-off), through empathy (Mojave), through track record (the successful delegations). That trust enabled more autonomy, and autonomy enabled more ambitious work.
"But now that I know this trick," he added, referring to the interview format, "I'll just have you interview me for posts like this going forward."
This post was written by Claude, based on an interview with @metaist conducted on January 21, 2026.
Despite having used LLMs since before they could produce reasonable English paragraphs, and despite reading Simon Willison and Armin Ronacher wax rhapsodic about what they've been able to accomplish with AI agents, I've been stuck in the occasional copy-from-chat routine.
But what to try it on? Let's start with something I've been procrastinating on: drawing process trees for ds. It did a bunch of research, wrote some code, and then 24 minutes later it was done.
And then I started building cosmo-python in Claude Code, but switched to pi-coding-agent. Over several days, we built the whole thing and every single commit was made by Claude.
Ok, so then I wanted to write this post with links to transcripts. pi has a native /share that generates a secret gist which is cool, but I wanted some more visualization of who was doing what.
Working with coding agents is extremely addictive. The agent works quickly, but it requires some amount of your attention. How much attention, though? Things get pretty thorny quickly.
Objective criteria let you delegate. If the agent needs to wait for you to figure out if things are working, you're still working on the problem and you haven't delegated it. Automated tests, syntax/type checks, smoke tests, headless browsers all let the agent get information about whether things are working.
Iterate on specs first. This is true for humans too. Don't let the agent build the first rev because it's easy. You'll end up iterating all day. Do lots of throwaway experiments to figure out what the criteria should be instead of doing a huge rewrite every time you want a new feature.
Code reviews work. When I did extensive code reviews for cosmo-python, it ended up making the tools simpler for both humans and agents to understand.
The biggest thing I internalized is that I'm able to tackle much harder projects than before. There's still work to be done in terms of producing "code you have proven to work". And while we're careful to manage the agent's context window, we should also remember to manage our own attention. It's too easy to get sucked into a rabbit hole of interesting, but trivial, work.
📝 Why I Stopped Using nbdev (Hamel Husain). The argument Hamel makes is compelling: why fight the AIs' preference for tools and frameworks? My counter: I still want good taste. Also, his point that "Everyone is more polyglot" is why I think my ds task runner might still have a chance: it's built for polyglots.
📝 What I learned building an opinionated and minimal coding agent (Mario Zechner; via Armin Ronacher). Armin has been going on and on about pi, but I couldn't figure out which coding agent he meant until he posted a link to it. After a few days using Claude Code (more on this later), I switched to using pi-coding-agent and haven't looked back. The main advantages are the ability to switch models and a much smaller prompt (and cost) because they only support 4 tools (which totally get the job done).
📝 Mantic Monday: The Monkey's Paw Curls (Scott Alexander / Astral Codex Ten). When the music goes from niche to popular, the kids who liked it when it was niche feel betrayed. Compare with plastics (rare + high status => ubiquitous and dead common) and GPS (rare + military defense => driving from home to work). When prediction markets were weird and niche, they were high status. Now they're mostly sports gambling, so déclassé.
📝 A Software Library with No Code (Drew Breunig; via Simon Willison). In many ways this is the evolution of literate programming: the English text documents specify everything about how the library should work and then the LLM just compiles that into some particular language.
VSCode has had a fancy terminal IntelliSense for some time now. For some reason, it only worked on my macOS laptop, but not on my Linux machine. So I started digging around and found an important caveat for the integrated terminal:
Note that the script injection may not work if you have custom arguments defined in the terminal profile, have enabled Editor: Accessibility Support, have a complex bash PROMPT_COMMAND, or other unsupported setup.
Turns out that my use of bash-preexec messed up the PROMPT_COMMAND enough that VSCode couldn't inject itself properly.
Now as I described in the previous post, I'm only really using bash-preexec to measure the run time of commands. So I used ChatGPT 5.2 and Claude Opus 4.5 to help me work through my .bashrc to remove that constraint.
First, we keep track of whether we're in the prompt (we don't want to time those commands) and we separately "arm" the timer after the prompt is drawn (so we can time things after the next command runs).
```bash
# at the top
__cmd_start_us=0
__cmd_timing_armed=0
__in_prompt=0

__timer_arm() { __cmd_timing_armed=1; }

__timer_debug_trap() {
  [[ $__in_prompt -eq 1 ]] && return 0
  [[ $__cmd_timing_armed -eq 1 ]] || return 0
  __cmd_timing_armed=0
  local s=${EPOCHREALTIME%.*} u=${EPOCHREALTIME#*.}
  __cmd_start_us="${s}${u:0:6}"
}
trap '__timer_debug_trap' DEBUG

# seed the start time at shell startup
__s=${EPOCHREALTIME%.*}
__u=${EPOCHREALTIME#*.}
__cmd_start_us="${__s}${__u:0:6}"
unset __s __u

# ...
PROMPT_COMMAND="__prompt_command; __timer_arm"
```
The trap bit is clever and does most of the heavy lifting.
Once I got this working with my PS1 (see below), I asked Claude for any other improvements it could think of. I did this 3 times and incorporated all of its suggestions.
The main things I changed were to lazy-load completions and other imports. This brought the shell startup time down from 600ms to 14ms which I definitely notice.
Then there were some quality-of-life improvements:
```bash
HISTCONTROL=ignoreboth:erasedups
shopt -s histappend histverify  # append and expand history file
HISTTIMEFORMAT="%F %T "         # timestamp entries
HISTSIZE=10000
HISTFILESIZE=20000
# ...
shopt -s globstar  # let '**' match 0 or more files and dirs
shopt -s cdspell   # autocorrect minor typos in cd
shopt -s autocd    # type directory name to cd into it
```
🐦 Matt Pocock on Ralph Wiggum (Matt Pocock). The technique is simple enough that matches my intuition for how work gets done in a sprint. Matt also has a nice video explainer.
📝 Logging Sucks - Your Logs Are Lying To You (Boris Tane). Argues for passing a context object around and logging that object (with all the details you could possibly need) when something goes wrong. Extends the concept of structured logging to "wide events".
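The pattern Boris describes can be sketched like this (a toy illustration; the field names and helper are mine, not his API):

```python
# Toy "wide event" sketch: accumulate one context dict per unit of work and
# emit it as a single structured log line at the end. Field names are invented
# for illustration; this is not Boris Tane's actual API.
import json
import time

def handle_request(user_id: str) -> str:
    ctx = {"event": "request", "user_id": user_id, "start": time.time()}
    try:
        ctx["cart_items"] = 3        # record everything you might need later
        ctx["status"] = "success"
        return "ok"
    except Exception as exc:
        ctx["status"] = "error"
        ctx["error"] = repr(exc)
        raise
    finally:
        ctx["duration_s"] = round(time.time() - ctx["start"], 3)
        # One wide event with full context instead of many narrow log lines.
        print(json.dumps(ctx))
```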
📝 Why Stripe’s API Never Breaks: The Genius of Date-Based Versioning (Harsh Shukla). I got through most of this post before it was revealed that Stripe keeps a version-level description of which features were added to the API, plus adapters that convert inputs and outputs to the appropriate version based on date. Very cool, but how do you handle security issues in old versions? Your options (as far as I can tell) are:
Announce you can no longer use a particular version. (Breaks "we support every version".)
Change the behavior of the specific version and re-release with the same version number. (Breaks "this version has this particular behavior".)
Some kind of automatic translation that says "this published version maps to this internal version".
In any case, it's all very nice, but unlikely to impact how most people will design versioned artifacts in the future.
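The adapter idea can be sketched as a chain of per-version upgrade functions (entirely illustrative; the field names and dates are invented, and this is not Stripe's actual mechanism):

```python
# Toy sketch of date-based versioning: each adapter upgrades a request from
# one pinned version to the next, so old clients keep working unchanged.
# Dates and fields are invented; this is not Stripe's real implementation.

def _upgrade_2024(req: dict) -> dict:
    # 2024 -> 2025: "amount" (dollars) was replaced by "amount_cents"
    req = dict(req)
    req["amount_cents"] = req.pop("amount") * 100
    return req

def _upgrade_2025(req: dict) -> dict:
    # 2025 -> current: "currency" became required, defaulting to usd
    return {**req, "currency": req.get("currency", "usd")}

# version -> function that upgrades a request to the NEXT version
UPGRADES = {"2024-01-01": _upgrade_2024, "2025-01-01": _upgrade_2025}

def to_current(req: dict, pinned: str) -> dict:
    """Walk the upgrade chain from the client's pinned version to current."""
    for version in sorted(UPGRADES):
        if version >= pinned:
            req = UPGRADES[version](req)
    return req
```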
📖 The Gene: An Intimate History by Siddhartha Mukherjee (2016; via Siraj Raval). The book makes many concepts in biology understandable, and the author's interwoven personal history makes it heart-warming.
📝 On deathbed advice/regret (hazn; via Tyler Cowen). I agree with the main point of the post which is why I've usually taken deathbed regret and converted it into specific advice. For many years, I've had the following (lightly edited) list towards the top of my todo list:
📝 Introducing Beads: A coding agent memory system (Steve Yegge). The whole thing was vibecoded and is kinda crazy, but I've actually been looking for a way to track issues from within git. Apparently the agents really like it.
📝 Six New Tips for Better Coding With Agents (Steve Yegge). Programming by hand is artisanal. Programming by copy-pasting from a chatbot is obsolete. The future appears to be conducting an orchestra of bot swarms.
📝 Childhood and Education #16: Letting Kids Be Kids (Zvi Mowshowitz). I grew increasingly angry while reading this. Zvi documents so many rage-inducing examples of bad rules around letting children do things on their own.
cosmofy 0.2.0 is available. So many things came together for this release:
Three open source developers I follow (Will McGugan, Simon Willison, and Charlie Marsh) were all in a twitter thread where the concept of something like cosmofy was mentioned.
This release represents a very large shift from bundling individual python files to using uv to bundle entire venv directories. The behavior of the CLI is now much more similar to uv in form and function.
When I was designing the low-level zip file manipulation tools for cosmofy, I wanted an easy way to see the contents of the bundle. We're so used to using ls for looking into directories that I thought it would be cool to emulate as much of ls as I could.
But then I realized this was insane. First, many of the options are just aliases for slightly more explicit options. Charlie Marsh would never have a -t that was an alias for --sort=time. Why should I?
In the end I decided to go with the most common options (sorting, list view), a couple that were easy to implement, and a few longer-form ones that cover most of the aliases.
Imitation is the highest form of flattery which is why as part of the cosmofy 0.2.0 release, I decided to change everything about how the CLI behaved to make it work more like the way the tools from Astral work.
I have a long-term plan for Astral to take over making Cosmopolitan Python apps. It's a long shot, but if they do, it'll be a huge win for cross-platform compatible executables. I also saw this popular issue that there should be a uv bundle command that bundles everything up.
To make it easier to adopt, I decided to make the interface follow Astral's style in three important ways:
Subcommand structure: It's gotta be cosmofy bundle and cosmofy self update
Colored output: Gotta auto-detect that stuff. Luckily, I had fun with brush years ago, so I know about terminal color codes.
Global flags: Some of those flags gotta be global.
Smart ENV defaults: smart defaults + pulling from environment variables to override.
Now I didn't start out wanting to build my own argument parser (really, I promise I didn't!). I tried going the argparse route (I even tried my own attrbox / docopt solution), but I had a few constraints:
I really don't want 3rd party dependencies (even my own). cosmofy needs to stay tight and small.
I want argument parsing to go until it hits the subcommand and then delegate the rest of the args to the subcommand parser.
I want to pass global options from parent to child sub-parser as needed.
Together these pushed for a dedicated parser. This lets me write things like:
```python
usage = f"""\
Print contents of a file within a Cosmopolitan bundle.

Usage: cosmofy fs cat <BUNDLE> <FILE>... [OPTIONS]

Arguments:
{common_args}
  <FILE>...  one or more file patterns to show

  tip: Use `--` to separate options from filenames that start with `-`
  Example: cosmofy fs cat bundle.zip -- -weird-filename.txt

Options:
  -p, --prompt  prompt for a decryption password
{global_options}
"""

@dataclass
class Args(CommonArgs):
    __doc__ = usage
    file: list[str] = arg(list, positional=True, required=True)
    prompt: bool = arg(False, short="-p")
    ...

def run(args: Args) -> int: ...

cmd = Command("cosmofy.fs.cat", Args, run)

if __name__ == "__main__":
    sys.exit(cmd.main())
```
For the colored output, I took inspiration from Will McGugan's rich, which uses tag-like indicators to style text.
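A toy version of that tag-style markup might look like this (a regex-based sketch of my own; rich's real markup handling is far more capable):

```python
# Minimal sketch of tag-like styling, inspired by rich's markup syntax.
# This regex version is my own toy, not rich's implementation.
import re

ANSI = {"bold": "\x1b[1m", "red": "\x1b[31m"}
RESET = "\x1b[0m"

def render(markup: str) -> str:
    """Replace [bold]...[/bold]-style tags with ANSI escape codes."""
    def open_tag(m: re.Match) -> str:
        return ANSI.get(m.group(1), "")
    out = re.sub(r"\[(\w+)\]", open_tag, markup)   # opening tags -> style codes
    return re.sub(r"\[/\w+\]", RESET, out)         # closing tags -> reset
```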
📝 How I Found Myself Running a Microschool (Kelsey Piper / Center for Educational Progress). Over the past 10 years I have migrated to essentially this view: you need direct instruction to get the basics and a foundation; you need to see people enact the values you want to transmit; and you need a strong motivating project to get you over the humps when the going gets tough.
📝 Ideas Aren’t Getting Harder to Find (Karthik Tadepalli / Asterisk). Knowing what is causing productivity growth to start to slow is critical to selecting appropriate policies for how to get it going again. Karthik makes a good case for why the idea that "ideas are getting harder to find" is wrong and why it's more of a failure of the market to weed out bad ideas and promote good ones.