The Promise
Yesterday I promised my children I’d make them pizza. The reason I made that promise was slightly ridiculous: I’d bought a small pizza for myself - ham and mushroom - and neither of them really likes mushroom. One of them doesn’t even like ham. So while it was in the oven we sat there on the kitchen floor and I told them, yeah, I’ll make you pizza tomorrow. And as soon as I had said it, I realised that I’d actually been promising to bake something or other for them for weeks. Bagels, focaccia, pizza. I promised all three, and I hadn’t delivered anything.
I’m not that guy. I’m the guy who goes to Italy and takes several shots of the focaccia from Revello in Camogli from different angles so I can have objective reference points to aim for with regard to crumb, crust colour and depth. I then go through rounds of refinement at home until my wife gently reminds me that other food exists. So I love all that stuff, but I’ve been distracted.
The Project
I got up this morning and started working with Claude. I’d say “fighting with Claude,” but working with Claude is genuinely a nice process; it’s just that today what I wanted to do kept getting more complex.
The project is a little chess variant I made ages ago that I want to get online. Every few months I’ve used it as an experiment to see how far AI coding has come. I set myself a rule: can I do any of this without writing any of the code myself? How far can I get before it falls apart? Can I fix problems using agents alone, or will I have to go in myself?
GPT-3.5 - got me a single page that looked like it worked, but I had to do a lot of hand-stitching.
Replit (a few months later) - got me a working page with valid AI; this was the first version worth showing people. They didn’t care. (Nor should they.)
Sonnet 4 (last year) - increasingly complex implementation, but was struggling with AI move speed (10 seconds per move at depth 3)
Opus 4.6 (now) - I…erm…just keep throwing things at it. It’s been 3 weeks so far. I got Claude to compile a report from the git commits; the report actually makes sense because I only made one commit to one repo myself (with the commit message ‘JSON!’). It’s here with a summary by me. Things have gotten out of hand. I started off by wondering whether this would work and now I’m consumed by the idea of building a platform entirely for me on a whim.
My impression of progress has been a very straight line with only two variations. My methodology is haphazard and constantly evolving but if I were trying to sound employable I would call it Question Driven Development, because we truly are in the land of ‘Choose Your Own Adventure’.
Today’s Problem
Today I was wrestling with a clever idea: I wanted to do two things at once. The underlying problem was straightforward. We had a tool which provisioned a small number of bots to provide opponents for any human players who happened to come along, and we had an Elo-based ranking system for the humans. But I only have five difficulty settings of AI. I believe most human players would find it difficult to get much out of anything beyond maybe the third level of the five, and the first is just a numpty. So, to expand my effective range of two AIs, I thought I could take that third level and create 1000 minor variations on it, rank all of those, and then, using the lessons from that, extend the process to the other four levels. After pruning and sampling I might end up with ~200 different AI configurations spanning the strength of likely human players (i.e. me), all with stable Elo ratings - so as I get better I don’t hit a wall when the bots suddenly become impossibly good. I want the hero’s journey! I want to rise through the ranks and up the ladder into the lower-middle end of the mediocrities as though this were a real game with actual players.
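The "prune, then sample" step can be sketched roughly like this. It is an illustration only, not the code from the project: the function name `sample_ladder` and the assumption that ratings arrive as a plain name-to-Elo dictionary are both made up for the example. The idea is to pick bots whose ratings sit close to evenly spaced targets, so there is no sudden wall between neighbouring opponents.

```python
def sample_ladder(ratings, n):
    """Pick up to n bot names whose ratings are spread roughly evenly.

    ratings: dict mapping a (hypothetical) bot name to its Elo rating.
    Returns the chosen names ordered from weakest to strongest.
    """
    if n <= 0 or not ratings:
        return []
    ordered = sorted(ratings.items(), key=lambda kv: kv[1])
    if n >= len(ordered):
        return [name for name, _ in ordered]
    if n == 1:
        # With a single slot, take the bot in the middle of the field.
        return [ordered[len(ordered) // 2][0]]

    lowest, highest = ordered[0][1], ordered[-1][1]
    step = (highest - lowest) / (n - 1)

    remaining = list(ordered)
    chosen = []
    for i in range(n):
        target = lowest + i * step
        # Closest still-unused bot to this target rating.
        best = min(remaining, key=lambda kv: abs(kv[1] - target))
        remaining.remove(best)
        chosen.append(best)

    chosen.sort(key=lambda kv: kv[1])
    return [name for name, _ in chosen]
```

Spacing by rating rather than by rank matters here: a thousand near-identical variants of one level will cluster tightly, and sampling by rank would mostly return that cluster.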
It turned out that getting a swarm of a thousand subtly different bots to connect to the dev server on the VM simultaneously - which I thought would be good to both get Elo rankings and to stress-test the server - …worked fine. I was up and running within about half an hour, but then I looked at the CPU usage on the host machine and it was pathetically light, so I thought “no, we can do better.” And we could. The fans spun up. But I still thought we were leaving a little performance on the table. So instead of going to make the pizza dough I coaxed Claude to attempt one more optimization.
This isn’t mad. During the initial AI training runs, I’d started them as ~12 hour runs, then started prompting to improve the system for future runs. We’d do it, then 20 mins later it’s a 9 hour run, so we cancel and start again; cancel again 30 mins later, and now it’s a 6 hour run. All the while I was also prompting to improve my agent harness. Gotta keep feeding the beast.
Well, the server started crashing whenever I launched the bot scheduler on my host. Not just the webserver, the entire VM was locking up. Three hours later my wife asked me if I had a plan for supper.
The Pizza (Again)
I got so absorbed in talking to Claude that I forgot about the pizza entirely. By the time I surfaced, it was too late to do it properly, and so I was in a hurry.
What I normally do is activate the yeast first - put it in water of about 20°C then check for a reaction - just to confirm it’s working before doing all the work. This time I just dumped it straight onto the flour because I wasn’t paying attention. Only after I’d got my water from the tap did I realise I needed to fish some yeast back out, and kinda got half of it along with some flour, and stirred it into the water industriously. At this point I realised I hadn’t even measured the water yet.
I have spreadsheets for this stuff. I don’t get these things wrong.
And it matters. I was doing a small batch and I was using 2.6g of yeast that day. It really does matter if you’re a gram or two out at that scale, especially if you’re aiming for a specific supper time on a schoolday. It could be the difference between the dough taking four hours or eight hours to be ready.
I hurriedly poured out this yeasted water into a measuring jug hoping I was under-weight, but it didn’t look like it. The implication was that I had flour with an unknown quantity of yeast, and too much water with an unknown quantity of yeast.
Last drop goes in, and I’m there. To the gram. First thing I thought was ‘well, that’s +2g of yeasty flour actually, so not quite’, but I was smiling. Bread’s an inexact science, not an exact one. I was fine. Then I sort of panicked.
Gotta slow down. This isn’t me.
Then I went back to the keyboard. Maybe I should call it the crackboard now. I did not slow down. In fact, that’s about the time I started to make mistakes.
The Bigger Thing
And herein lies my problem: these tools are too engaging to me. I find it genuinely difficult to cope with the ordinary world while using them.
Having lived what feels like the last decade at 1/10 speed — doing things by hand, slowly — I feel a constant pressure now to move, to build, to act. I’m multiplied! Because now I can build things I never would have thought of building before. I can create things I actually want to make, because there’s finally time for it.
The distressing part is how little capacity that leaves for everything else. Including apparently my first and truest love, pizza.
End of Day – Post-Mortem
First, this is not how you do it. If you’re trying to measure the relative strength of two AIs in a chess-like game, you run an SPRT (sequential probability ratio test). As the games accumulate, often into the thousands, it weighs the hypothesis that your candidate AI is stronger than your baseline by a given Elo margin against the hypothesis that it isn’t, and it stops once it can pick one within error rates you set in advance. The leaderboard approach I was planning - bots playing each other for a fixed number of rounds - was statistically meaningless.
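For the curious, here is a minimal sketch of the idea, under loud assumptions: it is the textbook Wald test on decisive games only (draws are thrown away, which real engine-testing tools do not do), and the defaults of `elo0=0`, `elo1=10` and 5% error rates are illustrative values, not the settings I used. The function names are invented for the example.

```python
import math


def expected_score(elo_diff):
    """Expected score for a player who is elo_diff points stronger (standard Elo curve)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def sprt(wins, losses, elo0=0.0, elo1=10.0, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test on decisive games.

    H0: the candidate is elo0 points stronger than the baseline.
    H1: the candidate is elo1 points stronger than the baseline.
    alpha and beta are the accepted false-positive and false-negative rates.
    Draws are ignored in this simplified version (an assumption).
    """
    p0 = expected_score(elo0)
    p1 = expected_score(elo1)

    # Log-likelihood ratio of the observed results under H1 versus H0.
    llr = wins * math.log(p1 / p0) + losses * math.log((1 - p1) / (1 - p0))

    lower = math.log(beta / (1 - alpha))   # at or below this: accept H0
    upper = math.log((1 - beta) / alpha)   # at or above this: accept H1

    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"
```

You call it after every batch of games and keep playing while it says "continue". That is the whole point: the test decides when you have enough games, rather than you picking a number of rounds up front.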
So why did I do it? Because I thought it would be fun. I wanted to throw some load at this thing I’ve been developing and see what broke.
What broke was that the network connection to the VM was flaky - I was using port forwarding rather than network bridging, and apparently on QEMU virtual machines it’s a known failure point if you hit the network too hard. Claude had so many fun ideas about what it was.
But the process was more frustrating than it needed to be only because I was trying to play my way through the problem rather than analyse my way through it. Claude got lost because I was asking it to do two things at once and, as I explained them, they were contradictory. It switched from being a cool analytical machine into something more like a hot-headed junior developer desperately grasping at straws - going off and “fixing” things without asking questions, asserting causes without evidence. I had to keep asking: have you verified that, or are you guessing? What does your logging actually show? We went around that loop for a couple of hours in the afternoon. Which is, honestly, nothing. I’d work at bugs for days before, going only backwards, but now a two-hour delay feels…catastrophic. Who even am I, feeling that way?
The efficient approach would have been to write two separate epics from the start: one to produce a dummy production load for stress-testing the server, and one to produce a rigorous tournament - a round robin, thousands of games, each matchup verified properly. Instead I produced a pretty-looking leaderboard and the appearance of information.
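The tournament half of that can be sketched in a few lines. This is an assumption-laden outline rather than my actual harness: `play_game` is a hypothetical callable that plays one game and returns 1.0, 0.5 or 0.0 from the first player’s point of view, and `games_per_pairing` would need to be far larger than 2 for the results to mean anything.

```python
from itertools import combinations


def round_robin(bots, play_game, games_per_pairing=2):
    """Play every bot against every other bot and return total scores.

    bots: list of bot identifiers.
    play_game: hypothetical callable (first, second) -> 1.0 / 0.5 / 0.0,
               scored from the first-named player's perspective.
    """
    scores = {bot: 0.0 for bot in bots}
    for a, b in combinations(bots, 2):
        for game in range(games_per_pairing):
            # Alternate who moves first so neither side keeps the advantage.
            first, second = (a, b) if game % 2 == 0 else (b, a)
            result = play_game(first, second)
            scores[first] += result
            scores[second] += 1.0 - result
    return scores
```

Keeping this separate from the load generator is the design point: the stress test wants as many simultaneous connections as possible, while the tournament wants every pairing played the same number of times.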
I’ve since gone back and done both the right way. Lesson learned: have fun, but don’t have so much of it that you confuse the agent back to 2024.