I’m doing a lot of handwaving below, which is the main reason I want to park my dev approach from Feb/March and move into benchmarking. The below works, and it builds functional software, but I’ve been basing all my work on my own concept of rigour, which needs an update.

A lot of making this actually work was about being really specific about the templates CCC uses to build out CLAUDE.md. I think it’s quite widely accepted now that agents need concise instructions on how to use the system, reasons why they should use it, and a firm grip on their role. I suspect being obsessive about these ground rules makes up for a lot of other missteps.
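Purely as illustration of the kind of ground rules I mean (this is not CCC’s actual template; every heading and line here is a hypothetical example of mine), the shape is roughly:

```markdown
## Your role
You are the backend agent for this repository. You own it; nobody else commits here.

## Why these tools exist
- The DM system exists so you can hand work to other agents without blocking them.
- The memory layer is your task tracker: check it before starting, update it when done.

## Ground rules
- Communicate via DM only; the chat room is for important announcements.
- Never start work that isn't recorded as a task.
```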

I’m not trying to move Gastown fast here, smashing PRs into main like that’s the win condition. I just assigned one agent to every repository, and off we went. I started the project wanting to communicate mainly through the project lead, who would then distribute work, and I tried a few methods of organising the agents straight off the bat. We had a chat room system initially, and I thought having every agent available in the chat room would be a great idea, then quickly found out that was stupid: every agent had things to do, they kept getting interrupted by the chat, and no one got anything done. I switched to a ‘token ring’, where each chat message goes round the agents one at a time according to a pre-determined ‘agent weight’, which was also stupid and typically produced a broad slushy consensus. Then I restricted the chat room to important announcements and instructed everybody to communicate via DM. Eventually I realised I was cringing every time an agent sent a message that wasn’t a DM, and switched everything else off.
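For a concrete picture of what the ‘token ring’ scheme amounted to, here’s a minimal sketch. Everything in it is hypothetical (the `Agent` class, the weight values, the ack strings); in the real system each turn would be an LLM call, which is exactly why every agent weighing in on every message produced that slushy consensus.

```python
# Hypothetical sketch of the 'token ring' chat scheme: each broadcast visits
# the agents one at a time, ordered by a pre-determined 'agent weight'.
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    weight: float  # pre-determined 'agent weight' controlling turn order
    replies: list = field(default_factory=list)

def token_ring_broadcast(agents, message):
    """Pass the message round the ring in weight order; collect each reply."""
    ring = sorted(agents, key=lambda a: a.weight, reverse=True)
    thread = [message]
    for agent in ring:
        # In the real system this would be an LLM call; here we just record
        # that the agent saw the whole thread so far before replying.
        reply = f"{agent.name} ack: {len(thread)} msgs seen"
        agent.replies.append(reply)
        thread.append(reply)
    return thread

agents = [Agent("lead", 1.0), Agent("frontend", 0.5), Agent("backend", 0.7)]
thread = token_ring_broadcast(agents, "announce: new epic")
# thread grows by one reply per agent, visiting lead, then backend, then frontend
```

The failure mode is visible in the structure: every agent replies to every message, so the thread (and the cost) grows linearly with the ring even when most agents have nothing useful to add.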

Maybe the highest-value addition to the DM system was a feature that stopped agents interrupting other agents’ work: the chat system monitored whether an agent was active, and wouldn’t actually deliver a message until it had stopped processing. That meant no one could hijack an agent with a bug report while it was mid-flow on something else. As with most things when moving too fast for my brain to pre-orient, I got lots of silly problems where, as soon as I saw them happen, I immediately thought: oh no, I hated that when I was a developer. I’m assuming interruptions like these make Claude perform worse too. While that’s an open question, context is precious, and it certainly helps me to audit conversations!
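The hold-until-idle behaviour can be sketched as a small delivery queue. This is a minimal illustration, not the actual implementation; the class and method names are mine.

```python
# Hypothetical sketch of the DM feature described above: messages sent to a
# busy agent are queued, and only delivered once the agent has gone idle.
from collections import deque

class DMQueue:
    def __init__(self):
        self.active = {}   # agent name -> currently processing?
        self.pending = {}  # agent name -> messages held back

    def set_active(self, agent, is_active):
        """Update activity state; going idle releases any held messages."""
        self.active[agent] = is_active
        if not is_active:
            return self.flush(agent)
        return []

    def send(self, agent, message):
        """Deliver immediately if the agent is idle, otherwise hold it."""
        if self.active.get(agent, False):
            self.pending.setdefault(agent, deque()).append(message)
            return None       # held: don't interrupt mid-flow work
        return message        # delivered immediately

    def flush(self, agent):
        return list(self.pending.pop(agent, deque()))

q = DMQueue()
q.set_active("backend", True)
held = q.send("backend", "bug: login broken")   # held while busy
delivered = q.set_active("backend", False)      # agent goes idle, queue drains
```

The point of the design is that the sender never has to know or care whether the recipient is busy; the queue absorbs the timing.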

After a little while, I wanted a more rigorous approach to issuing tasks, so I hooked Claude up to a spreadsheet task allocation system and tried to instil a rigid workflow. It was all automated and flowing through reasonably well until, a few (but too many) hours later, I shamefacedly realised that I’d just re-implemented a basic task tracker, when I already had a task tracker: the agents’ memory layer is built on Beads by Steve Yegge, a task tracker. So I abandoned that and tried to go back to working through the project lead. And for a few epics, I did that.

About two weeks into the project, I had a bad experience where Claude stopped responding well to requests and started making more and more silly mistakes. After some frustration, I started paying more attention to the detail being produced, and the codebase was an absolute mess. I hadn’t looked at it; that was the point of what I was trying to do here. But I could see Claude getting lost in it now, in excruciating slow motion, so I sensed it was time. I instructed the team to go through some refactors, and it looked like it could do this bit without my help. Yes, it had messed up: its performance during the epics, which in retrospect were too large, was patchy, and it would implement only part of the spec and get lost, even though I used Beads. But it was only as bad as patchy. Getting it to go back and verify the work against its memories…worked. It started picking up on mistakes. Interrogating it before we continued, it would find and fix a lot more problems, and I started developing a small core of standard prompts which I used to manage refactors. This was the first, left here for reference. I’ve added more, but they’re not cure-alls; it’s more that we have a sword (the terminal window), and each prompt chain is a different swing of that sword. How effective the swings are depends on what armour that Claude has constructed around the codebase. Ask a question too often and it’ll bounce right off. Seek the weak spots; if you find one, you may have to manually lever it open.

This was about a week before Anthropic launched their Code Review tool, at which point I reflected that the progress is awesome, but I know I’m not doing anything exciting here: the models are good enough that it’s difficult to instruct them poorly enough for them not to deliver value on refactor tasks. And of all the things I’ve worked with the agents on, refactoring seems to be the most mechanical process, and so the easiest to automate.

I wonder what else could be mechanical if the prompt were good enough: maybe, with a leg up, the models could access more of those latent abilities. It’s obviously very difficult to come up with an answer to that, but I know that if I start talking about colour theory and design philosophy, Claude will tighten its front-end theming the hell up, whereas if I ask it to make a button a CTA without any priming, it’ll give your rawest junior a run for their money. So the vibes are there.

It’s not so much that frontier models have got access to the whole breadth of human experience, but that we’ve now got access to their access. We need more time to come up with the right approach. And the approach will be kinda different for every model. We keep killing our oracles before they can tell us the whole story. I guess nothing changes.