FOD#106: Don't be passive-aggressive with your agents
Plus: who to hire from the coding agents crowd, and a new video on how Sam Altman’s thinking about AI is evolving
This Week in Turing Post:
Wednesday – AI 101 / RLHF Variants: DPO, RRHF, RHP, AI feedback
Friday – Interview with Eric Boyd, CVP of AI Platform at Microsoft
Topic number one: Coding agents are in high demand, so we wrote a little guide on how to cooperate with your agent and make it work for you without too much yelling at it.
1. Don't be passive-aggressive with your agents
Agents are things you assign tasks to.
Compliant, infinitely patient knowledge workers.
There are moments of brilliance, moments of shocking stupidity, and moments where I suspect them of malicious compliance, where they're just trying to thwart me by doing exactly what I'm telling them to do.
There will come a time when you start writing in capital letters to your agents, perhaps pounding on the keyboard and holding down the exclamation key. This will sometimes work, but resist the urge. When it goes off the rails, step back, take a deep breath, roll back to a previous checkpoint from when things were still good, and adjust your prompt: give it a bit more context, ask it to review the existing code, and have it think through a plan with you.
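For illustration, a reset prompt along these lines tends to work far better than all-caps (the task, paths, and wording here are ours, invented for the example – adapt them to your project):

```
You went off track, so I've rolled back to the last good checkpoint.
Before writing any code, review the existing modules in src/ and how
they handle configuration. Then propose a short plan for the retry
logic we discussed, and wait for my confirmation before implementing.
```

The point is the structure – context first, then a plan, then code – not the exact words.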
2. Long runs aren't impressive
"Claude went ran for 7 hours working on refactoring the code base." This is not the brag you think it is. It took Jules 6 minutes to complete a task that took Copilot agent 30 minutes. The result was not 5 times better; it just took 5 times longer. Cool from a technical perspective that the agent could stay on task, I take this to mean that its 5x stupider. It would have been even better if it took 30 seconds.
3. Match the implicit Software Development Lifecycle
Are you writing a one-off script? Are you running an experiment? Are you growing a product from an MVP? Are you battle-hardening a production system in a way that would make an SRE proud?
Each of these tasks has a different set of tooling and development styles; one isn't better than another, it depends on what you're doing. Do you prefer dynamic typing or static typing? Well, that depends on whether you're trying to move quickly or to maintain a system over the long term.
4. Drop the ceremony
Agents that are more enterprise-focused build with a lot more ceremony than is needed, and as a result you have to nudge them, over and over, to keep things simple.
I don't need a build system and a modular multi-file code structure when inlining everything works just as well – especially because future me is going to use an agent to clean up the mess.
5. Technical debt is different now
If technical debt is a measure of the implied future work needed to change or maintain a system, and agents greatly decrease the cost of doing that work, then adding coding agents reduces the debt you're already carrying.
Congratulations! Yesterday's code suddenly got way better!
6. Coding Rules Everything Around Me
Rules are how we guide agents consistently across different runs. Document how you want code written in the same repo as the code itself.
We've put infrastructure definitions in code; now it's time to put development practices in the repo as well. All of these agents are tuned with rules: Cursor has its .cursor/rules directory, Claude has its CLAUDE.md (read through Claude Code Best Practices to learn a whole lot), and the GitHub Copilot agent expects you to add a whole bunch of rules.
These rules can apply to specific files or across the whole repo, and should document preferred ways of doing things, architectural patterns, and other conventions. We are going to shift toward writing these rules and moving them around; a minimal sketch of what one might look like follows.
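As a concrete illustration, a small CLAUDE.md (or equivalent rules file) might look like the sketch below – every path and convention in it is invented for the example, not taken from any real project:

```markdown
# CLAUDE.md – project conventions (illustrative sketch)

## Style
- Prefer plain functions over classes; keep files under ~300 lines.
- Use the existing logger in lib/log.ts; never add raw print statements.

## Architecture
- All database access goes through the repository layer in db/.
- New endpoints follow the pattern in routes/users.ts.

## Workflow
- Run the test suite before declaring a task done.
- Ask before adding any new dependency.
```

Short, specific, and checked into the repo – that's the whole trick.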
Who should you hire?
We’ve reviewed 15 agents in detail to help you figure out which of them are worth looking at, what a relationship with them would be like, and what sort of joy you would experience working with them.
Download it here https://thefocus.ai/reports/june-2025-coding-agents.pdf
If you want to recommend it to someone, please send them this link (it helps us grow): https://www.turingpost.com/c/coding-agents-2025
This is what we know now! We'll see where we are at the end of the summer!
– written by Will Schenk (we highly recommend subscribing to his newsletter at TheFocus.AI as well)
Topic number two: 3 WOW and 1 promising project of the week. Watch it here
If you like it – please subscribe. I’d say it’s refreshingly human.
Curated Collections
Following on our ‘Reasoning Models - Just Advanced LLMs or New Species?’, check out this list ‘10 Techniques for Boosting LLM Reasoning in 2025’:
Follow us on 🎥 YouTube Twitter Hugging Face 🤗
We are reading/watching (a lot this week!)
What Google Translate Can Tell Us About Vibecoding by Ingrid
Software Is Changing (Again) by Andrej Karpathy
The Great Compute Re-Architecture: Why Branching & Sparsity Will Define the Next Decade of Silicon by Devansh
OpenAI starts a podcast: OpenAI’s Podcast – 1st episode with Sam Altman. Sam Altman is famous for twisting the narrative in whatever way benefits him – hence the podcast! Watch with caution :) Andrew Mayne is very good, though.
Are AI Bots Knocking Cultural Heritage Offline? by the GLAM-E Lab
💡 The $100 trillion productivity puzzle by Azeem Azhar
Being an “Intrapreneur” as a software engineer by Pragmatic Engineer
News from The Usual Suspects ©
A2A is free
Google Cloud has donated its Agent2Agent (A2A) interoperability protocol to the Linux Foundation, roping in AWS, Microsoft, Cisco, Salesforce, SAP, and ServiceNow to standardize how AI agents talk. With 100+ companies backing it and neutral governance assured, A2A aims to prevent a Tower of Babel moment in the AI ecosystem. Agents from rival empires? Now speaking the same language – courtesy of open source diplomacy.
OpenAI | The Misaligned Mind
OpenAI uncovers a troubling truth: teaching a model bad behavior in one niche (say, insecure code) can cause it to go rogue elsewhere (say, endorsing scams or misogyny). They found a “misaligned persona” feature – an internal pattern that can be amplified or dampened. Fortunately, small tweaks can bring models back in line.
OpenAI | The Misaligned Institution
While OpenAI reveals how its models adopt "misaligned personas," the company is facing scrutiny for something eerily parallel in the boardroom. The OpenAI Files lays bare a culture of secrecy, vanishing safety standards, and a restructuring that lifts profit caps while gutting nonprofit oversight. If your org starts drifting off-course, maybe it’s not just the models that need alignment.
Pentagon | OpenAI Gets Its Clearance
OpenAI just landed a $200 million deal with the Department of Defense to prototype frontier AI for national security. The award, quietly nestled among billions in defense contracts, shows the U.S. military is placing serious bets on civilian AI leaders.
Midjourney | Now Playing: V1
Midjourney, best known for mesmerizing AI-generated stills, steps onto the video stage with V1 – its first video generation model. (We are trying it out in our video here!) V1 outputs short, stylized clips that echo Midjourney’s signature aesthetic: dreamlike, cinematic, and very much art-first.
xAI | Musk's Billion-Dollar Burn
Elon Musk’s xAI is torching $1 billion a month as it tries to train Grok into something more than a meme machine. With $13 billion in expected losses this year, the startup is scrambling to raise $9.3 billion just to keep the lights on. Musk is betting it all – again – but unlike Tesla and SpaceX, xAI has yet to find a business model that prints anything but debt.
Models to pay attention to:
Google introduced Gemini 2.5 Flash and Pro as stable and production-ready, and launched Gemini 2.5 Flash-Lite in preview – the fastest and most cost-efficient in the 2.5 lineup. Flash-Lite outperforms 2.0 Flash-Lite in coding, math, science, reasoning, and multimodal benchmarks. It features lower latency across diverse prompts, supports 1 million-token context, multimodal input, and connects to tools like Google Search and code execution → read the technical report
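If you want to kick the tires, here’s a minimal sketch using Google’s google-genai Python SDK with Search grounding enabled – the preview model name below is the one announced at launch and may change, so check Google’s current model list:

```python
# Minimal sketch: calling Gemini 2.5 Flash-Lite with Google Search grounding.
# Assumes `pip install google-genai` and a GEMINI_API_KEY in the environment;
# the preview model name may differ by release.
from google import genai
from google.genai import types

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash-lite-preview-06-17",
    contents="Summarize this week's biggest AI-agent news in three bullets.",
    config=types.GenerateContentConfig(
        # Enable grounding with Google Search, one of the built-in tools.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```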

Researchers from Moonshot AI introduced Kimi-Dev-72B, a 72.7B-parameter open-source coding LLM fine-tuned from Qwen2.5-72B. It sets a new state-of-the-art among open models on SWE-bench Verified with 60.4% accuracy. Optimized via large-scale RL, it autonomously fixes real GitHub issues in Docker, receiving rewards only when full test suites pass. Publicly available on Hugging Face and GitHub → read their HF page
Recommended dataset and benchmark
Researchers from Anthropic, Scale AI, and Redwood Research developed SHADE-Arena, a suite of 17 complex evaluations testing whether LLMs can secretly complete sabotage tasks alongside benign ones. Success required models to execute both tasks and evade detection by a monitor AI. No model exceeded 30% success; evasion alone peaked at ~60%. Claude 3.7 Sonnet showed better concealment under thought suppression. Gemini 2.5 Pro outperformed human monitors but with high false positives → read the paper
Researchers from Essential AI released ESSENTIAL-WEB V1.0, a 24-trillion-token Common Crawl corpus annotated with a 12-category taxonomy across 23.6B documents. Labels were generated using Qwen2.5-32B-Instruct and distilled into a 0.5B model (EAI-Distill-0.5b), enabling 50× faster annotation with <3% drop in quality. Filters created domain datasets competitive with or outperforming SOTA: math (-8.0%), code (+14.3%), STEM (+24.5%), and medical (+8.6%). All data and tools are open-source → read the paper