FOD#102: Do Reasoning Models Think Too Much?
plus a new video format: Three WOWs and One Promising Release from the last week
This Week in Turing Post:
Wednesday, AI 101: we discuss BERT and an entire ecosystem of variants that it inspired
Friday, Interview: Insights from Devvret Rishi and Predibase
Our schedule was disrupted by Memorial Day, which the United States celebrates on the last Monday of May. So today’s FOD (which usually goes out on Monday) will be shorter, and we’re trying a new format:
Reading AI news can feel like wading through a swamp of hype and hypotheticals. What’s actually working? What’s real? That’s the question that sparked Three WOWs and One Promise – my weekly roundup of three breakthroughs that genuinely impressed me (after plowing through hundreds of AI newsletters) and one release that’s full of promise.
The idea came from Kevin Scott, Microsoft’s CTO. He once talked about “Capabilities Overhang” – the huge gap between what AI could do today and what we’ve actually built into products. That’s the heart of this video: to spotlight what AI is already doing right now, in the real world.
So: watch it, comment, and smash that Subscribe button. Let’s get the word out – AI isn’t some distant sci-fi future. It’s already here, and it’s reshaping our lives in ways worth celebrating.
(Also, how cool would it be if my four sons told their friends their mom’s a famous YouTuber?! Do subscribe ;)
To the main topic: Do Reasoning Models Think Too Much?
The efficiency arms race begins
As reasoning becomes the prized capability of modern LLMs, a new generation of papers is asking a surprisingly human question: Can these models learn when to stop thinking?
Just last week we saw a flurry of proposals – Thinkless, AdaptThink, ASRR, and Self-Braking Tuning (all the links are in the ‘The freshest Research Papers’ section) – all converging on a shared concern: reasoning is expensive, and most tasks don’t require a 500-token chain of thought. These frameworks teach models to self-regulate, either by toggling between reasoning depths or by suppressing redundant steps altogether.
Their approaches vary – from reinforcement learning with control tokens (Thinkless, AdaptThink) to identifying and trimming overthinking through internal feedback loops (ASRR, SBT). But the goal is the same: maximize inference efficiency while preserving or even enhancing accuracy.
Yet as they chase similar gains, these papers also highlight the limits of incrementalism. Their technical distinctions – while clever – blur in application. In the quest to tame overthinking, we may be seeing less of a creative divergence and more of a convergence toward a standard toolkit: dynamic thinking, token budgets, and adaptive control.
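To make that toolkit concrete, here is a minimal sketch of the shared pattern – a control-token gate plus a hard token budget and a crude redundancy brake. This is illustrative Python with hypothetical names (`generate`, `THINK`, `NO_THINK`), not code from any of the papers: Thinkless and AdaptThink learn the gate with reinforcement learning, and ASRR/SBT learn the braking signal, rather than hand-coding either.

```python
# Sketch of the common "efficiency toolkit" for reasoning models:
# 1) a control-token gate deciding whether to think at all,
# 2) a hard token budget capping the chain of thought,
# 3) a redundancy check that brakes overthinking early.
# All names here are hypothetical, for illustration only.

from typing import Callable

THINK, NO_THINK = "<think>", "<no_think>"

def answer(prompt: str,
           generate: Callable[[str, int], str],  # (text, max_tokens) -> text
           think_budget: int = 512,
           direct_budget: int = 64) -> str:
    """Route a query to long or short reasoning via a control token."""
    # Gate: the model emits a mode token. In Thinkless/AdaptThink this
    # decision is trained with RL so easy queries pick NO_THINK.
    mode = generate(prompt + "\nMode:", 1).strip()

    if mode == NO_THINK:
        # Easy query: answer directly under a small budget.
        return generate(prompt, direct_budget)

    # Hard query: allow a chain of thought, capped by a token budget
    # so the model cannot overthink indefinitely.
    trace = generate(prompt + f"\n{THINK}", think_budget)

    # Braking (ASRR/SBT-flavored): if the trace starts repeating itself,
    # stop and keep only the part before the redundancy.
    seen, kept = set(), []
    for line in trace.splitlines():
        if line in seen:   # crude redundancy signal
            break          # "self-braking": stop re-deriving
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)
```

The real systems learn both decisions end to end; the sketch just shows that all four papers turn on the same two dials: whether to think, and for how long.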
It raises a larger question: once we've optimized when to think, what happens next? Perhaps the next frontier isn't efficiency, but purpose – not how many steps a model takes, but why it takes them. Until then, these papers mark a collective step toward making reasoning models not only smarter, but more self-aware.
We recommend:
Swyx coined the term “AI engineer,” and now he’s running the best conferences for AI engineers and practitioners. I’ll be there. San Francisco, June 3-5. Let’s meet up – especially since I’ve got a 30% discount code for you. Register here; the lineup is amazing (and that’s just the keynotes) → Discount code: THANKSKSENIA
Curated Collections
Our Deep Dive on JEPA is one of our most popular articles. This list is a great addition to keep learning about the architecture → Click to read
Follow us on 🎥 YouTube Twitter Hugging Face 🤗
We are reading/watching
Reasoning and the Capability Overhang by Will Schenk
Busting Unions with AI: How Amazon Uses AI to Crush Labor Movements by Devansh
The Intimacy Dividend: How AI Might Transform News Media Consumption by Shuwei Fang
OpenAI has an unsubtle communications strategy by Dave Karpf
How Does Claude 4 Think? by Dwarkesh Patel with Sholto Douglas & Trenton Bricken
Artificial Intelligence Implementation Plan by US Marine Corps (yep)
Models to play with
Models we find particularly interesting are marked with 🌟
🌟🌟 BAGEL is an open-source foundation model trained on diverse interleaved multimodal data, outperforming peers in reasoning, manipulation, and understanding → read the paper (disclaimer: I haven’t played with it yet but it looks incredibly interesting)
🌟 Claude Opus 4 & Sonnet 4 by Anthropic introduce extended thinking and hybrid modes that allow parallel tool use, memory retention via local files, and state-of-the-art results on SWE-bench and agent workflows → read more
🌟 Claude Code by Anthropic is now GA with IDE integrations, background GitHub tasks, and a full SDK for custom agents. It extends Claude’s capabilities into hands-on dev tooling → read more
🌟 Gemma 3n by Google introduces a mobile-first, multimodal model designed for local inference with a 4B memory footprint and dynamic submodel creation for latency-quality tradeoffs → read more
Reward Reasoning Model by Microsoft Research and Tsinghua University proposes chain-of-thought reward modeling with test-time compute adaptation, enabling better alignment through self-evolved reasoning → read the paper
🌟 R3: Robust Rubric-Agnostic Reward Models introduces interpretable, generalizable reward modeling without fixed rubrics, improving alignment flexibility and transparency → read the paper
Panda is a model pretrained on synthetic chaotic systems that generalizes to real-world dynamics, even predicting PDEs with no retraining → read the paper
AceReason-Nemotron by Nvidia demonstrates that large-scale RL can outperform distillation in reasoning for both math and code, using curriculum-style training → read the paper
🌟 Neurosymbolic Diffusion Models improve symbolic reasoning accuracy by modeling dependencies through discrete diffusion, achieving better calibration and generalization → read the paper
MMaDA combines diffusion-based reasoning with unified chain-of-thought fine-tuning and a new RL algorithm (UniGRPO), outperforming SDXL and LLaMA-3 in multiple tasks → read the paper
UniVG-R1 reinforces visual grounding with CoT and difficulty-aware reinforcement learning, achieving top scores on multiple video/image grounding tasks → read the paper
Web-Shepherd introduces a step-level reward model for web navigation, significantly improving trajectory evaluation accuracy and cost-efficiency → read the paper
🌟 Toto by Datadog is a decoder-only foundation model with 151 million parameters for time series forecasting on observability metrics → read the paper