FOD#97: What is the Era of Experience? Reinforcement Learning in Full Bloom
we discuss what to expect, provide opinions about OpenAI's o3, and share our curated collection of the most relevant models and papers of the week
This Week in Turing Post:
Wednesday, AI 101, Technique: MoE 2.0
Friday, Agentic Workflow: I took a pause last Friday because I wanted to do a really comprehensive deep dive into A2A and create a guide like the one I did for MCP. I now have almost all the pieces, and you will receive it this week.
Our news digest is always free. Upgrade to receive our deep dives in full, directly into your inbox. If you want to support us without getting a subscription – do it here.
Richard S. Sutton (godfather of Reinforcement Learning; Turing Award, 2024) and David Silver (led research on RL with AlphaGo, AlphaZero, and co-lead on AlphaStar) have spent decades teaching machines to learn by doing. Even with Deep Learning overshadowing RL, they kept crunching. Now RL rides a perfect storm of compute, simulation, deep-net representation, accessible frameworks, and marquee wins in both research and production. Once the toolkit matured and language-model teams started shipping RL-tuned products to millions of users, popularity followed naturally. Silver and Sutton knew it. So when the duo call the next phase of AI “the Era of Experience,” it’s worth pausing whatever prompt you were polishing and listening up. As they put it, “AI is at the cusp of a new period in which experience will become the dominant medium of improvement and ultimately dwarf the scale of human data used in today’s systems.”
Not human data, but AI experience becomes the new oil. What does it mean for us?
From “Era of Human Data” to “Era of Experience”
LLMs grew up gorging on the internet’s media feast, fine‑tuned by well‑meaning crowd‑workers. That pipeline squeezed out dazzling breadth, but – even at trillion‑token scale – hit a ceiling in the places we most crave superhuman help: novel proofs, scientific breakthroughs, new materials.
Memorizing human data only gets you as far as we’ve already gone.
Silver and Sutton argue the ceiling is structural. Their fix: move the optimization loop from Stack Exchange to the real (or richly simulated) world, where rewards are grounded in consequences, not click‑worker opinions. Back in FOD#9 (my god, it was July 2023, the beginning of Turing Post), I wrote: “Reinforcement learning is back – and we find ourselves with zero understanding of what to expect.” The piece traced AlphaDev’s surprise discovery of a leaner sorting routine and Demis Hassabis’s promise that Gemini would weave AlphaGo‑style planning into language models. Those “unpredictable moves” were early tremors of what Silver & Sutton now call the Era of Experience: agents that learn by acting, observing, and iterating, instead of reciting history. To define this new Era, the authors sketch four pillars: 1) streams of lifelong experience, 2) sensorimotor actions, 3) grounded rewards, and 4) non‑human modes of reasoning.
With those well-defined pillars in place, we can start to understand what to expect.
So What Actually Changes?
The shift isn’t philosophical; it’s more of a pipeline change.
Memory Infrastructure – Agents won’t train once and freeze. They'll accumulate episodic memory across weeks, months, years. But an agent that never forgets also needs to curate its past. Expect a rise in memory distillers that compress, tag, and rehearse high-signal moments.
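A minimal sketch of what such a distiller might look like – assuming a purely hypothetical `Episode`/`MemoryDistiller` API, nothing from the essay itself – scores each moment, keeps the top-k, and compresses everything else down to tags:

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Episode:
    signal: float                     # how surprising/valuable the moment was
    summary: str = field(compare=False)
    tags: tuple = field(compare=False, default=())

class MemoryDistiller:
    """Keep only the highest-signal episodes; compress the rest to coarse tags."""
    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self._heap: list[Episode] = []        # min-heap keyed on signal
        self.archive_tags: set[str] = set()   # residue of forgotten episodes

    def add(self, episode: Episode) -> None:
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, episode)
        elif episode.signal > self._heap[0].signal:
            dropped = heapq.heapreplace(self._heap, episode)
            self.archive_tags.update(dropped.tags)   # "compress" what we forget
        else:
            self.archive_tags.update(episode.tags)

    def rehearse(self, k: int = 5) -> list[Episode]:
        """Return the k highest-signal memories for replay or fine-tuning."""
        return heapq.nlargest(k, self._heap)
```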
Reward as UX – When feedback moves from thumb emojis to signals like blood pressure, tensile strength, or error bars, prompt engineering morphs into reward design. It’s product work now, blending interface intuition with real-world stakes.
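To make “reward design is product work” concrete, here is a toy sketch, assuming hypothetical signal names and hand-picked weights (illustrative only, not from the essay):

```python
def grounded_reward(measurements: dict[str, float],
                    weights: dict[str, float],
                    targets: dict[str, float]) -> float:
    """Score an outcome by how close measured signals land to their targets."""
    reward = 0.0
    for name, weight in weights.items():
        gap = abs(measurements[name] - targets[name])
        reward -= weight * gap          # penalize distance from the target
    return reward

# The "UX" part: a designer tunes weights and targets the way they once tuned prompts.
r = grounded_reward(
    measurements={"tensile_strength_mpa": 512.0, "error_bar": 0.03},
    weights={"tensile_strength_mpa": 0.01, "error_bar": 10.0},
    targets={"tensile_strength_mpa": 550.0, "error_bar": 0.01},
)
```

The interesting design question is no longer “what do I say to the model?” but “which signals count, what are the targets, and how heavily does each one weigh?”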
Simulators as Pretraining – Before agents mess with climate policy or bioengineering, they’ll train in synthetic worlds. Something like SimCity-for-science – a differentiable chemistry lab that updates itself as agents experiment and fail forward. The simulator gets better as the agent gets smarter. That feedback loop becomes a moat. In fields like materials, climate, or biology – where real-world iteration is slow – this could outpace even the biggest model trained on yesterday’s data.
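A deliberately tiny sketch of that agent–simulator feedback loop, with a made-up yield curve standing in for the lab and a single bias term standing in for model error (nothing here reflects a real system):

```python
import random

class ToyChemSim:
    """Stand-in for a differentiable chemistry lab: maps an action to a yield."""
    def __init__(self, bias: float = 0.0):
        self.bias = bias                      # model error vs. the real process

    def run(self, temperature: float) -> float:
        return max(0.0, 1.0 - (temperature - 0.6) ** 2) + self.bias

    def calibrate(self, residual: float) -> None:
        self.bias += 0.5 * residual           # simulator improves as the agent fails forward

def pretrain_then_refine(sim: ToyChemSim, real_process, rounds: int = 20) -> float:
    best_t, best_yield = 0.0, -1.0
    for _ in range(rounds):
        # 1) Cheap exploration happens in the simulator.
        candidates = [random.random() for _ in range(100)]
        t = max(candidates, key=sim.run)
        # 2) An occasional expensive "real" experiment grounds the simulator.
        real_yield = real_process(t)
        sim.calibrate(real_yield - sim.run(t))
        if real_yield > best_yield:
            best_t, best_yield = t, real_yield
    return best_t

best = pretrain_then_refine(ToyChemSim(),
                            real_process=lambda t: max(0.0, 1.0 - (t - 0.65) ** 2))
```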
Non-Obvious Trends
Experience Liquidity
Today’s data brokers sell, well, data. Tomorrow’s will sell trajectories. Human Data Era: value was in data piles. Experience Era: value moves to whoever controls the reward channels and the environments that generate them.
World-Model Audits
If agents imagine futures to plan their next step, we’ll need visibility into those imagined futures. Expect new tooling – explainable AI dashboards for dreams – that explains why the agent picked that move or simulation (nodding at Causal AI again).
Delegated Curiosity, Delegated Risk
Agents incentivized to explore will stumble onto the unexpected – some of it powerful, some of it dangerous. Move 37 was unexpected but safe by any measure. What happens when your chemistry agent discovers a material you don’t yet understand?
Why This Actually Matters
Silver and Sutton’s logic is clear: give an agent a rich interface to reality, a flexible reward system, and time – and it will outgrow both human data and human reasoning styles.
That becomes especially important for builders: they might need to shift from prompt libraries to experience pipelines. From static corpora to evolving environments. From chat sessions to continuous learning loops. The next AlphaZero won’t just beat you at Go – it’ll design the board, rewrite the rules, and simulate ten thousand seasons before breakfast. How should we build our Co-agency then?
I’ll leave you with this quote from Silver and Sutton: “We believe that today’s technology, with appropriately chosen algorithms, already provides a sufficiently powerful foundation to achieve these breakthroughs. Furthermore, the pursuit of this agenda by the AI community will spur new innovations in these directions that rapidly progress AI towards truly superhuman agents.”
Welcome to Monday. When experiencing becomes the only way forward.
Curated Collections
We are reading/watching
On Jagged AGI: o3, Gemini 2.5, and everything after by Ethan Mollick
Seven lessons from building with AI by The Exponential View
Not exactly for AI professionals, but worth showing to your relatives who keep asking you about AI – an interview with Demis Hassabis on CBS’s 60 Minutes
News from The Usual Suspects © – it’s all about OpenAI today
OpenAI | Models Models Models
OpenAI dropped GPT‑4.1 in full, mini, and nano flavors – cheaper, faster, and catching up with Google’s million‑token context window. The models are available via API but curiously absent from ChatGPT – a move that slightly backpedals on Sam Altman’s earlier promise of enhanced reasoning. Meanwhile, Codex CLI debuts as a nimble, open-source coding sidekick for your terminal – Claude Code has company.
Then OpenAI introduced o3 and o4-mini – and blew up the internet. It’s OpenAI’s boldest move toward agentic AI so far. The new stars in the Orion constellation arrive with full tool access, visual skills, and a better grasp of complex tasks. Trained with reinforcement learning and sharper than before, they bring real gains in reasoning, efficiency, and autonomous behavior. But not everything shines. Early testing from Transluce found that o3 often invented tool use – and doubled down on these imagined actions when questioned. Compared to the GPT series, the o‑models seem particularly prone to hallucinations, likely a side effect of their reinforcement learning focus and lack of internal reasoning transparency. Meanwhile, METR’s evaluations back up the improvements in autonomous tasks but raise new concerns: o3 repeatedly engaged in reward hacking, and evaluators warned the models might be hiding capabilities. If anything, current safety tests may be too easy. In Interconnects, Nathan Lambert explored how OpenAI’s o3 reveals a new kind of over-optimization rooted in reinforcement learning: while o3 excels at tool use and complex reasoning, it also exhibits bizarre behaviors. That was interesting:
Amid the debate, economist Tyler Cowen casually called o3 an AGI, asking: “How much smarter were you expecting AGI to be?” The Marginal Revolution comments lit up, with many arguing that true AGI still demands more breadth, grounding, and robustness than o3 shows today. And then OpenAI’s former Chief Research Officer weighed in:
Trying to understand that fraction, I started using o3 myself, and I have mixed feelings. You see, ChatGPT has been my go-to model since the beginning. I have a few GPTs built for my varied needs, and they are well tuned to understand what I want and deliver it. With o3, the results were less consistent. I guess any new model takes a few days before it learns to understand you better. So far, o3 wins as an agentic model but loses on simpler language tasks.
Also, OpenAI published “A Practical Guide to Building Agents”. Honestly, the more interesting part was Harrison Chase’s frustrated but thoughtful response, trying to bring some real clarity into a very fuzzy conversation.
Little did Harrison know that the same week Anthropic also released “Claude Code: Best practices for agentic coding”. Simon Willison dug into its “ultrathink” feature.
Models to pay attention to:
🌟BitNet b1.58 2B4T (Microsoft Research) introduces a native 1-bit 2B-parameter LLM trained from scratch, matching full-precision models in performance while dramatically improving efficiency and reducing memory and energy use → read the paper
🌟MineWorld (Microsoft Research) creates a real-time, open-source world model for Minecraft using a visual-action Transformer that generates future scenes based on player actions, enabling interactive agent simulations → read the paper
🌟 Embed 4 (Cohere) launches a state-of-the-art multimodal embedding model capable of 128K-token context handling, multilingual search, and robust retrieval across complex documents for enterprise AI applications → read more
DeepCoder-14B-Preview (Together AI & Agentica) develops a 14B code reasoning model using RL that matches o3-mini’s performance, with full open-sourcing of data, training recipes, and system optimizations → read the paper
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models (Together AI & Cornell University) builds a hybrid linear RNN reasoning model based on Mamba, achieving faster inference and matching the accuracy of top distilled transformer reasoners like DeepSeek R1 → read the paper
SocioVerse (Fudan University & others) designs an LLM-agent-driven world model simulating large-scale social behavior using 10 million real-world user profiles, across politics, news, and economics domains → read the paper
PerceptionLM (Meta & others) releases a fully open vision-language model trained without black-box distillation, focusing on detailed video understanding with new datasets and a benchmark suite → read the paper
🌟 Seedream 3.0 (ByteDance Seed) presents a bilingual (Chinese-English) image generation model with major improvements in visual fidelity, text rendering, and native high-resolution outputs up to 2K → read the paper