🎙️ What Is AI Missing for Real Reasoning?
Axiom Math’s Carina Hong on how to build an AI mathematician
Is math the ultimate test for AI reasoning? Or is next-token prediction fundamentally incapable of discovering new truths and formulating conjectures? Carina Hong, co-founder and CEO of Axiom Math, argues that to build true reasoning capabilities, we need to move beyond “chatty” models to systems that can verify their own work using formal logic.
They also just published a very interesting research paper, “Transformers know more than they can tell: Learning the Collatz sequence”. This work shows that a small transformer can actually learn the hidden control logic of a hard Collatz-style algorithm, not just fit patterns. Its mistakes are structured and explainable, so Collatz becomes a microscope for how transformers acquire real algorithmic reasoning. It’s the first clean demonstration that a tiny transformer can infer and internalize the control flow of a genuinely hard mathematical process. Not arithmetic tricks, not memorized mappings, but the loop structure itself. And when it errs, the errors expose the learned algorithm instead of random hallucinations. This gives us a rare, mechanistic window into how transformers acquire algorithmic reasoning. Fascinating.
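If you want a concrete picture of the algorithm involved, here is a minimal Python sketch of the classic Collatz iteration. The paper studies Collatz-style computation, so treat this as an illustration of the kind of branching loop a transformer has to internalize, not the exact setup used in the work.

```python
# The classic Collatz map: halve even numbers, send odd n to 3n + 1,
# and count the steps until the sequence reaches 1. The data-dependent
# branching is exactly the kind of control flow that cannot simply be memorized.
def collatz_steps(n: int) -> int:
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

if __name__ == "__main__":
    for n in (6, 7, 27):
        print(n, collatz_steps(n))  # 27 famously takes 111 steps
```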
Subscribe to our YouTube channel, or listen to the interview on Spotify / Apple.
In this episode of Inference, we get into:
Why current LLMs are like secretaries (good at retrieval) but bad at de novo mathematics
The three pillars of an AI Mathematician
How AlphaGeometry proved that symbolic logic and neural networks must merge
The difference between AGI and Superintelligence
Why “Theory Building” is harder to benchmark than the International Math Olympiad (IMO)
The scarcity of formal math data (Lean) compared to Python code
We also discuss the bottlenecks: the “chicken and egg” problem of auto-formalization, why Axiom bets on specific superintelligence over general models, and how AI will serve as the algorithmic pillar for the future of hard science.
Carina has a very clear vision and it was a pleasure talking to her. Watch the interview
This is a free edition. Upgrade if you want to receive our deep dives directly in your inbox. If you want to support us without getting a subscription – do it here.
We prepared a transcript (edited by Genspark Super Agent) for your convenience. But as always – please watch the full video, subscribe, like and leave your feedback. It helps us grow on YouTube and bring you more insights ⬇️
Ksenia Se: Hello everyone and welcome to Inference by Turing Post. I’m very excited to be joined today by Carina Hong. She’s co-founder and CEO of Axiom Math, a fascinating new company that builds a full stack for machine-checkable mathematical reasoning by turning math into a programmable, verifiable language and using it to push towards self-improving AI, which is super hard. Welcome, Carina.
Carina Hong: Great to be here.
Why Can’t AI Add?
Ksenia: You know what’s jarring? I’ll start with a simple question. Everyone is talking about AGI, and then I’m talking to my model and it says: “This meal is 400 calories, this meal is 350, the snack is 200 calories, and dinner is 350. Overall it’s 800 calories, you’re totally fine.” And I’m just staring at it. Why is math so hard?
Carina: That’s a great question. Current models are trained by next-token prediction. That isn’t exactly what’s going on when you’re doing mathematical reasoning. If you’re a human mathematician, you don’t give an answer just because other people say it’s correct, right?
Machines are also currently trained on human preferences, which is not exactly how human reasoning works. You can have very similar concepts but a totally different problem, so just looking at training data related to that concept isn’t going to help you.
We do see machines being good at some parts of math – for example, being almost like the secretary for a mathematician. Looking up citations, looking up what has shown up in an ancient paper or even in a German mathematician’s notebook. That sort of retrieval and searching in the training data is something machines are good at. But they’re really not great at coming up with de novo mathematical knowledge.
When it comes to arithmetic, like the example you gave, it’s quite interesting that models get that wrong, or get some basic linear algebra proof wrong. We also think it’s reasonable to say that when the problem becomes very complex – when it’s a multi-step reasoning problem, when the tree gets very deep – models are going to struggle even more.
Three Pillars of an AI Mathematician
Ksenia: But you are trying to build an AI mathematician that will be basically independent from a human, right? What are the steps that you need to make? What are the main bottlenecks? I know it’s a lot, but give me an overview.
Carina: We believe there are some pillars that are important to the AI mathematician.
First, we want a prover system that can produce mathematical proofs when it’s asked a question, and can do that very well. This is the focus of some other formal math research work – having a system that can come up with proofs. That’s different from an informal model: large language models give you just the numerical answer.
When we talk about math as a verifiable domain, what we want to see is not just the numerical answer – it’s verifiable because a machine can grade that 957 is 957. What that misses is the intermediate steps. We want all the intermediate steps, the reasoning process, to be verifiable as well. That’s the prover system, number one.
Number two, we want the model to have a great knowledge base. Know what’s already in the knowledge base and what is not. Be able to introduce new nodes to the knowledge base as the prover generates more correct proofs, turning conjectures into theorems, and even bringing in new definitions and concepts that are useful for coming up with new problems in the knowledge base. This knowledge base could have a graph structure, a citation-based knowledge graph.
Number three, we really want to have a conjecture system – a model that can propose interesting new math questions. Then this conjecture model is going to give the challenge to the prover model, and the two of them can just talk. That forms a self-improving loop. The conjecturer can look at what the prover gets right and wrong and come up with an entirely new curriculum of problems to help the prover model hill-climb.
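To make the shape of this loop concrete, here is a deliberately toy sketch in Python. The “conjectures”, the prover, and the checker are all trivial stand-ins invented for illustration – not Axiom’s actual system – but the structure of the round mirrors the three pillars described above.

```python
import random

# Toy stand-in for the conjecturer–prover loop. "Conjectures" are claims that
# a*n + b == c*n + d holds for all integers n; the prover and checker stand in
# for an LLM prover and a Lean-style proof checker.

knowledge_base = []                                  # pillar 2: grows over time

def conjecturer(n_candidates):                       # pillar 3: propose statements
    for _ in range(n_candidates):
        yield tuple(random.randint(-3, 3) for _ in range(4))   # (a, b, c, d)

def prover(stmt):                                    # pillar 1: search for a proof
    a, b, c, d = stmt
    return "coeffs-equal" if (a == c and b == d) else None     # trivial certificate

def checker(stmt, proof):                            # independent verification
    a, b, c, d = stmt
    return proof == "coeffs-equal" and a == c and b == d

def self_improvement_round(n_candidates=1000):
    outcomes = []
    for stmt in conjecturer(n_candidates):
        proof = prover(stmt)
        proven = proof is not None and checker(stmt, proof)
        if proven:
            knowledge_base.append((stmt, proof))     # conjecture becomes a theorem
        outcomes.append((stmt, proven))
    # A real conjecturer would now use `outcomes` to retarget its curriculum
    # at the frontier of what the prover can and cannot do.
    return sum(proven for _, proven in outcomes)

if __name__ == "__main__":
    print(self_improvement_round(), "conjectures proven;",
          len(knowledge_base), "theorems in the knowledge base")
```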
So these are the three parts. And there’s also auto-formalization, or the ability to turn natural language math into Lean, the programming language for proofs. This is sort of interweaving all these different parts.
For example, one might have a really good knowledge base of informal mathematics. We need auto-formalization to turn that into a Lean knowledge base. One may have a good conjecturer that’s able to give some high-level intuitions. Also, auto-informalization – turning Lean back into English – will be needed for the conjecturer.
If a conjecture is stated in informal language and the prover proves it in formal language, auto-informalizing the result back into English adds to existing human mathematics. Human mathematicians might find inspiration in that.
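To make “machine-checkable intermediate steps” concrete, here is a minimal sketch of what auto-formalization targets, assuming Lean 4 with Mathlib (the statement, imports, and proof style are illustrative; a real pipeline would produce something like this automatically from the informal sentence).

```lean
import Mathlib

-- Informal statement: "the sum of two even integers is even."
-- One possible formalization: every intermediate step below is
-- machine-checked, not just the final answer.
theorem even_add_even {a b : ℤ} (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  obtain ⟨x, hx⟩ := ha                 -- Even a means a = x + x for some x
  obtain ⟨y, hy⟩ := hb                 -- Even b means b = y + y for some y
  exact ⟨x + y, by rw [hx, hy]; ring⟩  -- witness x + y; `ring` closes the arithmetic
```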
I would say one final thing on the AI mathematician point: while we believe this model is going to have incredible capability, having the model be able to collaborate with human mathematicians is also a really important part.
We think that what is difficult and what is easy for human understanding and mathematical capability is sometimes different from what a machine, an AI mathematician, finds difficult and easy. So there’s definitely a lot of collaboration space. Because if they were good at exactly the same things and bad at exactly the same things, then you might argue: when the AI mathematician becomes really superintelligent, why would there be a need for human mathematicians to exist?
The answer is: they’re good at very different things. So therefore, the collaboration and the synergy becomes incredibly interesting.
The Conjecture Problem
Ksenia: That’s incredibly interesting, because I want to go back to this conjecture model and how you’re even going to create it. Math is so endless – how is it going to narrow down to a conjecture? And it’s also somehow related, I think, to the intuition that a human mathematician can have.
So I think conjecture creation is a very important topic – and also what role intuition plays here, and how, or even whether, you can build it into a model.
Carina: My eureka moment – “Wow, this is so amazing that we are probably functioning in a space where we can create new knowledge” – was the AlphaGeometry paper.
In AlphaGeometry, Google DeepMind turned all the Euclidean geometry figures – triangles, circles, lines, intersection points – into a symbolic language. They turn that into a vector-based language and then actually generate de novo geometry problems by operating in that vector-based language. So they generate code snippets, and once they’re done generating these new problems, they turn them back into geometry representations.
So it’s almost like extremely creative geometry problems are generated synthetically. That’s an incredibly powerful idea.
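As a rough illustration of what “symbolic language” means here – and this is a hypothetical encoding, not AlphaGeometry’s actual representation – a sampled geometry problem might look like a handful of symbolic premises plus a goal that a deduction engine can derive from them:

```python
# Hypothetical symbolic encoding of a synthetic geometry problem
# (illustrative only). A generator samples premises like these; a symbolic
# deduction engine works out what follows, and any non-trivial consequence
# becomes a new, de novo problem.
premises = [
    ("triangle", "A", "B", "C"),
    ("midpoint", "M", "B", "C"),        # M is the midpoint of BC
    ("perpendicular", "AM", "BC"),      # AM ⟂ BC
]
derived_goal = ("equal_segments", "AB", "AC")   # deduced: AB = AC, so the triangle is isosceles
```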
We’ve also seen this in the line of mathematical discovery work by my colleagues François Charton and Alberto Bietti – researchers who were previously at Meta’s Fundamental AI Research lab and are currently at Axiom. They’re able to use machine learning to generate very interesting constructions and examples – not conjectures, but mathematical objects.
By observing the underlying pattern of a family of them, the human mathematicians they work with were actually able to come up with de novo conjectures. So using these machine-generated objects to guide human intuition – that’s what they do in discovery.
Now, the question of intuition is very big, right? It’s very broad. One important part is the ability to generate these examples and constructions. The other part is being able to synthesize them – the human would synthesize them and formulate interesting conjectures.
Or you could say that a model in the formal language space can synthetically generate new problems and statements by taking in a couple of seed statements. This is what people have been trying – for example, in STP, the Self-Taught Theorem Prover work from Stanford. They’re looking at seed statements from a dataset called Lean Workbook, a high-school-level math dataset.
Then they try to generate conjectures that are fully in Lean, and they have the prover model actually try to give a reward to the conjecture model. But obviously there are certain additional design choices they need to make, because one can easily imagine a case where the prover–conjecturer system doesn’t get anywhere. Well, is it the prover that’s too weak, or is it the conjecturer that’s too weak?
This sort of difficulty – troubleshooting, or disentanglement – is something that a lot of researchers are still working on. I do think that by combining formal language, which is at a different abstraction level, with informal language, you have more space for conjecturing.
The Return of Symbolic AI?
Ksenia: Lean is a symbolic language, right? Is this the return of symbolic machine learning?

