🎙️Are Memories the Missing Half of the AI Brain?
Why AI Intelligence is Nothing Without Visual Memory | Shawn Shen on the Future of Embodied AI
We often confuse intelligence with memory. But in the human brain, reasoning and retrieval are separate processes. Shawn Shen – founder of Memories.ai and former researcher at Meta Reality Labs – believes that for AI to move from chatbots to the physical world – into robots, glasses, and wearables – it must stop trying to memorize everything in the model weights and start “seeing” like a human.
He explains why the next leap in AI is about long-term visual persistence and breaks down why today’s Transformers struggle with object permanence and how his team is building the “hippocampus” for embodied agents. Shawn also told me that they are developing a world model architecture to solve contextual awareness! Super interesting. You’ll learn a lot.
Subscribe to our YouTube channel, or listen to the interview on Spotify / Apple
In this episode of Inference, we get into:
“Encode for Machine”: Why we need to stop compressing video for human eyes and start compressing it for AI logic.
The critical architectural split between the Intelligence Model (creative, generative) and the Memory Model (retrieval, factual).
Why Transformers don’t understand physics or time – and why World Models are the answer.
The brutal engineering constraints of running infinite visual memory on-device without Wi-Fi.
How to build a system that remembers your life without becoming a surveillance nightmare.
Why The Mom Test is the most important book for researchers transitioning to product builders.
We also discuss the “Era of I Don’t Know” in research, the limitations of current context windows, and the future where your smart glasses actually know if you’ve been eating healthy this week.
This is a conversation about the missing half of the AI brain. Watch it!
We prepared a transcript for your convenience. But as always – watch the full video, subscribe, like and leave your feedback. It helps us grow on YouTube and bring you more insights ⬇️
Ksenia Se: Hello everyone. Today I’m joined by Shawn Shen, who left Meta Reality Labs with a provocative thesis: that intelligence without long-term visual memory isn’t really intelligence. He co-founded Memories.ai to build the world’s first Large Visual Memory Model. Welcome, Shawn!
Shawn Shen: Thank you so much, Ksenia.
Ksenia: You guys emerged from stealth just recently in July of this year, right? That was when I first noticed you, and I thought about all the sci-fi movies that I’ve watched, because in these movies, the memory, the recognition, the search through memories – it’s all solved. But we’re not in the movies. So, do you think that visual memory specifically is the thing that unlocks intelligence?
Shawn: I think in the future, intelligence is going to be embodied. What that means is that your robots, smart glasses, smart wearables, your cameras – cameras on the street, cameras in the home, anything that has a physical camera – that is going to be the real embodied intelligence, and that is going to be the real intelligence. For that intelligence to work, it has to have visual memories, because you can’t have an AI that’s able to see but not remember what it has seen. So they need to have this visual memory.
We also take it from a very first-principles point of view. We think about how human memory works, how human cognition systems work. Our cognition system works by two things, right? One is intelligence. One is memory. They’re totally separate and in parallel. And if we are going to build better AI similar to how humans work, then we also need to split this into intelligence and memory. So memories will make intelligence better.
And what are memories? I define them as visual memories, because most of our memories are actually visual. For example, if I ask you: When was the last time you had an amazing burger? How many times have you been to the gym? Are you eating healthy this week? If I ask you these questions, you would usually recall what you have eaten, the gym, et cetera – all visually, all vividly. And then you recall that and you reason on top of that using your intelligence.
So what we build is this encoding and retrieval process. We call this memories.
Beyond Transformers: Building the Large Visual Memory Model
Ksenia: Fascinating, but super hard to build. Your model is called Large Visual Memory Model. As far as I understand, it’s still based on Transformer architecture, right? Or have you moved beyond that?
Shawn: It’s still based on the Transformer architecture. There are other architectures – for example, Mamba – but the Transformer is still the only one that reliably scales as you add more data, so it remains the best option. We’re still using the Transformer architecture, but we’re training this model for very different purposes than other Transformer-based models.
For example, large language models are trained to be creative, to be intelligent. And sometimes they have hallucinations, right? But when we’re training our Large Visual Memory Model, it is essentially an all-in-one embedding model. What that means is that it turns all the videos and actually all the contexts – including audios, text, actions, everything – into the same embedding space.
And we don’t need any creativeness. We just want to turn all those different multimodal data into embeddings that can ultimately be losslessly reconstructed back to the original formats. So this is how we train the model. The way that we train intelligence models and the way we train memory models are fundamentally different.
Ksenia: I see. But when you talk about memories, you talk more about the brain. And when you talk about the model, you talk more about… sort of a database. So in your understanding, is the ultimate memory model closer to the brain or closer to the database?
Shawn: So the ultimate memory model is composed of two parts. It’s a system. It’s not an end-to-end model itself. Just think about how human memory works. Our memories are based on retrieval and reconstruction.
When we see things – for example, I’m 29 years old, I have 29 years of visual memories – I index the world in real time and store all of this in my memories. And whenever you ask me a question, I retrieve from it. So our human brain also operates similar to a system.
This system is composed of two very important key components. One is this indexing and encoding process. The other is the retrieval process. So what we built is also these two key parts. We built state-of-the-art indexing models, which is our Large Visual Memory Model. And then we also built this very AI-native, video-native, robust retrieval system. And on top of that, people can build different multimodal AI agents to enable different applications for embodied AI.
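The two-part system Shawn describes – an indexing/encoding model plus a retrieval layer – can be illustrated with a toy sketch. Everything here is hypothetical: the hash-based encoder is a deterministic stand-in for a learned multimodal embedding model, and `ToyVisualMemory` is an invented name, not Memories.ai’s actual API. Only the pipeline shape (encode → store → retrieve by similarity) mirrors the interview.

```python
import numpy as np

class ToyVisualMemory:
    """Toy sketch of an encode-then-retrieve memory system.

    Two components, mirroring the interview:
      1. an encoder that turns each observation into an embedding,
      2. a retrieval step that finds the closest stored memories.
    """

    def __init__(self, dim: int = 64):
        self.dim = dim
        self.keys: list[str] = []          # original observations
        self.index: list[np.ndarray] = []  # stored embeddings

    def _encode(self, text: str) -> np.ndarray:
        # Placeholder embedding: seed a RNG from the text so the same
        # input always maps to the same unit vector. A real system would
        # use a learned model over video, audio, and text.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.standard_normal(self.dim)
        return v / np.linalg.norm(v)

    def memorize(self, observation: str) -> None:
        # "Indexing": encode the observation and store its embedding.
        self.keys.append(observation)
        self.index.append(self._encode(observation))

    def recall(self, query: str, k: int = 1) -> list[str]:
        # "Retrieval": rank stored memories by cosine similarity
        # (dot product of unit vectors) against the encoded query.
        q = self._encode(query)
        sims = [float(q @ v) for v in self.index]
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        return [self.keys[i] for i in top]
```

In a real system the encoder would place semantically similar clips near each other in the embedding space, so a fuzzy query like “When did I last eat a burger?” would retrieve related memories rather than requiring an exact match as this toy version does.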
The World Model Evolution
Ksenia: But do you feel that Transformer kind of limits you in that? Do you look for other architectures? Is there research for you there?
Shawn: Yes, of course. We’re actually looking at leveraging world models to build this embedding model and indexing model. Because what we found is that current Transformers – current model architectures – don’t really understand the world, don’t really understand physics, and don’t really have temporal contextual awareness.
A very simple example: when I reach for something, pick it up, and then put it back, how does the model know that, for example, these AirPods are still my AirPods? And how does it know that Ksenia – when she changes her outfit, changes her hair color, or even turns her back to me – is still Ksenia? How do you enable this temporal contextual awareness for objects and for humans? The Transformer architecture doesn’t support that.
So now we’re developing a world model architecture to solve that. We haven’t launched it, but this is our current state-of-the-art work.
Ksenia: And you develop it all in-house? From scratch, your own models?
Shawn: Yes, we develop all of the models in-house because we are a bunch of research scientists coming from all different big lab backgrounds. So we have pretty long-standing experience building models, and especially building multimodal AI models ourselves.
From Neuroscience to Customer Needs
Ksenia: I wonder how your process is going. Do you have a big picture of the brain and say, “We solved this part of it, now we’re moving forward”? How do you plan what to solve next?
Read further for free, or better yet, watch it on YouTube