🌁#79: Sora and World Models – Bringing magic to muggles

Spatial intelligence just got a boost! Plus, concise coverage of a remarkably rich week in ML research and innovation

Dec 10, 2024

It should be illegal to ship that many updates and releases so close to the holidays, but here we are – two weeks before Christmas, with our hands full of news and research papers (thank you very much, OpenAI's 12 days of shipping and a booming NeurIPS!). Let’s dive in: Sora, Genie 2 by Google DeepMind, and World Labs by Fei-Fei Li – it was truly a fascinating week. Be aware: there are a lot of videos in this newsletter! You might want to read it online.

Now, to the week’s hottest topics: Sora, Genie 2 and World Labs

It’s not exactly trivial to get access to Sora, and there are a couple of issues:

Lack of communication from the team: For example, OpenAI announced that Sora is included with ChatGPT Plus/Pro – it wasn’t for us. And nobody from the team could immediately clarify that. That’s frustrating. We had to buy an additional subscription.
A lot of demand created by their professional “12 days of Shipmas” hype-making. To the point that Sam Altman had to say, “Signups will be disabled on and off, and generations will be slow for a while.”
And, if you are in Europe or the UK – you simply can’t get access to Sora.

But.

If and when you finally get your hands on it – Sora is pretty magnificent. It’s actually quite incredible. Once again, OpenAI beats everyone with an intuitive user experience, delivering sophisticated technology to every noob out there. In every sense of it, bringing magic to muggles.

Enigmatic Computing Adventure – Sora by OpenAI

One thing Sora doesn’t allow, no matter how hard you try, is generating a realistic depiction of an actual person, even a historical figure. (In the video above, I attempted to create Alan Turing, of course!) Considering that competing models are likely to support this soon, it’s a disadvantage – but an understandable one, given the copyright lawsuits OpenAI is currently involved in.

As noted in the presentation: if you’re expecting Sora to produce a feature film for you, that’s not going to happen. But consider how far we’ve come. Just two years ago, text-to-image generation was clumsy at best – ah, the nostalgia of extra fingers! Now, we have the ability to create entire video clips with intuitive storyboards, allowing you to turn text into video, incorporate your own images, and refine the result into something surprisingly polished.

Enveloped by Technology – Sora by OpenAI

And even if the laws of physics are still suffering, the progress is enormous.

Now to the nerdy part: This exciting progress ties closely to the concept of spatial intelligence, which we use daily – whether it’s navigating a map, packing a suitcase, parking a car, or planning the steps of a complex recipe. Spatial intelligence aligns with the idea of “world models,” a term popularized by David Ha and Jürgen Schmidhuber in their 2018 paper World Models. Since then, the discussion and development have advanced considerably.

Two World Models from last week

Google DeepMind introduced Genie 2, a large-scale foundation world model capable of generating diverse, action-controllable 3D environments from a single image or text prompt. Trained on extensive video datasets, Genie 2 can simulate various scenarios, including object interactions, character animations, and physical effects like gravity and lighting. Users can interact with these generated worlds in real-time using standard inputs such as a keyboard and mouse.

Smoke 2 by Google DeepMind Genie 2

This development represents a significant advancement in the creation of adaptable training grounds for AI, enabling rapid prototyping of interactive experiences and providing diverse environments for training and evaluating embodied agents.

Similarly, World Labs, co-founded by AI pioneer Fei-Fei Li, unveiled an AI system that generates interactive 3D scenes from a single image. This system allows users to explore AI-generated scenes directly in a web browser, with the ability to move within the environment and interact with various elements. The technology adapts to different art styles and scenes, bringing the physics of real life into the virtual space.

World Labs Unveils AI System That Transforms Single Images into Interactive 3D Worlds

World Labs' approach focuses on creating large world models to perceive, generate, and interact with the 3D world, aiming to democratize the creation of virtual spaces and make the process faster and more accessible.

Diving into Genie 2 or World Labs’ system, you’ll discover they’re nothing short of revolutionary. These systems take the foundational principles of World Models and push them into uncharted territory, evolving into rich, interactive 3D environments.

This leap – from task-specific applications to versatile, immersive systems – demonstrates the transformative power of world models. Spatial intelligence marks a fundamental shift, breaking free from the "flat" screen paradigm to embrace the three-dimensional way our minds are naturally wired to think, explore and interact.

The possibilities are truly thrilling.

Twitter library

16 New Types of Retrieval-Augmented Generation (RAG)

AI in Practice – Rats welcome robot-rat

AI infiltrates the rat world: New robot can interact socially with real lab rats

To add to that: Almost 10% Of South Korea's Workforce Is Now A Robot

We are reading – Intel on our mind (is it really dying?)

Arm CEO Rene Haas highlighted Intel's struggle between vertical integration and a fabless model, citing high costs and innovation challenges. He mentioned that he had tried to encourage Intel to license Arm technology, and acknowledged the strategic benefits of vertical integration amid rumors of Arm's interest in acquiring parts of Intel.
Meanwhile, Ben Thompson argues that Intel’s decline stems from its inability to adapt to mobile and efficiency-first computing, allowing ARM and TSMC to dominate. He highlights missed opportunities, such as Intel’s refusal to embrace ARM manufacturing or prioritize power efficiency. While Pat Gelsinger’s foundry plan aimed to address these issues, it was too late to reverse Intel’s losses in AI and profitability. Thompson suggests that Intel’s revival hinges on government-backed AI initiatives, positioning it as a vital domestic foundry for U.S. technological sovereignty.
SemiAnalysis attributes Intel's decline to decades of leadership failures, poor board decisions, and a loss of cultural and technical leadership. Firing CEO Pat Gelsinger and prioritizing financial engineering over innovation worsened the situation. Intel's delays in advanced nodes allowed competitors like TSMC and AMD to dominate. ARM-based architectures and hyperscaler custom chips further erode its market. Intel Foundry Services is seen as its last chance for relevance, requiring massive investment and government support to secure U.S. semiconductor independence. The article advocates divesting non-core businesses and focusing on revitalizing the foundry as Intel's lifeline.

Top Research – System Cards, Tech reports and Surveys:

OpenAI o1 System Card →read it here
From 01.ai – Yi-Lightning Technical Report →read it here
This technical report introduces O1-CODER, an attempt to replicate OpenAI’s o1 model with a focus on coding tasks →read the paper
Also, Densing Law of LLMs →read the paper here

Models

Meta AI’s Llama 3.3 →read it here
Efficient Track Anything, also from Meta AI, builds on Segment Anything Model 2 (SAM 2) to develop EfficientTAM for real-time video object tracking on resource-constrained devices with high accuracy and efficiency →read the paper
Amazon Nova Foundation Models for understanding and creative tasks, focusing on scalability, safety, multilingual support, and cost-efficiency →read the paper
PaliGemma 2 from Google DeepMind advances transfer learning with Vision-Language Models optimized for tasks like OCR, molecular structure recognition, and music score transcription →read the paper
NVILA by Nvidia reduces training and inference costs while maintaining high accuracy for tasks like medical imaging and robotic navigation →read the paper

You can find the rest of the curated research at the end of the newsletter.

Google got a "wow" reaction from both Elon Musk and Sam Altman for its quantum computing chip, Willow.
Hugging Face: Visualizing the 2024 rise of open-source AI with style
Microsoft is seeing the big picture
Microsoft's new Copilot Vision brings real-time insights to the Edge browser for Pro users. Aimed at enterprise decision-makers, it turns data into actionable visuals with the click of a button. Microsoft continues weaving AI deeper into everyday workflows.
OpenAI levels up with ChatGPT Pro and Reinforcement Fine-Tuning Research Program
OpenAI introduces ChatGPT Pro, offering unlimited access to all models, including the powerful o1, for $200/month. It also expands its Reinforcement Fine-Tuning (RFT) Research Program, which lets developers and ML engineers create expert models fine-tuned to excel at specific sets of complex, domain-specific tasks.
AWS Reinvents AI again
AWS drops the mic with cutting-edge AI updates at re:Invent 2024. Highlights include Multi-Agent Orchestration on Bedrock, the Nova AI Model Family, and Prompt Caching for big savings. Enterprises like Moody's are already reaping the benefits of AI-first workflows.
Salesforce measures AI’s pulse
Salesforce's Agentforce platform is delivering on its promise with soaring adoption KPIs. Enterprise AI agents are automating workflows, driving real ROI, and making humans feel slightly less indispensable.
Canada gets cooler with AI
Cohere and CoreWeave are teaming up to build a cutting-edge data center in Canada. The collaboration promises to accelerate AI research while keeping the Great White North on the innovation map.

More interesting research papers from last week (categorized for your convenience)

Read further