#79: Sora and World Models - Bringing magic to muggles
Spatial Intelligence just got a boost! Plus, concise coverage of a remarkably rich week in ML research and innovations
It should be illegal to ship that many updates and releases so close to the holidays, but here we are, two weeks before Christmas, with our hands full of news and research papers (thank you very much, OpenAI's 12 days of shipping and a booming NeurIPS!). Let's dive in: Sora, Genie 2 by Google DeepMind, and World Labs by Fei-Fei Li. It was truly a fascinating week. Be aware: a lot of videos in this newsletter! You might want to
Now, to the week's hottest topics: Sora, Genie 2 and World Labs
It's not exactly trivial to get access to Sora, and there are a couple of issues:
Lack of communication from the team: For example, OpenAI announced that Sora is included with ChatGPT Plus/Pro, but it wasn't for us. And nobody from the team could immediately clarify that. That's frustrating. We had to buy an additional subscription.
A lot of demand created by their professional "12 days of Shipmas" hype-making. To the point that Sam Altman had to say, "Signups will be disabled on and off, and generations will be slow for a while."
And if you are in Europe or the UK, you simply can't get access to Sora.
But.
If and when you finally get your hands on it, Sora is pretty magnificent. It's actually quite incredible. Once again, OpenAI beats everyone with an intuitive user experience, delivering sophisticated technology to every noob out there. In every sense of it, bringing magic to muggles.
Enigmatic Computing Adventure - Sora by OpenAI
One thing Sora doesn't allow, no matter how hard you try, is generating a realistic depiction of an actual person, even a historical figure. (In the video above, I attempted to create Alan Turing, of course!) Considering that competing models are likely to support this soon, it's a disadvantage, but an understandable one, given the current legal battles around copyright that OpenAI is involved in.
As noted in the presentation: if you're expecting Sora to produce a feature film for you, that's not going to happen. But consider how far we've come. Just two years ago, text-to-image generation was clumsy at best (ah, the nostalgia of extra fingers!). Now, we have the ability to create entire video clips with intuitive storyboards, allowing you to turn text into video, incorporate your own images, and refine the result into something surprisingly polished.
Enveloped by Technology - Sora by OpenAI
And even if the laws of physics are still suffering, the progress is enormous.
Now to the nerdy part: This exciting progress ties closely to the concept of spatial intelligence, which we use daily, whether it's navigating a map, packing a suitcase, parking a car, or planning the steps of a complex recipe. Spatial intelligence aligns with the idea of "world models," a term introduced by David Ha and Jürgen Schmidhuber in their 2018 paper World Models. Since then, the discussion and development have advanced considerably.
Two World Models from last week
Google DeepMind introduced Genie 2, a large-scale foundation world model capable of generating diverse, action-controllable 3D environments from a single image or text prompt. Trained on extensive video datasets, Genie 2 can simulate various scenarios, including object interactions, character animations, and physical effects like gravity and lighting. Users can interact with these generated worlds in real time using standard inputs such as a keyboard and mouse.
Smoke 2 by Google DeepMind Genie 2
This development represents a significant advancement in the creation of adaptable training grounds for AI, enabling rapid prototyping of interactive experiences and providing diverse environments for training and evaluating embodied agents.
Similarly, World Labs, co-founded by AI pioneer Fei-Fei Li, unveiled an AI system that generates interactive 3D scenes from a single image. This system allows users to explore AI-generated scenes directly in a web browser, with the ability to move within the environment and interact with various elements. The technology adapts to different art styles and scenes, bringing the physics of real life into the virtual space.
World Labs Unveils AI System That Transforms Single Images into Interactive 3D Worlds
World Labs' approach focuses on creating large world models to perceive, generate, and interact with the 3D world, aiming to democratize the creation of virtual spaces and make the process faster and more accessible.
Diving into Genie 2 or World Labs' system, you'll discover they're nothing short of revolutionary. These systems take the foundational principles of World Models and push them into uncharted territory, evolving into rich, interactive 3D environments.
This leap, from task-specific applications to versatile, immersive systems, demonstrates the transformative power of world models. Spatial intelligence marks a fundamental shift, breaking free from the "flat" screen paradigm to embrace the three-dimensional way our minds are naturally wired to think, explore and interact.
The possibilities are truly thrilling.
Twitter library
16 New Types of Retrieval-Augmented Generation (RAG)
AI in Practice - Rats welcome robot-rat
AI infiltrates the rat world: New robot can interact socially with real lab rats
To add to that: Almost 10% Of South Korea's Workforce Is Now A Robot
We are reading - Intel on our mind (is it really dying?)
Rene Haas highlighted Intel's struggle between vertical integration and a fabless model, citing high costs and innovation challenges. He mentioned attempting to encourage Intel to license Arm technology and acknowledged the strategic benefits of vertical integration amid rumors of Arm's interest in acquiring parts of Intel.
Meanwhile, Ben Thompson argues that Intel's decline stems from its inability to adapt to mobile and efficiency-first computing, allowing ARM and TSMC to dominate. He highlights missed opportunities, such as Intel's refusal to embrace ARM manufacturing or prioritize power efficiency. While Pat Gelsinger's foundry plan aimed to address these issues, it was too late to reverse Intel's losses in AI and profitability. Thompson suggests that Intel's revival hinges on government-backed AI initiatives, positioning it as a vital domestic foundry for U.S. technological sovereignty.
Semianalysis attributes Intel's decline to decades of leadership failures, poor board decisions, and a loss of cultural and technical leadership. Firing CEO Pat Gelsinger and prioritizing financial engineering over innovation worsened the situation. Intel's delays in advanced nodes allowed competitors like TSMC and AMD to dominate. ARM-based architectures and hyperscaler custom chips further erode its market. Intel Foundry Services is seen as its last chance for relevance, requiring massive investment and government support to secure U.S. semiconductor independence. The article advocates divesting non-core businesses and focusing on revitalizing the foundry as Intel's lifeline.
Top Research - System Cards, Tech reports and Surveys:
OpenAI o1 System Card → read it here
From 01.ai: Yi-Lightning Technical Report → read it here
This technical report introduces O1-CODER, an attempt to replicate OpenAI's o1 model with a focus on coding tasks → read the paper
Also, Densing Law of LLMs → read the paper here
Models
Meta AI's Llama 3.3 → read it here
Efficient Track Anything and Segment Anything Model 2 (SAM 2): also from Meta AI, this work develops EfficientTAM for real-time video object tracking on resource-constrained devices with high accuracy and efficiency → read the paper
Amazon Nova Foundation Models for understanding and creative tasks, focusing on scalability, safety, multilingual support, and cost-efficiency → read the paper
PaliGemma 2 from Google DeepMind advances transfer learning with Vision-Language Models optimized for tasks like OCR, molecular structure recognition, and music score transcription → read the paper
NVILA by Nvidia reduces training and inference costs while maintaining high accuracy for tasks like medical imaging and robotic navigation → read the paper
You can find the rest of the curated research at the end of the newsletter.
News from The Usual Suspects ©
Google got a wow reaction from both Elon Musk and Sam Altman for its quantum computing chip, Willow.
Hugging Face: Visualizing the 2024 rise of open-source AI with style
Microsoft is seeing the big picture
Microsoft's new Copilot Vision brings real-time insights to the Edge browser for Pro users. Aimed at enterprise decision-makers, it turns data into actionable visuals with the click of a button. Microsoft continues weaving AI deeper into everyday workflows.
OpenAI levels up with ChatGPT Pro and Reinforcement Fine-Tuning Research Program
OpenAI introduces ChatGPT Pro, offering unlimited access to all models for $200/month, including the powerful GPT-4 turbocharged "o1", and has expanded its RFT Program to enable developers and ML engineers to create expert models fine-tuned to excel at specific sets of complex, domain-specific tasks.
AWS Reinvents AI again
AWS drops the mic with cutting-edge AI updates at re:Invent 2024. Highlights include Multi-Agent Orchestration on Bedrock, the Nova AI Model Family, and Prompt Caching for big savings. Enterprises like Moody's are already reaping the benefits of AI-first workflows.
Salesforce measures AI's pulse
Salesforce's Agentforce platform is delivering on its promise with soaring adoption KPIs. Enterprise AI agents are automating workflows, driving real ROI, and making humans feel slightly less indispensable.
Canada gets cooler with AI
Cohere and CoreWeave are teaming up to build a cutting-edge data center in Canada. The collaboration promises to accelerate AI research while keeping the great white north on the innovation map.