The next grand challenge for AI | Jim Fan
Why this matters
Auto-discovered candidate. Editorial positioning to be finalized.
Summary
Auto-discovered from TED Talks. Editorial summary pending review.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forward · Mixed · Opportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 8 full-transcript segments: median 0 · mean -4 · spread -10–0 (p10–p90 -10–0) · 0% risk-forward, 100% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: medium.
- Emphasizes governance
- Emphasizes safety
- Full transcript scored in 8 sequential slices (median slice 0).
Editor note
Auto-ingested from daily feed check. Review for editorial curation under intake methodology.
Play on sAIfe Hands
On-site playback is enabled when an episode-level media URL is connected. This entry currently points to a show-level source page, not an episode-level media URL.
Episode transcript
Official TED subtitles (WebVTT from TED’s streaming metadata) · stored Apr 10, 2026 · ~103 subtitle cues
Same subtitle track TED serves to logged-in players; high-trust for talk content when YouTube captions are unavailable.
No editorial assessment file yet. Add content/resources/transcript-assessments/the-next-grand-challenge-for-ai-jim-fan.json when you have a listen-based summary.
Show full transcript
In spring of 2016, I was sitting in a classroom at Columbia University but wasn't paying attention to the lecture. Instead, I was watching a board game tournament on my laptop. And it wasn't just any tournament, but a very, very special one. The match was between AlphaGo and Lee Sedol. The AI had just won three out of five games and became the first ever to beat a human champion at the game of Go. I still remember the adrenaline of seeing history unfold that day: the glorious moment when AI agents finally entered the mainstream. But when the excitement faded, I realized that as mighty as AlphaGo was, it could only do one thing and one thing alone. It isn't able to play any other games, like Super Mario or Minecraft, and it certainly cannot do dirty laundry or cook a nice dinner for you tonight. But what we truly want are AI agents as versatile as Wall-E, as diverse as all the robot body forms or embodiments in Star Wars, and able to work across infinite realities, virtual or physical, as in Ready Player One. So how can we make this science fiction a reality, possibly in the near future? This is a practitioner's guide towards generally capable AI agents. Most of the ongoing research efforts can be laid out nicely across three axes: the number of skills an agent can do; the body forms or embodiments it can control; and the realities it can master. AlphaGo is somewhere here, but the upper right corner is where we need to go. So let's take it one axis at a time. Earlier this year, I led the Voyager project, which is an agent that scales up massively on the number of skills. And there's no game better than Minecraft for the infinite creative things it supports. And here's a fun fact for all of you. Minecraft has 140 million active players. And just to put that number in perspective, it's more than twice the population of the UK.
And Minecraft is so insanely popular because it's open-ended: it does not have a fixed storyline for you to follow, and you can do whatever your heart desires in the game. And when we set Voyager free in Minecraft, we see that it's able to play the game for hours on end without any human intervention. The video here shows snippets from a single episode of Voyager where it just keeps going. It can explore the terrains, mine all kinds of materials, fight monsters, craft hundreds of recipes and unlock an ever-expanding tree of skills. So what's the magic? The core insight is coding as action. First, we convert the 3D world into a textual representation using a Minecraft JavaScript API made by the enthusiastic community. Voyager invokes GPT-4 to write code snippets in JavaScript that become executable skills in the game. Yet, just like human engineers, Voyager makes mistakes. It isn't always able to get a program correct on the first try. So we add a self-reflection mechanism for it to improve. There are three sources of feedback for the self-reflection: the JavaScript code execution error; the agent state, like health and hunger; and the world state, like terrains and enemies nearby. So Voyager takes an action, observes the consequences of its action on the world and on itself, reflects on how it can possibly do better, tries out some new action plans, and rinses and repeats. And once the skill becomes mature, Voyager saves it to a skill library as a persistent memory. You can think of the skill library as a code repository written entirely by a language model. And in this way, Voyager is able to bootstrap its own capabilities recursively as it explores and experiments in Minecraft. So let's work through an example together. Voyager finds itself hungry and needs to get food as soon as possible. It senses four entities nearby: a cat, a villager, a pig and some wheat seeds. Voyager starts an inner monologue. "Do I kill the cat or villager for food? Horrible idea.
How about a wheat seed? I can grow a farm out of the seeds, but that's going to take a long time. So sorry, piggy, you are the chosen one." (Laughter) And Voyager finds a piece of iron in its inventory. So it recalls an old skill from the library to craft an iron sword and starts to learn a new skill called "hunt pig." And now we also know that, unfortunately, Voyager isn't vegetarian. (Laughter) One question still remains: how does Voyager keep exploring indefinitely? We only give it a high-level directive, that is, to obtain as many unique items as possible. And Voyager implements a curriculum to find progressively harder and more novel challenges to solve all by itself. And putting all of these together, Voyager is able to not only master skills but also discover new ones along the way. And we did not pre-program any of this. It's all Voyager's idea. And this, what you see here, is what we call lifelong learning: an agent that is forever curious and forever pursuing new adventures. Compared to AlphaGo, Voyager scales up massively on the number of things it can do, but still controls only one body in Minecraft. So the question is: can we have an algorithm that works across many different bodies? Enter MetaMorph. It is an initiative I co-developed at Stanford. We created a foundation model that can control not just one but thousands of robots with very different arm and leg configurations. MetaMorph is able to handle extremely varied kinematic characteristics from different robot bodies. And this is the intuition on how we created MetaMorph. First, we design a special vocabulary to describe the body parts so that every robot body is basically a sentence written in the language of this vocabulary. And then we just apply a transformer to it, much like ChatGPT, but instead of writing out text, MetaMorph writes out motor controls. We show that MetaMorph is able to control thousands of robots to go upstairs, cross difficult terrains and avoid obstacles.
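The Voyager loop described above — write code, execute it, reflect on three feedback sources, retry, and save mature skills to a library — can be sketched in a few lines of Python. This is a minimal illustrative sketch, not Voyager's real API: `propose` stands in for the GPT-4 call that writes JavaScript, and `execute` stands in for running that code in Minecraft.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    error: object       # code execution error, or None on success
    agent_state: dict   # e.g. health, hunger
    world_state: dict   # e.g. terrain, enemies nearby


@dataclass
class Voyager:
    skill_library: dict = field(default_factory=dict)  # persistent memory

    def attempt_task(self, task, propose, execute, max_retries=3):
        """Propose code for a task, execute it, reflect on feedback, retry."""
        critique = None
        for _ in range(max_retries):
            code = propose(task, self.skill_library, critique)
            feedback = execute(code)
            if feedback.error is None:
                # skill is mature: save it to the library for later reuse
                self.skill_library[task] = code
                return code
            # self-reflection: feed all three feedback sources back in
            critique = (feedback.error,
                        feedback.agent_state,
                        feedback.world_state)
        return None  # give up on this task for now
```

The key design choice mirrored here is that the skill library is ordinary code keyed by task, so recursively bootstrapped capabilities are just entries a later `propose` call can read and reuse.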
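MetaMorph's encoding idea above — every robot body is a sentence in a body-part vocabulary — can also be sketched. The vocabulary, tree format, and depth-first traversal below are invented for illustration; the real model tokenizes full kinematic trees and feeds them to a transformer that emits motor controls.

```python
# A toy vocabulary of body-part tokens (illustrative, not MetaMorph's).
VOCAB = {"torso": 0, "hip": 1, "knee": 2, "shoulder": 3, "elbow": 4, "<sep>": 5}


def tokenize_morphology(kinematic_tree, root="torso"):
    """Depth-first walk of a kinematic tree -> flat token-id sequence.

    Bodies of any shape become variable-length sequences over one shared
    vocabulary, so a single transformer can consume all of them.
    """
    tokens = [VOCAB[root]]
    for child in kinematic_tree.get(root, []):
        tokens += tokenize_morphology(kinematic_tree, child)
        tokens.append(VOCAB["<sep>"])  # mark the end of each limb subtree
    return tokens


# Two different bodies, one shared "language" of parts.
biped = {"torso": ["hip", "hip"], "hip": ["knee"]}
arm = {"torso": ["shoulder"], "shoulder": ["elbow"]}
```

Because both bodies serialize into the same token space, "apply a transformer to it" then works exactly as it does for text, with one motor command read out per joint token.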
Extrapolating into the future, if we can greatly expand this robot vocabulary, I envision MetaMorph 2.0 will be able to generalize to robot hands, humanoids, dogs, drones and even beyond. Compared to Voyager, MetaMorph takes a big stride towards multi-body control. And now, let's take everything one level further and transfer the skills and embodiments across realities. Enter IsaacSim, Nvidia's simulation effort. The biggest strength of IsaacSim is to accelerate physics simulation to 1,000x faster than real time. For example, this character here learns some impressive martial arts by going through ten years of intense training in only three days of simulation time. So it's very much like the virtual sparring dojo in the movie "The Matrix." And this car racing scene is where simulation has crossed the uncanny valley. Thanks to hardware-accelerated ray tracing, we're able to render extremely complex scenes with breathtaking levels of detail. And this photorealism you see here will help us train computer vision models that will become the eyes of every AI agent. And what's more, IsaacSim can procedurally generate worlds with infinite variations so that no two look the same. So here's an interesting idea. If an agent is able to master 10,000 simulations, then it may very well just generalize to our real physical world, which is simply the 10,001st reality. And let that sink in. As we progress through this map, we will eventually get to the upper right corner, which is a single agent that generalizes across all three axes, and that is the "Foundation Agent." I believe training the Foundation Agent will be very similar to training ChatGPT. All language tasks can be expressed as text in and text out. Be it writing poetry, translating English to Spanish or coding Python, it's all the same. And ChatGPT simply scales this up massively across lots and lots of data. It's the same principle.
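The "master 10,000 simulations" idea above is, at heart, domain randomization: vary each generated world so widely that the real world looks like just one more sample from the training distribution. A toy sketch, with invented parameter names rather than IsaacSim's actual API:

```python
import random


def sample_world(rng):
    """Draw one randomized world configuration (domain randomization)."""
    return {
        "gravity": rng.uniform(9.0, 10.6),        # m/s^2, jittered around Earth's
        "friction": rng.uniform(0.3, 1.2),
        "light_intensity": rng.uniform(0.2, 2.0),
        "terrain_roughness": rng.uniform(0.0, 0.5),
    }


def generate_worlds(n, seed=0):
    """Procedurally generate n worlds; with continuous parameters,
    no two are alike in practice."""
    rng = random.Random(seed)
    return [sample_world(rng) for _ in range(n)]


# 10,000 training realities; the real physical world is "the 10,001st".
worlds = generate_worlds(10_000)
```

An agent whose policy holds up across every draw from this distribution has, by construction, stopped relying on any one simulator's quirks, which is the whole bet behind sim-to-real transfer.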
The Foundation Agent takes as input an embodiment prompt and a task prompt and outputs actions, and we train it by simply scaling it up massively across lots and lots of realities. I believe in a future where everything that moves will eventually be autonomous. And one day we will realize that all the AI agents, across Wall-E, Star Wars and Ready Player One, no matter if they are in physical or virtual spaces, will all just be different prompts to the same Foundation Agent. And that, my friends, will be the next grand challenge in our quest for AI. (Applause)
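The input/output contract just described fits in a tiny interface sketch. The class and toy policy below are invented for illustration, not any real system: the point is only that switching bodies or tasks means switching prompts, not models.

```python
class FoundationAgent:
    """One model; every embodiment and every task is just a different prompt."""

    def __init__(self, policy):
        self.policy = policy  # a single shared policy across all realities

    def act(self, embodiment_prompt, task_prompt, observation):
        # Both prompts are plain inputs that condition the same policy.
        return self.policy(embodiment_prompt, task_prompt, observation)


# Toy stand-in policy that just shows how both prompts shape the action.
agent = FoundationAgent(lambda body, task, obs: f"{body}|{task}|{obs}")
```

Under this framing, a Minecraft bot, a humanoid, and a drone are the same `agent` object queried with different `embodiment_prompt` values.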