Andrej Karpathy — AGI is still a decade away
Why this matters
Auto-discovered candidate. Editorial positioning to be finalized.
Summary
Auto-discovered from Dwarkesh Podcast. Editorial summary pending review.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forward · Mixed · Opportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 23 full-transcript segments: median 0 · mean 0 · spread 0–0 (p10–p90 0–0) · 0% risk-forward, 100% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: medium.
- Emphasizes governance
- Emphasizes safety
- Full transcript scored in 23 sequential slices (median slice 0).
Editor note
Auto-ingested from daily feed check. Review for editorial curation under intake methodology.
Episode transcript
YouTube captions (auto or uploaded) · video hguIUmMsvA4 · stored Apr 8, 2026 · 623 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/andrej-karpathy-agi-is-still-a-decade-away.json when you have a listen-based summary.
Hi everyone, my name is Keyon Vafa, and in this video I'm going to go over some research that tries to answer the question: what are the implicit world models inside of generative AI models? Now, to see what I mean by this, let's go over some functionalities we may want from AI systems like large language models, or LLMs. One exciting functionality is that they can synthesize concepts. We're also excited that they can apply concepts to new domains, such as in few-shot or in-context learning. Lately, we've focused on reasoning, and it's amazing that just with the right prompting, Gemini can solve math olympiad problems. Today's models can also be creative while being grounded in reality, as you can see not only in text but also in video models. These may all seem like different uses, but they share a common kind of functionality: they could all be performed by a model that has understood the world to some capacity, or in other words, a model that has learned the correct world model. Now, this raises a couple of questions, like what it would mean to have a correct world model, and how we would even be able to evaluate or measure whether a model has understood the world. It may seem that one approach is to evaluate understanding the way we do most other kinds of evaluation in AI: using benchmark test questions. These benchmarks, such as the questions on AP exams, are used to signify understanding in people. But do they signify understanding in AI? Benchmarks test a very narrow kind of understanding: the ability to answer test questions. If a person aces an AP math exam, we'd say they understand math, but it requires strong assumptions to extend this logic to LLMs. And the evidence we have is that LLMs learn things in non-human-like ways. GPT-5 does incredibly well on the AIME math competition, but it thinks that 4.11 is larger than 4.9.
To give a bit of a crude analogy, using human tests to evaluate the understanding of LLMs is a bit like evaluating the vision capacities of an AI model using a vision exam like the one we get at an optometrist. It's just not what it's designed for. In this video, I'm going to take a step back and focus on the question of how to evaluate a generative model's implicit world model. I'll start by discussing some general strategies for evaluation, and then I'll focus on a couple of notions of world models from my own research. For each notion, I'll walk through what a world model means in that setting, discuss evaluation metrics that can be used to test it, and go over empirical results. At the end of this video, I'll discuss related ideas and offer some ways forward. So, let's start with some background. It may seem like evaluating understanding is challenging, and it is, but there's a general strategy that different groups have used: take constrained problems for which we know the true world models, and use those as test beds to answer important questions like what a world model is. One of the early test beds was Othello. For those of you who don't know, Othello is a board game that involves players taking turns placing black and white tiles on a board, a little bit like Go. Every time a piece is put down, it results in a cascade of other changes on the board. To make a test bed, researchers collected transcripts of Othello games and trained a transformer model to predict next tokens on these sequences. The model never saw the true world, which in this case is the Othello board, but it was trained to predict the moves that were made in games. And the evaluation question here is: did this transformer uncover the implicit structure of the Othello board? Now, even though these are toy domains, the test beds are used to create general procedures that can be applied elsewhere.
In this video, I'm going to focus on two research projects I've worked on, which form evaluation metrics for world models in two kinds of settings: one where we care about a world model for a single task, and one where we care about a world model for many tasks. So I'm going to start with the single-task setting. And here our test bed is going to be Manhattan. Specifically, we collected a dataset of taxi rides that took place in Manhattan, and we tokenized each taxi ride into a sequence of directions, sort of like a language for Google Maps directions. We took a transformer and only trained it on these directions. We never allowed it to see the map of Manhattan. But if a model succeeds at this task, you can imagine using it to generate new rides. It could provide the functionality of Google Maps or Waze without ever requiring hard-coded maps or navigation algorithms, just sequences of trips. So when we ran this exercise, the model looked good. It proposed legal turns nearly 100% of the time, and it could find valid routes between new points 98% of the time. So had the model discovered the world model of Manhattan? Answering this question first requires defining a world model, and there's a natural definition here. The map of Manhattan can be described by a structure known as a deterministic finite automaton, or DFA. A DFA is pretty much a way to represent a set of rules, and it's made up of two things: a set of states (here, each state is an intersection in Manhattan) and transition rules (the legal turns at each intersection and where they take you). For each taxi ride being taken, its trajectory can be tracked by the DFA, which tells you where each turn ends up. And while the DFA here is a map of Manhattan, DFAs are a general way to describe structure and can be used to model many other kinds of tasks.
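To make the DFA framing concrete, here is a minimal sketch in Python. The state names, tokens, and the toy 2x2 grid are all hypothetical stand-ins for the real Manhattan automaton, which has one state per intersection and one token per turn direction.

```python
# A minimal DFA sketch: states + transition rules, as described in the talk.
# The toy grid below is illustrative, not the actual Manhattan map.
class DFA:
    def __init__(self, transitions, start):
        # transitions: {state: {token: next_state}}
        self.transitions = transitions
        self.start = start

    def legal_tokens(self, state):
        """The set of tokens (turns) allowed at a given state (intersection)."""
        return set(self.transitions.get(state, {}))

    def run(self, tokens):
        """Track a trajectory; return the final state, or None if any turn is illegal."""
        state = self.start
        for t in tokens:
            if t not in self.transitions.get(state, {}):
                return None  # illegal turn: sequence is invalid in the DFA
            state = self.transitions[state][t]
        return state

# A toy 2x2 grid "map": four intersections, tokens are directions.
grid = DFA(
    transitions={
        "A": {"east": "B", "south": "C"},
        "B": {"south": "D"},
        "C": {"east": "D"},
        "D": {},
    },
    start="A",
)

print(grid.run(["east", "south"]))   # two different routes...
print(grid.run(["south", "east"]))   # ...reach the same state "D"
print(grid.run(["south", "south"]))  # illegal second turn -> None
```

Note that two different token sequences can land in the same state, which is exactly the property the compression metric later in the talk exercises.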
This notion gives rise to a definition, which is that a generative model like a transformer recovers a DFA if every sequence it generates is valid in the DFA and vice versa. Now, there's a little bit I'm glossing over here, which is what valid or invalid means in the context of a transformer. But you can imagine this being determined by some probability threshold: a sequence is valid according to a transformer if all of its token probabilities are above 1%, or something like that. Now, while it may seem intractable to compare these two enormous sets, this definition gives rise to a nice result, which is that if a model always predicts legal single next tokens, it must have recovered the DFA. This connects to how most LLMs are trained, which is to make accurate single next-token predictions. And it also suggests a test: measure how often a model's predicted single next tokens are valid. But there's a problem with this test. To see it, let's consider a simplified game called cumulative Connect 4. This is like the real game of Connect 4, where players take turns placing tiles in columns and moves are allowed in unfilled columns, except that there are n rows and players keep placing tiles on the board even after there are four in a row. The next-token test here works by providing a model with the beginning of a cumulative Connect 4 game and seeing what percent of the time it predicts legal next moves. But it turns out there's a very simple model that gets 99% accuracy for a large enough board: always predict that everything is legal. This model has clearly understood none of the structure of the world, but it does well because many states have the same possible next tokens. For example, these two boards are clearly different, but the set of legal next moves allowed in each of them is the same. So, while perfect next-token prediction implies world model recovery, near-perfect next-token prediction doesn't mean much.
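The "always predict everything is legal" baseline can be simulated directly. This is a sketch under assumed parameters (board sizes, game lengths, and random play are my choices, not the talk's): with a tall board, columns rarely fill, so claiming every column is legal is almost always right.

```python
import random

# Why the next-token test is weak, using the talk's "cumulative Connect 4":
# columns only become illegal once full, so a trivial model that claims
# every column is always legal is nearly always correct on a tall board.
def trivial_model_accuracy(n_cols=7, n_rows=100, n_games=100,
                           moves_per_game=650, seed=0):
    rng = random.Random(seed)
    correct = total = 0
    for _ in range(n_games):
        heights = [0] * n_cols  # tiles stacked in each column so far
        for _ in range(moves_per_game):
            legal = [c for c in range(n_cols) if heights[c] < n_rows]
            if not legal:
                break
            # The trivial "model" claims all n_cols columns are legal.
            correct += len(legal)  # claims that happen to be true
            total += n_cols        # all claims made
            heights[rng.choice(legal)] += 1  # random legal play
    return correct / total

print(f"{trivial_model_accuracy():.3f}")  # close to 1.0 for a tall board
```

The accuracy is high despite the "model" encoding no board structure at all, which is the point: near-perfect next-token legality is compatible with a vacuous world model.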
While single next tokens aren't enough to differentiate states, there's a classic result from language theory that's here to help. The Myhill–Nerode theorem states that for every pair of distinct states, there's some number k such that there's a continuation of length k that's allowed by one state and not the other. So for example, A and B here are two separate intersections, or states, in Manhattan. They have the same legal next turns, so they would look the same if you just looked at next-token prediction. They also have the same set of legal next two turns. But crucially, they don't have the same set of legal next three turns. So here the Myhill–Nerode boundary is three. There exists some k that differentiates states, but it doesn't necessarily need to be one. Now, this result motivates new metrics for testing world models, which we call compression and distinction. At a high level, these metrics go beyond next-token prediction and instead go to the full boundary defined by the Myhill–Nerode theorem. There are more details in the paper, which is linked below, but briefly: compression tests whether a model recognizes that the same state can be reached in different ways. There are multiple ways to get from 14th Street to Times Square, and a model should provide the same continuations no matter how you got to where you are now. Distinction is a little more general, and it says that if two sequences lead to distinct states, a model should distinguish their length-k continuations, where k is defined by the Myhill–Nerode boundary. So how about some results? We train models on multiple kinds of data, not just shortest paths but also simulated traffic and random walks, and show that all models have greater than 99.9% next-token accuracy. But they perform poorly on these metrics. They fail to compress sequences that lead to the same state, and they fail to distinguish sequences that lead to different states.
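The Myhill–Nerode boundary between two DFA states can be found by a brute-force breadth-first search over state pairs. This is a sketch of the idea, not the paper's implementation; the toy transitions mirror the A/B intersection example above, where the states only separate at depth three.

```python
def myhill_nerode_boundary(transitions, s1, s2, max_k=10):
    """Smallest k such that some length-k continuation is legal from one
    state but not the other (the Myhill-Nerode boundary from the talk).
    transitions: {state: {token: next_state}}; a missing token is illegal."""
    frontier = [(s1, s2)]  # state pairs reached by a shared legal prefix
    for k in range(1, max_k + 1):
        next_frontier = []
        for a, b in frontier:
            ta, tb = transitions.get(a, {}), transitions.get(b, {})
            if set(ta) != set(tb):
                return k  # a single token separates these states at depth k
            next_frontier.extend((ta[t], tb[t]) for t in ta)
        frontier = next_frontier
    return None  # indistinguishable up to max_k (possibly equivalent states)

# Toy analogue of intersections A and B: identical legal turns for the
# next two steps, but they diverge at step three.
toy = {
    "X": {"n": "X2"}, "Y": {"n": "Y2"},
    "X2": {"e": "X3"}, "Y2": {"e": "Y3"},
    "X3": {"n": "Z"}, "Y3": {},
    "Z": {},
}
print(myhill_nerode_boundary(toy, "X", "Y"))  # -> 3
```

The compression and distinction metrics then compare a model's continuation distributions out to this depth, rather than stopping at single next tokens.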
Now, at this point, you may be thinking: hold up, why should we care about world models? After all, I just told you that the model could find shortest paths. The reason to care is that not having the right world model means a model can perform poorly on different but related tasks. I'll show you one such task: detours. When we add detours by forcing models to take certain turns, they often fail to reroute. Models that navigate well without detours perform poorly once detours are introduced, precisely because they haven't learned a coherent world model. We also tried to visualize each model's implicit map of Manhattan. I won't go into the details here, but we did this by trying to reconstruct each model's map of Manhattan in a way that would be generous to the model. As a sanity check, we tried doing graph reconstruction on the true world model and found that it recovered the true map. We also tried adding transcription errors to sequences from the true model, to match the transformer's error rate, and found the reconstructed map to be imperfect but largely sensible. But when we tried to reconstruct the transformer's map, we found nonsense. The model assumed many roads existed that don't, and it also added physically impossible flyovers. And this was all despite being generous to the model in the way we reconstructed the graph. So what's happening here? The most helpful way for me to think about it is that you and I have a single map of Manhattan. If I'm on 14th Street and I go a block north, I don't throw out my map. I just shift my eyes, because I know I'm on 15th Street. Meanwhile, transformers have to reconstruct a map at every turn they make. If these reconstructions are inconsistent, the world model is incoherent. We also extended these metrics to large language models, focusing on logic puzzles. Most LLMs we tried incorrectly differentiated between two ways of arriving at equivalent states, indicating poor world models for these kinds of logic problems.
So to summarize, we've looked at a definition of world model based on DFAs and evaluation metrics inspired by the Myhill–Nerode theorem. On both our Manhattan test beds and other applications, models could achieve good predictive performance while forming poor world models. While these definitions and tests are specific to today's generative models, we've been here before. These results relate to the Rashomon effect, a term coined by Leo Breiman in his 2001 paper "Statistical Modeling: The Two Cultures". The effect describes the fact that two separate regression or classification models can achieve similar performance in dramatically different ways. And it's relevant here because it shows that a model can achieve near-perfect prediction without recovering structure from the true world model. But what if a model gets perfect predictions? All of these results have been for one notion of world model, but sometimes we may want another notion. To see why, it's helpful to look at an example. So consider the problem of predicting how planets move in the night sky. Physicists and astronomers have worked on this problem for centuries, and a breakthrough model was offered by the German astronomer Johannes Kepler in the 17th century. He used geometric properties to pinpoint the future locations of planets in the night sky. These properties couldn't explain why the planets move the way they do, but they offered near-perfect predictions. A little later, Isaac Newton built off this progress to develop rules, which we now know as Newtonian mechanics, to predict orbits. These mechanics could not only predict orbits but also explain their movement. So, who was right? In one sense, both were: they both made perfect orbital predictions. We'd be fine with both if we only cared about predicting future movements, which is what the definitions so far have addressed. But Newton provided more generality. The same laws he developed could be used to solve new problems.
Anything ranging from pendulums to cannonballs to rockets. And many of the uses of foundation models we're excited by involve this kind of generality, such as few-shot and in-context learning. So, I'll now shift and focus on how we may measure world models if we care about performing well at many different tasks that involve shared structure. It's helpful to think about a foundation model as a learning algorithm: it takes in a small amount of data from a new task and gives us a new predictive model for that task. Now, I'm not going to specify the way it adapts. We can think about it generally, but this could be via fine-tuning or in-context learning, for example. And again, it's helpful to go back to a theoretical result, the no-free-lunch theorem for learning algorithms. Loosely, this theorem states that every learning algorithm has an inductive bias toward some set of functions, or in other words, problems it's better at solving from a limited amount of data. This gives rise to a natural notion of a world model: a restriction over functions described by a state space. In the illustration to the right, every row is a function, and the shadings describe different values we'd allow the function to take, which all obey a state structure. With these tools, we can see that a foundation model's inductive bias reveals its world model. In other words, how a model behaves when it extrapolates from small amounts of data reveals its structure. We came up with a method for testing this, which we call an inductive bias probe. The probe has two steps. Given a foundation model, we apply it to many small synthetic tasks that obey some world model. We then look at statistical patterns in the functions it learns: we look at how it extrapolates to see if it follows patterns that would be dictated by the true world model. So as an example, let's consider the case where the state space is discrete. Here, two metrics pop out.
One is whether a model's learned functions respect state. The picture in the middle shows what happens when the learned functions don't respect state. The other is the opposite: whether a model's learned functions successfully distinguish state. A failure of this is illustrated on the right. So these metrics test both kinds of failures, and they're analogous to Type I and Type II errors in classification. As an example, we consider a 1D state-tracking problem, essentially a very small map with k states in a line. And while we find good inductive bias for small state spaces, we find that inductive biases worsen quickly as the number of states grows, for many different kinds of models. We also find that state space models like RNNs and Mamba are consistently better than transformers. We also try these metrics with models trained on planetary orbits. We train a transformer to predict the future locations of planets across many solar systems. And we find that, like Kepler, the model makes good predictions. But has it learned Newtonian mechanics? When we use the inductive bias probe, we find a low inductive bias toward Newtonian mechanics. The model makes similar predictions for orbits with different states and different predictions for orbits with similar states. To illustrate this, we try fine-tuning the model to predict the force vectors between planets using a small amount of data. Force vectors are a cornerstone of Newtonian mechanics, so a model that's using Newtonian mechanics should easily pick this up. But we find that the transformer struggles to learn force: it learns something nonsensical. When we use symbolic regression to try to estimate the model's implied force law, we find the law to be not only nonsensical but also fickle: it changes depending on the galaxy it's applied to. And it's not just our domain-specific transformer. We find that LLMs, which have surely been trained on the text of Newton's laws, struggle at this too.
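The two-step probe and the respect/distinguish metrics above can be sketched schematically. Everything here is illustrative: `adapt` stands in for however the foundation model adapts (fine-tuning or in-context learning), and the state function, inputs, and scoring are my simplifications, not the paper's API.

```python
import random

# Schematic inductive bias probe: apply a learner to many small synthetic
# tasks whose labels depend only on state, then check how it extrapolates.
def inductive_bias_probe(adapt, state_of, inputs, n_tasks=50, n_train=6, seed=0):
    rng = random.Random(seed)
    respects, distinguishes = [], []
    states = sorted({state_of(x) for x in inputs})
    for _ in range(n_tasks):
        # Step 1: a small synthetic task; labels are a function of state only.
        label = {s: i % 2 for i, s in enumerate(rng.sample(states, len(states)))}
        train = [(x, label[state_of(x)]) for x in rng.sample(inputs, n_train)]
        f = adapt(train)
        # Step 2: statistical patterns in how the learned function extrapolates.
        for x in inputs:
            for y in inputs:
                if x >= y:
                    continue
                if state_of(x) == state_of(y):
                    respects.append(f(x) == f(y))       # same state: should agree
                elif label[state_of(x)] != label[state_of(y)]:
                    distinguishes.append(f(x) != f(y))  # different state and
                                                        # label: should differ
    return sum(respects) / len(respects), sum(distinguishes) / len(distinguishes)

# A learner that tracks state perfectly scores high on both metrics.
state_of = lambda x: x % 3          # hypothetical 3-state world
inputs = list(range(30))

def state_aware_adapt(train):
    learned = {state_of(x): y for x, y in train}
    return lambda x: learned.get(state_of(x), 0)

respect, distinguish = inductive_bias_probe(state_aware_adapt, state_of, inputs)
print(respect, distinguish)
```

A learner with the wrong state abstraction (say, one keyed on surface features of the input rather than `state_of`) would score poorly on one or both metrics, which is the failure pattern the talk reports for transformers on orbits.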
So if inductive biases aren't toward the true world models, what are they toward? One possibility is that models trained to predict next tokens conflate sequences that have similar legal next tokens, even if they correspond to very different states. As an example, two different Othello boards can have the same set of allowed legal next tokens. When we fine-tune a model trained on Othello moves to predict the boards, we find that it often reconstructs boards incorrectly, but reconstructs them well enough that the legal next moves from the reconstructed board are correct. This suggests that foundation models may only recover enough of the state to calculate next tokens. So in this multiple-task setting, we've used a definition of a world model as functions that obey a state space. We developed inductive bias probes as evaluation metrics and used planetary orbits as a test bed, which also extended to other problems. Now I want to take a few moments to go over some related ideas. The metrics I've described have taken a functional approach to evaluation: they've evaluated models by their performance on input-output pairs. But there's another possible approach that's mechanistic: evaluating a model by its inner workings. The field of mechanistic interpretability works on developing tools for understanding the inner workings of neural networks. There are a few goals, but a big one is using this understanding to steer model performance in some way. For example, the Claude team at Anthropic released a demo of a Claude model which would always divert conversations to be about the Golden Gate Bridge. Of course, you wouldn't actually want a model that always talks about the Golden Gate Bridge, but you can use similar tools to steer models so that they're used safely. Now, there are many interesting results, which I don't have time to get to, about adapting these methods to study world models.
For example, the researchers from the Othello paper I discussed earlier show that you can intervene on the model's activations so that the model predictably acts as if it were playing a different board. Now, if we comprehensively understood the inner workings of a model, we could use this understanding to evaluate whether the model has understood the world. But how feasible is comprehensive understanding? It turns out it's quite challenging. Chris Olah, one of the leaders of the field, has called this the dark matter of neural networks: a large fraction that cannot be easily understood or interpreted. Neel Nanda, another leader of the field who's now at DeepMind, had another way to say it: if you're aiming to explain 99.9% of a model's performance, there's probably going to be a long tail of random crap you need to care about. We can see why this makes it challenging to evaluate world models mechanistically. The original Othello-GPT paper found an emergent nonlinear internal representation of board state. A follow-up study found that there was actually a linear representation, and that the reason for the discrepancy was that Othello boards can be represented to humans in different but equivalent ways. But another follow-up found that Othello-GPT actually learned a bag of heuristics instead: not a coherent board, but a bundle of rules like "if the move A4 was just played and B4 is occupied and C4 is occupied, update B4 + C4 + D4", a rule that doesn't generalize across the board. And I think this makes clear where mechanistic tools are useful, because fortunately these tools can still be used to edit models to make specific improvements even if we don't have comprehensive understanding. But evaluating world models requires comprehensive measurement. We want to be able to take any general procedure and see if it improves a world model. If we only understand part of a model, we can't compare two models by only looking at the parts that we understand.
Now, another related idea is to study model architectures by looking at their theoretical capacities. For example, a really interesting literature tries to understand which formal languages can theoretically be recognized by different architectures, and these results can guide the ways we use models. Another related idea is the use of world models in reinforcement learning, or RL. In RL, a world model has a technical definition that's somewhat different from how I've been using the term: a world model in RL is a predictive model of an environment's dynamics. For example, if we want to train an agent to play a video game, we may have it develop a world model of the game, or how its actions result in different outcomes. These world models are trained on states explicitly, which differs from our goal of evaluating implicit states. Moreover, the goal in RL isn't necessarily recovering structure; it's primarily about making better predictions about an environment or improving an agent's planning capabilities. So, we've seen that generative models like transformers can do amazing things with incoherent world models, but this incoherence makes them fragile for other tasks. So, where should we go from here? One possibility is to accept the fact that these world models are imperfect. Fortunately, models don't need to have correct world models to be useful. So, one approach is to zoom in and evaluate models in specific places, such as the places where people actually use them. But of course, we should also work on improving world models, and there are many ways of doing this. One is to design new architectures that have better world models, and we've seen some promising results from state space models. Another approach is a neurosymbolic one: combining neural networks with formal reasoning modules like probabilistic programs. We don't need to stop at new architectures.
We can also think about new training procedures, such as those that go beyond next-token prediction. Another possibility is to find better ways to incorporate human feedback into world model training. Or we can try training models by taking inspiration from causality and causal representation learning. Overall, there are many promising ways to work on improving world models, and evaluation metrics will help get us there.