Library / In focus

AXRP · Technical alignment and control

Mechanistic Interpretability with Neel Nanda

Why this matters

Frontier capability progress is outpacing confidence in control; this episode focuses on methods that can close that reliability gap.

Summary

This conversation examines technical alignment through Mechanistic Interpretability with Neel Nanda, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).



Across 182 full-transcript segments: median 0 · mean -1 · spread -150 (p10–p90 00) · 0% risk-forward, 100% mixed, 0% opportunity-forward slices.

Slice bands
182 slices · p10–p90 00

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes control
  • Full transcript scored in 182 sequential slices (median slice 0).

Editor note

A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.

ai-safety · axrp · technical-alignment · technical

Play on sAIfe Hands

Episode transcript

YouTube captions (auto or uploaded) · video 3YbE7zybc5k · stored Apr 2, 2026 · 5,807 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.


Hello, everybody. In this episode I'll be speaking with Neel Nanda. After graduating from Cambridge, Neel did a variety of research internships, including one with me, before joining Anthropic, where he worked with Chris Olah on the Transformer circuits agenda. As we record this he is pursuing independent research and producing resources to help build the field of mechanistic interpretability, but around when this episode will likely be released you'll be joining the language model interpretability team at DeepMind. In this episode we'll be talking about his mechanistic interpretability research, and in particular the papers "A Mathematical Framework for Transformer Circuits", "In-context Learning and Induction Heads", and "Progress Measures for Grokking via Mechanistic Interpretability". For links to what we're discussing you can check the description of this episode, and you can read the transcripts at axrp.net. First of all, is it right to say that you understand yourself as a mechanistic interpretability researcher? Yep, yeah, it feels pretty accurate. So what is this, uh, mechanistic interpretability thing? Sure, so this is an annoyingly fuzzy question. The starting point I'd go with is an analogy to reverse engineering a compiled program binary to its source code. Neural networks are this thing that emerges from weird processes like stochastic gradient descent that's learned a bunch of parameters that led it to do a task competently. We have no idea how it works, and I want to be able to understand what it has actually learned and why it does what it does, but I'm holding myself to the extremely high standard of: I want to actually understand what the internal cognition of the system is, like what are the algorithms it's running such that it takes the inputs, produces a series of intermediate things, and then produces some legible output. Okay, so that was potentially a mix of two things. You've got this neural network that emerges from training, and one thing you could be trying to understand is, okay, what are the algorithms encoded in that final neural network, what's sort of the quote-unquote cognition going on in it. And another thing that you were maybe hinting at was, okay, during the training of the neural network, maybe there's some reason that some parts of this final network emerged as the thing that would go into the network at the end of training. So how much of this is about what's happening in the final network versus what was incentivized during training and how that story all went? So to a first order it's entirely about looking at the final network. Okay. To me the core goal of the field is to be able to look at a final network and be really good at understanding what it does and why. I think that this does pretty naturally lead to being able to look at a thing during training; my grokking work, which I'll hopefully chat about, is a pretty good example of this, where I tried really hard on a single network at the end of training and then what happened during training just fell out. And I think a bunch of the more ambitious ways to apply mechanistic interpretability are by influencing training dynamics, but to me the core of the field is: can we take a single model checkpoint and understand it, even if this takes a lot of time and a lot of effort? And I think that has feedback with training dynamics. I also think looking at it during training might be necessary to explain some things, like weird vestigial organs that were useful early in training
but are superseded by some later algorithm which are no longer needed but still stick around okay do you actually see that in practice I'm not aware of a concrete example of seeing that I would be very surprised if it doesn't happen okay all right sure so now that we know what uh what you mean by this mechanistic interpretability thing why do you do it what's the point so on a purely personal level it's just really fun and that actually does carry quite a lot away from me I also think it's actually very important so at a high level I think we're living in a world where we're gonna get computer programs that are human level intelligent or beyond that are doing things and that it's possibly going to be pretty hard to tell whether these things are doing things because they're actually aligned or because they're learned to be deceptive or to exploit flaws of the training process or just actually trying to be aligned a bit of mislearn some things and I would really prefer to live in a world where we actually understand these things and I think that mechanistic interpretability is not obviously the only path to get to a point where we can make some claim to understand systems but I think it's a promising path that to me achieves a pretty high level of rigor and reliability and understanding I also think that neural networks are a really complex system and it's really hard to make progress in a complex system without any grounding of something you actually understand that you can build off of and I just think that lots of things we're confused about the networks where they're just in ml as a whole in alignment just kind of like anything you might care about I expect it will get easier the more we can claim that we actually understand what's going on inside these models sure I also think it's kind of scientifically fascinating yeah yeah and like so there's there's this kind of definite vision of like okay we want to do mechanistic interpretability because at some point we're going to train a model and we're going to want to know a whole bunch of facts about the cognition of that model in order for us to evaluate like do we think it's good do we want to release it and there's this other broad story of like well you know we don't really know how things work if we like understood the what was happening this would like help us get better abstractions it would help us understand the science of deep learning and then we'd just be in a better place generally I'm wondering how much weight do you put on those two stories sir can you repeat the question so there are two stories of why mechanistic interpretability could be useful one is you want to do it so that you develop these mechanistic retrovertibility tools and the way you use them is one day you're going to train a model and you're going to going to want to know whether it's like a good model or a bad model in terms of How It's thinking about stuff and so you use mechanistic interpretability to understand its cognition story two is like look we just have mechanical interpretability lying around so that just in general when we do deep learning we want to know what's going on and we could say ah well because we've done so much about this interpretability in general we know that there's these like squeebles that always happen in neural networks and like probably this weird thing about your loss going down is is due to squeeble formation and you know you need to like tweak it to drive out the squeezles or something so there's one story where it's like uh 
you know we need to use it for this thing there's another story where you're like okay if I want to understand how like I don't know what this what value of squeeble is going to be important but I know that like we're eventually going to have to do good science of deep learning and effective interpretability is the way to do that uh so these are two ways that mechanistic interpretability can be valuable I'm wondering if you have a sense of like in your head is one of these stories like way more compelling than the other or like a little bit more compelling or do you think like no these are like two basically equally good reasons yeah I would say that the take a system and figure out whether it's doing good things or bad things is somewhat more compelling to me but it's a somewhat messy question and I want to try to disentangle a couple of things here the first is what do I think is actually going to push on reducing X risk where being able to take a specific system and ask questions around is this doing good things or bad things feels more well quite a bit more important to me than answering general science of deep learning Mysteries because I think it's not obvious to me they're getting good at signs of deep learning is actually not good for x-risk yeah and it wouldn't surprise me if it's that bad I also kind of think about theories of change in and the way I see a lot of mechanistic interpretability work at the moment is we are just trying to figure out the basic rules of networks and how they work and what kind of things seem to form and just like what are we even doing hmm and I think that it can be a mistake to be too goal directed when trying to do basic science and I expect that lots of things that get done will turn out to be useful and interesting in lots of unexpected ways for future things people want to do and I also think that just if we actually get good at interpreting systems this just seems like it should be really useful in loads of different areas and a bunch of ways I can't really predict the final clarification would be that I generally think about goals of mechanistic interpretability as kind of having a sliding scale of difficulty in terms of I don't know if we reach a point where we could take a single system and spend two years of time and a ton of researcher and like less Guild worker effort trying really hard to notice lack of alignment in it I'm like that's reasonably useful not incredibly useful like I feel reasonably good about that then there's a world where we can have the ability to take a system and in like a day come up with a reasonable idea of if it's deceptive or not and this seems more useful there's worlds where we could turn these into kind of automated robustness checks that we could give to our limited researchers so they can validate their techniques there's things where we can actually insert some metrics into training and like you do gradient updates on those there are cultural effects where if we could credibly say to every lab making an AGI here's a proto-agi that was made and is clearly deceptive in this really subtle and hard to see way that we can identify with this tool maybe you should be careful and take a limitable seriously and and it seems like theirs depending on how well we succeed there's just a sliding scale of how useful this all could be hmm yeah that seems reasonable I guess sort of related to some of that I think one reaction you could have to some of this uh mechanistic interpretability work is like oh or sometimes you're 
like really worried about accentual risk from AI like uh some people I know think like oh if we just do more AI research the like kind of standard default way it's going like maybe we're just all gonna die I think from that perspective there's some worry of like hmm all these like mechanistic interpretability papers that get published and that help the world understand like how deep learning is actually working and what's actually going on like maybe that's kind of helping AI get generically better even more than it's helping like any alignment specific stuff I'm wondering what you think of that concern and how much maybe like what your threshold what do you think the threshold should be for like just not publishing mechanistic interpretability work yeah so I think this is a pretty valid concern and it is definitely by far my strongest guess for ways my actions could turn out to be negative on the world and I don't know if I have like a confident and satisfying answer to this question I.E definitely think that if the ambitious claims I'm making that we could actually get to a point where we understand networks in any meaningful sense a true then this should clearly be useful for making them better I think it's very plausible we don't get good enough at that to like really matter for improving models but potentially get good enough at it to like help with auditing a system or help with understanding how to align it I also think it's perfectly plausible things are the other way around I partially just have a broad heuristic that it seems just obviously the case that a world where we actually have some insight into these weird black box systems is a world where they will be easier to align and we will be safer but I don't really know how to form careful principle answers to these questions one example um that maybe makes me feel a bit more a bit more on the concern side is uh so anthropic we have this induction heads paper that we're going to talk about a bit later and there was this recent paper called S4 that tried to introduce a new architecture based somewhat on lstms that was better at tracking long-range dependencies was trying to compete with the Transformer and a thing that was a significant part of the paper was comparing them to induction heads which are one of the main ways Transformers track flow rate dependencies and it sounded like induction heads might have inspired this architecture and they don't know if S4 is actually any good or the kind of thing it's actually going to matter but like ah it is plausible we live in a world where there was a important Insight that was inspired by some mechanistic interpretability work and like ah that doesn't seem great in terms of my takes on work that I think should versus shouldn't be published I generally think that anything which is deeply tied to something that could help us make Frontier models like Palm better I'm very very concerned about and like things which are kind of looking directly at this models noticing things that are off about them and trying to fix them I'm more relaxed about things that are kind of looking at smaller language models things that are kind of explaining specific circuits or giving better techniques and Frameworks for doing this or things that are more like General insights into deep learning that aren't specific to like the frontier models where my intuition here or something like actually understanding the systems uh actually understanding what the hell is going on in deep learning oh does Hell count I 
think you're like yeah you're allowed to say hell yes um yeah so my guess is that something that's more about understanding deep learning generally is safer than something that's kind of very focused on debugging large language models because my guess is basically just that understanding what's going on in deep learning seems like it should just be useful in both directions pretty broadly while understanding specific details of things go wrong in a lot of language models seems much more useful on the frontier making them better yeah but I'm just pretty confused about a lot of these questions sure a related question which uh you might be confused by um one thing I kind of Wonder is like suppose we do this kind of magnistic interpretability work uh largely on like on a smallish models or models where it's slightly easier or moles that exist today which are probably somehow qualitatively different from like super from future you know super crazy models um how scale invariant do you think we should think of the insides as being like you could imagine like you know there are all these things that are true of like these one layer Transformers that aren't true of you know gtn or like they're things that are true before you figure out like uh the meaning of life and then once you figure out the meaning of life like all your circuits change uh so I'm sure you'll be shocked to hear that I'm also confused about this one all right so in part I think this is just an empirical question that we don't currently have a ton of data on the main data is the induction heads paper which we'll talk about a bit later yeah but we found this simple circuit in two layer intentionally models and then found out that this was a really big deal and those models but also a really big deal and actually occurs in every Transformer we've looked at up to about 13 billion parameters and yeah for context how many parameters are in something like I don't know the chat GPT models that people can interact with on their browsers so we don't actually know for chat GPC because openai isn't public about this but gpt3 is 175 billion okay so 13 billion is like about one order of magnitude less than yes something that's maybe state of the art yes I will cancel that I think parameters are somewhat overrated as a way of gauging model capability um the chin deepminds chinchilla paper the main interesting result was that everyone was taking models that were too big and training them all too little data and they made up 70 billion parameter model that was about as good as Google brain's Palm which is 600 billion but with notably less compute but this might be splitting has so related so that actually brings up something I wanted to ask about so something that I see is somewhat analogous to mechanistic interpretability is work on scaling laws so I think the first work uh correct me if I'm wrong but I think the first scaling laws work was by Danny Hernandez and some collaborators at openai I don't believe Danny was an author on the paper oh he wasn't let me just check so I'm not nope uh Jared Kaplan Sam McCandless I believe the two lead authors oh I I think I'm thinking of another thing he was an author on a transfer a scaling laws for transfer learning paper that came out later on uh okay but um so I think that like a lot of the initial scaling laws work was by like perhaps the more AI alignment or AI Express focused part of opening eye but I kind of see it as like on the one hand an attempt to understand like what's going on with deep 
learning like how's it happening in a sort of similar vein well they're not the same as mechanistic interpretability I also think that it's ended up as like you know primarily used to like help The Wider World train models better so I'm wondering firstly do you think like that's a good analogy and secondly do you think it's a concerning analogy so I'm gonna weekly argue that it's a bad analogy okay I think it's pretty plausible that the scaling rules work has been fairly net negative and has been used a lot by people just trying to push the frontier of capabilities though I don't have great insight into these questions I think that to me scaling laws don't feel like they get me any closer to a question around is this model aligned or not I don't feel like I understand anything about the model's internals I feel like it could I feel like it pleasantly helps me forecast things and I do think that forecasting things is important but it feels important in a very different way to why I think mechanistic interpretability is important I guess I will also note that my best story about this getting lost my net positive is that it just gave it just meant the path from here to egi was a lot smoother and more continuous so we're better prepared and can do better alignment work along the way rather than someone just eventually realizing oh wait uh turns out that I already had all the computer would take Mick HEI and on that front I think that is a thing that is useful yeah to me it's getting almost feel like the kind of statistical physics what happens when we get lots and lots of parameters and average out lots of tiny interactions into some big smooth power and the mechanistic interpretability feels much more like let's zoom in and try to really understand the details of what's going on rather than zipping out too much that a lot of the details abstract away okay would it be right to understand your position as like scaling laws is like less useful for AI expert production or AI alignment than mechanistic interpretability might be but maybe they're like similarly useful for um just building AI in general or do you think that scaling was was sort of knowably more useful for building AI I would predict that scaling laws were knowably much more useful okay so this is partially hard to really engage with because I'm sitting here knowing the key results of scaling rules and their practical implications yeah like not sitting here knowing the key results mechanistic interpretability and if I came back to you tomorrow and was like oh with this one weird trick you can turn gpd3 into AGI and I found this by staring at its weights I'd probably be like ah that is worse than scaling laws okay I don't expect this to happen to be clear but yeah I think the difference I would make is that if I'm a company trying to build an AGI and I hear about scaling laws that I'm like oh that's so useful um I was gonna say it's so useful because I know kind of how to scale and like what my hype and some amount on um how big my model should be and how much data I need to collect and how good I can expect it to be for the amount of computes on reflection I actually think that the most important consequence was just this idea that scaling will continue to work and that we should just try really hard to make models bigger and it's seeming less and less likely that's the magic point where everything breaks and you've wasted a billion dollars hmm and that's kind of very tied to their ability to forecast and I mean it's plausible to me 
if you've actually got Goods at mcinturp we'd be good at forecasting um I'm probably going to call it macintut because mechanistic interpretability is a mouthful fair enough I don't know I think yeah I think the two things here are forecasting versus understanding where forecasting seems more important to changing the actions of the people who make the HEI understanding seems more important for disentangling what's going on inside of it and how happy we are that yeah okay so I guess I have a few questions about roughly like where you think mechanistic interpretability is as a field first one is intuitively I kind of think of there as being some Spectrum where at the low end of the spectrum is like basic I don't know like I'm detecting edges or I'm noticing that like something is text rather than picture and then maybe a little bit higher level is like I noticed that it's you know all caps a bit higher level as I noticed that it's written in English rather than French and then at like a very high level is something like reasoning where I'm like recalling you know how to do I don't know modus ponens or like in which situations mathematical induction is and like what uh or like what a useful like inductive hypothesis might be in this um in this case you know where where it's something like perception versus reasoning is the Spectrum and if we think that like to the extent that um language models can do things that are closer to reasoning as well as like basic you know knowing what language uh their input is being written in it seems like if we want to understand them we're going to have to like understand this whole spectrum of cognitive abilities I'm wondering so first I guess like this this understanding of there being one kind of spectrum is potentially contestable so feel free to contest that if you want um but if you don't I'm wondering like well sorry spec Spectrum being the kind of sensory to reasoning yeah the kind of sensory to reasoning um or you know sensory like cognitive algorithms or sensory bits of cognition to reasoning bits of cognition um yeah I'd actually argue there's three bits to the Spectrum oh yeah um there's Sensory neurons at the start that take the raw input like pixels or tokens and convert them to the software model actually wants reasoning that does a bunch of processing and then motor neurons which converts the processing to the actual desired outputs ah yes of course um I don't know you get neurons in language models that do fun things like uh so uh the word nappies or diapers if you're American it's delivered into two tokens but conceptually the model clearly figures out oh nap users here and there's a re-tokenization neuron which says Ah this token is the m in nappies and the nappies feature is there so I should output appease sure yeah so so okay if I think if there as being these three types of abilities maybe there's maybe their points on a triangle where you can go around how how much of that triangle do you think mechanistic interpretability is good at and do you think there are any regions of the triangle where it's going to have to like change Tech or do something different the simplistic answer I'd want to give is that we are good at the start and end and bad at the middle I'm not even sure how true this is in part because we're currently just kind of bad at everything okay and in part because so the way I normally think about reverse engineering a network is you start with a thing that is interpretable and ideally two things that are interpretable 
and you try to figure out how the model's gone from one to the other hmm um or at least you just look nearby the one that is interpretable look what's being done with it and try to move outputs some sort of induction thing yeah um notably nothing to do with the induction induction heads and we just like obviously the input to the model is interpretable the output of the model is interpretable we know exactly what these are interestingly in language models it's often been easier to understand things at the output than things near the inputs so so one important thing to bear in mind when thinking about language model interpretability which is most of what I'm going to be talking about is that unlike classic neural networks or convolutional networks where the key thing is you've got layers of neurons each layer's input is the output of the previous layer essentially and you can kind of think of it as a little iterative steps to processing there are two big differences with the Transformer the first is um the city of the residual stream and the second is this idea of attention heads so the idea of the residual stream is the input uh this this is somewhat used in some image models the idea is that the inputs to a layer is actually the accumulated output of all previous layers and the output of that layer just adds to this residual stream and in standard framings of things like residual image networks this is often framed as a skip connection around the layer that's like a random thing you add on but isn't super important but I think the way Transformers seem to work in practice is that every layer is kind of incrementally updating the central residual stream one demonstration of this is that there's this technique called the logic lens where you just cut off the like last several layers of the model and look at how good it does if you just set those to zero and then convert this bit to tokens and uh turns out models are like a lot of the time kind of decent at this though it varies between models and my interpretation so the reason this masses is it means it's harder to think of the model in terms of kind of early middle and late because processing can just so I normally think of the thing the model does as kind of having paths through from the input via some layers through to the outputs you know add these pods just tend to spend a long time in the residual stream which makes it harder to reason cleanly about things yeah so that's the residual stream the second messy thing to bear in mind is attention yep so Transformers are fundamentally sequence modeling networks their input is a sequence of tokens which you could basically think of as words or sub words and at each step they're doing the same processing in parallel for every element of the sequence there's a separate residual stream for each token position and it's using the same parameters in parallel and the main thing that so obviously it needs to move information between positions because we don't just want it to be a function of only token of only the current token and the way it does this is with these attention layers that are made up of several attention heads and each head it has some parameters devoted to figuring out which other positions of the sequence are most relevant to the current token and it's got a separate set of parameters that says once I figured out which bits are most relevant what information from that positions were to draw stream should I move to the current token and the reason this matters is that it means that 
a decent chunk of the Transformers computation comes down to routing information between different positions figuring out what information to root and this means that in practice a decent chunk of what we're good at in Transformers is understanding this information routing and how it works okay and some of the things you've understood it's like not obvious to me whether you should even consider this kind of sensory or reasoning stuff it's just like the model takes the raw input does some creative thinking where it figures out which tokens are most relevant to the output on a specific token and then it learns to root information from there to the end but once it's figured out which token is relevant getting the answer is kind of easy for example it's trying to figure out what the number at the start of a line is and it learns to look at the number at the start of the previous line and it's like the previous line began with five so this should be six the hard part of the task is figuring out where the five is or the fact that it's five yeah once you've figured that out it's like okay cool it's very easy task okay so so it sounds like what you're saying is like because of the way Transformers work it's not obvious that like you can divide cognition into like purely sensory purely reasoning or purely motor or like that might not just be the best breakdown kind of I do think that tracks something real about Transformers in part what I'm saying is just I think the sensory reasoning in motor motor is a good description of what the neurons in the model are doing but it also spends a significant amount of its computation on attention which is yeah I I generally break down what a Transformer does into these two separate components one of which is rooting the most relevant information between positions and sequence and the other one of which is processing the information once it's been brought there okay and notes that information can be rooted around like several times for example um there is this great work by David bound Kevin mang look in their room paper looking at how models did factual recall like give it a sentence like the Eiffel Tower is located in the city of in upwards Paris and they found reasonably suggestive evidence that the model first Roots the separate tokens of the phrase the Eiffel Tower to the final token it then realizes this is the Eiffel Tower and looks up the fact that it's in Paris and then that gets rooted to the final of token in order to predict that Paris comes next okay and how does it look up that the Eiffel Tower is located in Paris uh so this is a great question um to which the answer is my prediction for what's happening is that the model has so Transformer uh neurologics in general seem to represent um features that is things they have computed about the inputs as directions in space where they can look up whether a feature is there by projecting onto that direction and my prediction is that so um Eiffel Tower gets tokenized as capital e i f f El and Tower and I predict that the model is taking um the model has a tension that moves the E if L features to the tower token and then there are some neurons which say if tokens three back as e tokens two back is f token one back as L and current token is Tower then output Paris neuron Paris feature and then Blake heads later in the network move the is Paris feature to the city of token don't be of token another important thing to know about Transformers we feel may not be aware of is the way models like gpd3 are trained 
is they're just given a great deal of text from the Internet or books or whatever and the text is split up into tokens and these are trade the model is trained to predict the next token and the way the attention is set up information can only move forward in the sequence not backwards so for each position of the sequence the model is trying to predict the token after that yep and its value at that position of the sequence can only be a function of that token in earlier terms okay so we move back sorry going up a little bit I guess I kind of want to ask how do you characterize like the kind of thing or the kinds of cognition that mechanistic interpreter really currently has a good grasp of and do you see any walls towards like oh if we want to do this kind of cognition like we're gonna have to think differently so honestly to me it feels somewhat more tied to how things are mechanistically represented in the network than it does to exactly what the cognition is okay so both matter so a lot of the mental moves that I'm making when I'm trying to reverse into the network are around discovering the thing that things are localized within the model that there are some heads or some neurons or at least some directions in space that represent the key feature for this problem and that most other things do not matter for this input and can be safely ignored and the next thing that matters can be computed solely from the heads or neurons that I identified earlier that are important and one thing you need to be able to do here is to break down the model's internal representations of things into kind of independently understandable units like an attention head or a neuron being the classic thing we want to do here and if computation is kind of localized in this way then our life is much much easier and if it is not localized then we are much less good at this there's this core problem called superposition which we're probably not going to talk about that much later but in brief models want to represent features as directions in space and when they have uh sorry Transformers have these MLP layers where there's a linear map followed by a activation function called jellu where you just take each element in the vector and apply this weird function that's basically a smooth smooth out of relu to it and then you have a linear map back to the residual stream and so we expect the MLPs are being used to process information which means being used to compute some features we expect the features are going to be represented as directions in space and the space being like the the kind of vector space that's spanned by like these neurons right yes one way to think about it is just the uh middle bits of the MLP layer is just you've got this list of say 4000 neurons and each neuron has a scale has like a value and you can also think about this as a vector and 4 000 dimensional space yep and we kind of really hope that the model is using these neurons to represent uh the features that the directions the features correspond to are aligned with like one feature per neuron and this often seemed to be the case in image models and the main reason you might think it would be the case is that so models internally seem to care a lot about Computing features of the inputs things like this token as a verb or this is a proper noun that appears in a European capital or things like that and we expect that models want to reason about these independently but um and if you have a feature per neuron you can reason about this 
independently because the relu or the jello acts on each neuron independently but the this also forces the model to only have as many features as it has neurons and models do this thing called superposition which you can essentially think of them think of as them simulating a much larger and sparsome model where they have many more features and they have neurons and kind of squash them in in a way that has a bunch of interference messiness but which lets them represent a bunch more features and this does seem to happen in Transformers a bunch and it's really annoying and it's plausible to me that there's just some cognition that happens with aligned with neurons and some that happens that's not aligned with neurons and there may not be that big a difference in this cognition but one is currently much harder to interpret than the one without superposition the final answer to the question is we're just much better at interpreting the cognition around uh attention hits that we are about neurons and again attention ahead so the bit of a model that's due with routing information between the different token positions and in part because this is just much more legible um you can literally look at the attention patterns in the model and it's very easy sorry attention pattern being which previous positions does the head think is most relevant to the current position okay it was very easy to like be misled by these but it does give you quite a lot of information sorry why is that easy to be misled so so mechanistically what an attention head is doing is it's taking us a token for each token position um which I'll call the destination token it's looking over all possible previous tokens um including the current token let's call these Source tokens at identifying some weighted uh assigning weights to each of them such as the weights fall out up to one uh that this is actually an important detail and then copying information from each of the residual streams of those token positions to the current thing yep and importantly copying the information at the residual stream after layer um I don't know after layer 10 can potentially have basically nothing to do with the actual token inputs to that position at layer zero but if you naively look at an attention pattern you'll be like ah it's learning about this token with maybe some contextual information yeah but it's possible yeah it's possible it's entirely picking up on stuff that was brought in by previous attention heads yeah yeah and there's some suggestive evidence that models sometimes use kind of really really easy tokens as spare storage and stuff like that or uh another thing is attention patterns need to add up to one for boring reasons and sometimes the head wants to be off and it will attend to the first token to be off anyway look at this be like ah I gave it the sentence the cat said on the mat and all these heads really care about the token the that must be really important and it's like no okay so I have a few follow-up questions before I'm moving on a little bit um way back we said that uh one difference that uh language models had compared to these image classification networks is that there's this thing called the logic lens where you could like take the transformation that you're eventually going to do on the residuals stream at the end of the network to get out these Logics which are basically probabilities of various options for the next token and you could kind of apply that like in the middle of the network and you could see that it 
was sort of getting this representation of a decent guess at the answer it strikes me that you could you could try doing the same thing in image models right has somebody tried that and found that it doesn't work to your knowledge I am not aware of anyone trying this uh one thing I will note is that having the input format be the same is very important and the way that I believe image models even those with the residual stream tends to work is they take the input image which is big and then kind of progressively scale it down with these pulling layers and then at the end they just totally flatten out what they've got and [Music] maybe have another couple of fully connected layers before producing an output and like if the shape of the output is not the same as the thing if if the shape of an intermediate activation is not the same as the thing that is then unembed like mapped to the class outputs this will just technique just doesn't even work in principle oh uh why not so oh sorry so the naive technique of you just delete these layers doesn't work because oh because the layers are changing the number of dimensions in the residual stream yeah exactly and in a Transformer the residual stream is this consistent object the entire way through the network that each thing is incrementally updating there are a bunch of image models that I think do have something more like a residual stream in particular Vision Transformers uh all the rage and there is a bunch more image architectures I'm not aware of anyone doing this I would predict it would kind of work um a further caveat is that a thing you could try is by training another linear map on the earlier activation with a different shape to like class probabilities and this is the kind of thing I'm sure someone has tried yeah or you could imagine like deleting so so the way it normally works is you have some convolutional layers and then you like flatten it out or project down to some Vector something that like your MLP multi-layer perceptron layers can act on and you can imagine just like deleting some of the convolutional layers right so instead of it just being a simple matrix it's your your MLP layers doing the unembedding or the like getting to the cost probabilities um okay all right that's that's a tangent all right uh a final follow-up before I go to a second follow-up of the thing I said uh maybe half an hour ago is um yeah when we were talking about like these types of cognition in these language models right the place you started off was like okay you want to have like um kind of you want to have this sandwich where the the pieces of bread are like things you understand the filling is things you don't yet understand but you can kind of infer it from the things outside and you mentioned like the the very output of these language models was understandable basically and I'm wondering so I think this is roughly right in cases where like I don't know you're doing something like write like a typical person would write and that's what the language model is trying to do you could imagine like language models that are used for playing diplomacy or whatever like maybe the things that are being output like it's saying hey please invade German sorry diplomacy is like this uh board game where you like make allies and stab people in the back and like you know fight Wars in Europe basically and in diplomacy you send messages to people like hey I'm definitely not going to invade you do you want to team up on uh invading Germany instead and like the 
meaning of that might not be so obvious right so I'm wondering like yeah how how understandable do you think these like language model outputs are sure so first off to clarify when I say the output is understandable I literally mean when the model says hey that's going to be Germany it is outputting the words hey let's go Germany yeah and just like this is actually like a significant upgrade over looking in the middle of a transformer where I just what does this mean man yeah it's these like tensors of numbers and yeah I don't know what the 8.7 means yeah it's just you just have nowhere to get started unless you can ground in something that matters sure and yeah even if the model was outputting seemingly random gibberish the fact that I can say things like hmm it says flurgal rather than blurble why is the flugel logic higher than the blubble logic let's go look at what's happening nearby see if I can get some traction there a real thing that I can do in the particular case over the diplomacy playing Agent I would predict that I would predict that the diplomacy playing Agent is outputting things like let's go invade Germany because it has some internal representation of what the person that's talking to you wants what the agent wants and how to manipulate the person that's talking to you to achieve this yeah yeah excited that paper is terrifying yeah right and uh for for our audience you might not be terrified what was so scary about that paper ah so they trained this model to play diplomacy this game which is about you have a bunch of players and they form a shifting alliances and sometimes stab each other in the back and he needs to convince other players to go along with your plans so you can win out if everyone else and the model they trained despite their weird marketing claims that it was um honest and Cooperative player recently Liza manipulates two people successfully in a way that gets them to do what it wants and help it with the game and I'm like no this is this is the thing I'm most scared about AI learning how to do what are you doing yeah I think in their defense they're using a definition of lie where as as based on Seinfeld it's not a lie if you believe it temporarily even if you can make sure that you won't believe it later um yeah it's uh uh good thing uh we don't have uh world champions of diplomacy being really really relevant to how AI turns out huh no that might be mean um but uh and yeah I think something that is worth there's something like worth emphasizing here which is the if the model says let's go and bait Germany and it's lying if you were just doing some black box analysis it might be really hard to get any traction here because the model is lying it's saying things that seem kind of perfectly reasonable but aren't as track of what it believes but if you are aiming for the very ambitious goal of actually understanding its cognition behind what it said foreign it's like a very different question and if you want to be able to credibly claim to have understood it then you kind of must have been able to notice why it did and what the hidden things behind this were such that I think saying its output is not interpretable is kind of a category error in some sense you you mean like its output do you mean it's a category error in the sense that like look its output is understandable somehow and you could understand why it did what it did yeah I'm just kind of unconvinced there is such a thing as an uninterpretable output uh because there should always be a reason the 
model output it I mean but you could say the same thing about any activation or like anything happening inside a network right there's some reason that it's that way that's a fair point I think the thing that I'm pointing out here is something closer to what optimization pressure is placed upon the model where it's never optimized to have any specific activation be any specific thing that it is optimized to have the logit corresponding to the Daniel token be actually related to the Daniel token yeah yeah so you somehow know that like it's at least being graded on the output so somehow the outputs are going to be related to it getting high grades in the cases where it was tested yes yeah that seems fair okay uh I guess my penalty question about mechanistic interpretability generally is um at the very start you alluded to this idea that like one thing you could do is like okay we have a bunch of tools such that if you have some neural network a bunch of humans can spend two years like trying to understand what's going on and succeed but there's this different possibility you mentioned of like can we somehow automate this like uh can we get like I don't know progress measures or can we get like you know analysis tools that can just like take our take our neural Nets and like tell us what we would have concluded if we thought about it really hard I'm wondering yeah how one thing to add to that oh yeah is I think there's a broad spectrum of automation from like have a metric have like a tool that's kind of doing what we would have done anyway to actually being able to have an AI system take over more and more of the actual cognitive work and judgment calls and intuitions and a lot of my optimism for a while where we might be able to actually have fully reverse engineered and let's see what's going on in a Frontier Model like gpt5 is I expect it's most likely to happen if we have systems that are more capable than the language models we have today but maybe not quite a human level that can do a lot of the work for us just because something being really labor intensive is much less of an issue when all jobs are automated including Arts okay but that's very much an ambitious goal of the field not a thing we're near today sure sure how how close do you think we are to automation at any level of the spectrum and like what do you see any like promising Avenues so I actually recently wrote a post on open problems in mechanistic interpretability um sorry I'm running a sequence on open problems mechanistic interpretability and I recently wrote a post on techniques and tooling so I actually have I'm usually Chris the thoughts on this oh great where yeah so a couple of axiers that I break down kind of techniques and approaches on there's General versus specific so General techniques are a broad toolkit that should work for many circuits including ones we have identified charts that's probably actually defines circuits so when I say circuits what I mean is some sub part of a model's parameters and cognition that is doing some specific task for example Adam the Eiffel Tower example the bit of a model which takes the raw tokens efl Tower and which produces the this is an Eiffel Tower feature would be an example another circuit for induction heads as we'll discuss later and yeah so General techniques the broad toolkit that should work for many circuits the logit lens that I mentioned earlier is one example of this that in principle shoulders work for every sequence and then there's specific techniques which 
are focused on identifying a single type of circuit or a single Circuit Family one really dumb example is models often form previous token heads which attend to the previous token and surprise it's really easy to identify these because you just give a bunch of inputs and you look at the average attention page the previous token and some heads are really really high on this and they're obviously previous tokenheads and this is an example of a technique that is Trivial to automates and you could just run on every head in any model I think I would actually love someone to do is just to write all of these dumb metrics for basic kinds of heads that make a Wiki for like every hairs and a bunch of Open Source models just wouldn't be that hard would be a cool project yeah the next axis is exploratory versus confirmatory so exploratory techniques are like I'm confused this model is doing something what is going on I want to get more data I want to form hypotheses I want to get some evidence for or against my hypotheses I just like want to iterate a bunch and this tends to be pretty subjective and involves a bunch of human judgments and it can involve things like visualize a bunch of data and try to look for patterns in it do some kind of dimensionality reduction uh just generating a bunch of ad hoc ideas on the fly or just kind of quick and dirty tools with fast feedback loops that may rely heavily on human interpretation and then confirmatory techniques are things that let you take a circuit and say I think this circuit is what's going on behind this Behavior I now want to go and rigorously test and verify or falsify this belief Redwood research have this awesome new algorithm called causal scrubbing which I think is an excellent fairly automated confirmatory technique though this is all the Spectrum and they claim to me that causal scrubbing can also work as an exploratory technique because you just keep performing hypotheses and trying to verify them and sometimes see and sometimes can tell you which way in which ways they break yeah the final axis is uh rigorous versus suggestive where there's like techniques that I think really tell you something real and reliable about a network one example I like a lot here is activation patching it's the idea behind activation patching is you have two inputs to a model which give two different answers like the Eiffel Tower is in Paris and the Coliseum is in Rome you feed in the Colosseum inputs but you save all of the activations from the Eiffel Tower inputs and you pick one activation in the model and you edit it and replace it with the iPhone the relevant activation from the Eiffel Tower run okay for example you replace the residual stream on the final token at the first layer or something and you do this for just every activation you think is interesting and you look for ones where this patch is enough to flip the answer from Rome to Paris and it turns out when you do this that most things don't matter some things matter a ton and that just patching in a single activation can often be enough to like significantly flip things and somebody else is just like extremely good evidence of that thing I copied over is something real and it just contains the core information that differentiated the Eiffel Tower from the composite and a cool thing you find if you do this is it turns out that if you patch the activations on the armor token of the Colosseum then early on what the armor the um the final token oh the final token of the Coliseum um if you patch in the 
thing from the Eiffel Tower run uh early on the values on the final token of Coliseum are the ones that matter everything else doesn't matter and then there's a band of layers where it slowly moves to the final token and then the Colosseum stops Madrid at all and it's the final final token of the sentence that matters and this is not particularly surprising it's very cute sure sure and then there's a bunch of techniques that are much more kind of suggestive where it gives some evidence I think it should be part of a toolkit you use with heavy caution one example of this is a really dumb technique figuring out what a neuron does is you look at its Max activating data set examples yep you just run it on a bunch of data and you look at the data that most excites the neuron and sometimes when you do this you just look at the text you're like okay these top 10 texts are all about they're all recipes and it activates really strongly on the word um knife in carving knife this is a carving knife in your own sure sure and there are like a bunch of ways this can be misleading says this great paper called the interpretability illusion that found that if you take but and a neuron and butt and you give it a bunch of Wikipedia text it looks like it's about uh dates and facts if you get a bunch of other kinds of text it's about song lyrics and random stuff like that and the technique is a bunch of limitations but I think it tells you something real about the neuron okay and the final axis is how much things are kind of automatable and scalable versus super labor intensive and painstakingly staring at neurons and looking at weights you know that's like roughly how I break down the field of like how to think about techniques and tools okay and the main areas where I'm excited to see progress are I want us to get a lot better at finding circuits this involves a lot of things having good infrastructure so you can just find the common things you do a bunch and do them really fast uh one of my side projects at the moment has been making this Library called Transformer lens that tries to be decent infrastructure for doing all of the things I want to do when I'm trying to reverse engineer gpd2 like cash all internal activations edit whatever activations I want things like that then there's understanding the tools we already have where do they break what are the ways you might apply them and then actually shoot yourself in the foot uh there was this really great paper from redwoods on indirect object identification well they found a circuit in gpd2 small to solve this simple grammatical task of indirect object identification among Wild Thing They found is that there were these heads that mattered a lot but if they a police hits the head just like deleted it set it to zero then another head and the next layer took over and did that head's task for it so the head was really important for this backup head meant that if you delete it to the head it would look kind of uninformed hmm um interesting my personal guess sorry how did they know that the head was important if like when you deleted it a backup had took over uh so the techniques they were using there one of them was activation patching you you like copy the heads over to run on a different input that's not to flip things right other um technique is because this head was at the end of the network you can use this variant of the logit lens called direct logic attribution where the idea is that so the way you compute the logits from the network the kind of 
The idea is that the way you compute the logits from the network (essentially the log probabilities over next tokens) is you apply a linear map to the final residual stream, and the residual stream is the sum of the output of every layer of the network, and you can often break the output of a layer down into the sum of different components, like different heads. So you can look at a head's impact on the logits by just applying this linear map to that head's output. If you do this on the indirect object identification task, you're like: oh, head 9 in layer 9 massively boosts the correct token relative to the incorrect one, I guess this head matters, I need to ablate it. And then it's like: ah, the model's about as good. But if you look at these other heads, then one head which used to be suppressing the correct output now does nothing, and there's another head that wasn't doing much that's now way more significant. And 'I didn't expect this' is a really useful insight. Yeah, just kind of having a better toolkit, so we both understand things better and also have more techniques. Then, having good gold standards of evidence. What does this look like? It looks like conceptual understanding of what it even means to have found a circuit, but it also looks like good tooling and infrastructure to really verify that your hypothesis is correct. One of the most important skills you need if you want to do good mech interp research is the ability to aggressively red-team your results, look at all of the many flaws in them, and try really hard to break them. And one thing that's pretty hard, especially in a growing field, is having it be the case that other researchers can produce results that you think are compelling, and having consensus in the field on what it even means to have identified a circuit, and what standards of evidence I should hold myself to, to be really sure what I'm doing is correct. Redwood's causal scrubbing work is the best attempt I've seen here, by far. Even that has limitations and shortcomings, ways that multiple hypotheses could seem the same but not be; though I haven't engaged with it hard enough, so that could be completely wrong, apologies to Redwood in advance. And yeah, the final thing is just automation: taking all of these things we know how to do, where I know what I would do if I wanted to spend a thousand hours reverse engineering GPT-2 small, and passing off as much as possible to code, trading researcher time for GPU time. At the moment we're not even at the point where there are a lot of productive things I could just get a Python program to do, let alone stuff I think I can reasonably hand off to something like GPT-3, but this seems like a thing that is very important to try to make happen at some point, and a thing that I just expect to become easier as the field matures and we do more things.
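Returning to direct logit attribution for a moment: a rough sketch of the computation described above, in the same TransformerLens style. The prompt and threshold are illustrative, and this ignores the final LayerNorm for simplicity:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
model.set_use_attn_result(True)  # cache each head's output separately

prompt = "When John and Mary went to the store, John gave a drink to"
_, cache = model.run_with_cache(model.to_tokens(prompt))

# direction in the residual stream whose dot product is the Mary-vs-John logit gap
logit_dir = (model.W_U[:, model.to_single_token(" Mary")]
             - model.W_U[:, model.to_single_token(" John")])

for layer in range(model.cfg.n_layers):
    head_out = cache["result", layer][0, -1]  # [n_heads, d_model] at the final position
    dla = head_out @ logit_dir                # each head's direct contribution to the gap
    for head, v in enumerate(dla.tolist()):
        if abs(v) > 1.0:                      # arbitrary display threshold
            print(f"L{layer}H{head}: {v:+.2f}")
```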
Okay, cool. So this has been an overview of mechanistic interpretability as a whole, or at least your view of it. Before we go to your work specifically: if people are interested in getting into the field, do you have any advice or resources for them? Yeah, so actually one of my big side projects at the moment has been trying to make it easier to get into the field, because I was annoyed by how many things I had to learn for myself. I've recently written this post called Concrete Steps to Get Started in Mechanistic Interpretability, which tries to be a definitive point to get into the field and learn about things, and tries to guide you through the prerequisite skills, how to learn more, and how to get yourself to the point where you're actually working on an open problem. Some accompanying things: I have this comprehensive mechanistic interpretability explainer, where I decided I wanted to write down every concept I thought people should know about mechanistic interpretability, with a bunch of examples and context and intuitions, and surrounding things around Transformers and machine learning that I thought were underrated. I got kind of carried away and it's 33,000 words, but people tell me they're very insightful words to read. Okay, and it might be useful just as a reference to look things up as we're talking in this interview. I also have the sequence I mentioned, called 200 Concrete Open Problems in Mechanistic Interpretability, where I try to map out what I think are the big areas of open problems in the field: how I think about them, why they matter, how I'd approach doing research there. Then I list a bunch of problems, attempting to rank them by difficulty, and aiming for a thing where someone who's excited about the field and has done a bit of skilling up could pick one and maybe get some traction. Okay, cool. We should have links to those in the description, or if you're reading the transcript you can just click on the words he said. All right, so we're going to talk about three papers you've helped write. Before we dive into those, I'm wondering if you can give an overview of your particular line of research, on a somewhat higher level than just talking about the individual papers. Like, why do these papers exist? If there were a subfield of mechanistic interpretability that these papers comprise, what is that subfield, and what's it trying to do? Sure. Okay, so the three papers we want to discuss are A Mathematical Framework for Transformer Circuits, which was this excellent work I was part of while I was working at Anthropic; In-context Learning and Induction Heads, another paper I was involved in while at Anthropic; and Progress Measures for Grokking via Mechanistic Interpretability, which is independent research I did after leaving. I should give the general caveats that I was extremely involved in the grokking work, did quite a lot of the mainline research on my own, and feel well placed to speak authoritatively about it and claim a bunch of the credit. I was somewhat involved in A Mathematical Framework, but my collaborators Chris Olah, Nelson Elhage and Catherine Olsson did far more of the work than I did, so deserve a very large amount of credit, and I was even less involved in In-context Learning and Induction Heads, where Catherine, Nelson, Chris and the rest of Anthropic deserve a lot of credit. I can say things about these papers, but these are very much my takes. I should also thank the co-authors of my grokking work, Lawrence Chan, Jess Smith, Tom Lieberum and Jacob Steinhardt, who contributed a bunch. General epistemic caveats and credit sharing out of the way: A Mathematical Framework is, in my opinion, far and away the coolest paper I've ever been involved in. The reason this paper exists, fundamentally, is that if you are trying to reverse engineer a network, you need to understand what the network is as a mechanistic object: mathematically, what is it
as a function; in terms of code, what code would you write to make it; and conceptually, what are the moving parts inside of it, and in what ways is it principled to break it down into sub-components such that you could refer to a circuit made up of several bits (in which ways is that valid, and in which is it not). A Mathematical Framework basically tries to do this for Transformers, and in my opinion it has dramatically clarified my personal intuitions for how to think about all this, in a way that's pretty fundamental for reverse engineering a Transformer at all. The paper particularly focuses on attention, because, as noted, that's one of the biggest differences between Transformers and image models, where the earlier work happened; sorry, one of the biggest differences between Transformers and the image model architectures on which the early work happened, since a bunch of image models use attention nowadays. Yep. There are also some results in the paper where we reverse engineered zero, one and two layer attention-only models and found some interesting circuits, notably induction heads, but to me this is secondary to the framework itself: being able to get traction at all, in a way that was grounded in principles and looking at some of the fundamental bits of the model. In-context Learning and Induction Heads is a very different style of paper. To me, the core themes of the paper are around universality, which is this idea that the same circuits and cognition happen in models at very different scales; training dynamics, where we looked at models during training; and emergence, or phase transitions, where we found sudden changes in a model during training. So, the headline results: we found this circuit called an induction head. The thing an induction head does is it detects and continues repeated text. We found this in these toy two-layer attention-only models, and it turns out that induction heads seem to be a big part of the reason that models can do in-context learning, which is jargon for using text very far back in the prompt to predict the next token. It's kind of surprising that models can do this at all: we're talking, like, you've got a 10-page document and it learns to use some text on page two to usefully predict what happens after page 10.
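A hedged sketch of how this capability is typically measured: compare the loss at early and late token positions, on the assumption that if the model exploits far-back context, later positions should be noticeably easier. The file name is a placeholder:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # illustrative model
text = open("long_document.txt").read()            # placeholder for any long text
tokens = model.to_tokens(text)[:, :1024]

# per-position loss rather than the usual average over the sequence
loss = model(tokens, return_type="loss", loss_per_token=True)[0]
print("early tokens:", loss[:50].mean().item())
print("late tokens: ", loss[-200:].mean().item())  # lower if context is being used
```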
You can show Transformers are good at this by looking at their performance token by token position, and showing that they're much better on later tokens, in a way that's because they're using the earlier tokens, not just because later tokens are inherently easier. Okay. And it turns out that this capability seems to significantly rely on these induction heads. These induction heads occur in all models we looked at, up to 13 billion parameters, and across the models the induction heads all appear at around the same time, in what we call a phase transition during training: there's a fairly short period where you go from no induction heads to induction heads, and it exactly coincides with the period where the model gets good at in-context learning. There's a bunch more evidence in the paper about exactly why we believe this, but that's the headline claim. Sure. And to me, the point is just using the mechanistic insights we found to understand a real, fundamental property of networks: that they can do this in-context learning. So, the third category: my grokking work I'd put somewhere in the science-of-deep-learning camp, where I'm trying to understand some underlying principles of deep learning models and how they work, with some focus on phase transitions and emergence. So grokking is this mystery in deep learning, famously found in this OpenAI paper from the start of last year, where they found that if they train a small model (a Transformer, in their case) on a simple algorithmic task, such as modular addition, composition of the permutation group, or modular multiplication, and give it, say, a third of the data to train on, with the other two-thirds as a held-out test set, and keep training it on that third of the data for a really, really long time, then it would initially memorize the training data, but if you trained it for long enough it would suddenly generalize, in what they called grokking, and go from being really, really bad on the unseen data to being very good on the unseen data. And in particular, this is kind of strange because during this transition it's not getting that much better at the task it's actually being trained on, right? Yeah, there's messy nuance we can get into later, but 'it's not getting that much better' is a reasonable thing to say; 'it has achieved perfect accuracy on the training data' is a thing we can confidently say. And so what I did is I trained a smaller model to grok modular addition, a somewhat simplified one-layer Transformer, replicated grokking (it clearly grokked), and then did some painstaking, high-effort reverse engineering, where I discovered it had learned this wild algorithm that involved trig identities and discrete Fourier transforms to do the modular addition, and I could, with pretty high confidence, describe how that was implemented in the model, with a bunch of caveats and holes we can get into later. Then I could look at it during training using this understanding, and designed what we call progress measures based on this mechanistic understanding. Looking at it during training, I saw that the sudden grokking actually broke down into three phases of training, which I call memorization, where the model just memorizes the training data; circuit formation, where it slowly transitions from the memorized solution to the trig-based generalizing solution, preserving train performance the entire time; and cleanup, where it suddenly got so good at generalization that it's no longer worth keeping around the memorization parameters. Importantly, these models are trained with weight decay, which incentivizes them to be simpler, so the model decides to get rid of the memorization, and this is what results in the sudden grokking: when the model is partially memorized and partially generalized, the generalizing bit is really good on the test data, but the memorizing bit is bad enough that on net it's just performing badly.
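The paper's actual progress measures are built from the reverse-engineered algorithm itself; the sketch below is only meant to convey the flavor, checking how concentrated a (stand-in) embedding is in the Fourier basis over input tokens, since a grokked modular-addition model reportedly concentrates on a few frequencies. None of this is the paper's code:

```python
import torch

p, d_model = 113, 128
W_E = torch.randn(p, d_model)  # stand-in for a trained modular-addition model's embedding

# a grokked model's embedding concentrates its norm on a handful of frequencies;
# a memorizing model's looks roughly flat in the Fourier basis
fourier = torch.fft.fft(W_E, dim=0)
power = fourier.abs().pow(2).sum(dim=1)           # power per frequency across d_model
frac = (power.topk(10).values.sum() / power.sum()).item()
print(f"fraction of embedding power in top 10 frequencies: {frac:.2f}")
```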
Okay. And the kind of high-level principles from this: I think it's a good proof of concept that a promising way to do science of deep learning and understand these models is by building a model organism, a specific model to study that exhibits a mysterious phenomenon clearly, understanding it, and then using this understanding to ground an exploration of the deeper principles. The second reason I think it's exciting is that emergence, or phase transitions, is this thing that just seems to recur in a bunch of models, at a bunch of model scales, and across a bunch of different dimensions: GPT-3 can do addition in a way that smaller models can't, for example, or how induction heads suddenly form during training. And we can totally imagine that some very alignment-relevant things might suddenly emerge. There's this fascinating paper from Alexander Pan about how reward hacking can emerge if you scale a model up. Where reward hacking is getting higher reward in a way that you didn't think it would? Roughly, yes. To take an extremely modern example: a student getting a good mark on their essay because ChatGPT wrote it for them, rather than because they actually learned the content. Sure. And so one hoped-for solution here is these progress measures, which are metrics you can run on the model that track progress towards the eventual emergent phenomenon, and the work is a proof of concept that if you really understand some example of the behavior you care about, you can design mechanistic-interpretability-inspired progress measures. Okay. And I think the work has many limitations, including as a model of both of these two things, but I also think it was a very cool and fun project that has some useful insights to teach. Cool. So now we've had a quick summary of each paper; how about we dive into each one individually? Sounds good. All right, so the first paper is A Mathematical Framework for Transformer Circuits. It's got a lot of authors; I guess the first few are Nelson Elhage (is that how it's pronounced? I think so), okay: Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, sorry if I mispronounce anyone's names, and then the co-authors are several people at Anthropic, with Chris Olah being the corresponding author. There's a contribution statement at the end, if people want to decipher the long, impenetrable list. Yeah, one thing I actually like about Anthropic is that I found these author contribution statements to be good reading; they give some insight into what this list means. I think it's nice. Yeah, Chris has an amazing blog post about credit and how he thinks about the importance of sharing credit generously and fairly with academic work. He puts a lot of thought into contribution statements, and I respect him a lot for that. So this paper is basically taking these versions of Transformers that don't have the multi-layer perceptron parts, that are just attention heads, up to two layers, building a mathematical framework for them, and talking about the circuits.
So one thing that struck me about this paper is that it kind of emphasizes this idea that you've got these different attention heads, which are cool, but you can also have paths between attention heads, and it puts the emphasis on the paths rather than the individual heads. Importantly, you can have paths through attention heads as well. So what do you mean by that distinction? So for example, let's say you've got a head which just says 'I'm going to predict that the previous token is going to come next'. This head would attend to the previous token, take the thing it attends to, apply a linear map (a map called the OV circuit), and that gets contributed to the residual stream at the destination. Then you have a path from the previous-token input, through the OV circuit of the head, to the current-token output, and I would say that entire path is the correct thing to try to interpret. Yeah, so one thing that kind of scares me here is: if you have some collection of things, there are going to be a lot more paths through the various things than there are individual things to think through, right? And presumably there's some commonality between all the paths that run through a particular location, namely that location. So I'm wondering: ultimately, is this path analysis going to be too unwieldy to be useful? Sure. So I should definitely say that a bunch of the things in the paper were either fairly early techniques that haven't necessarily turned out to be incredibly useful or that scalable, or were proofs of concept for how to think about things, and specifically there are a bunch of equations where we fully expand out every path in the network, and it's this god-awful combinatorial explosion of things, and I completely agree that, no, that is not a reasonable thing to try to contemplate. So, to get some framing: the way I think about what is going on inside a neural network is that you have bits of the network that are interpretable (not by definition interpretable, but we can put some meaning onto them, and the network cares about them having a certain meaning), and then there are bits of the network that are hard, or messy, or, in the particular case of a Transformer, things we expect to be highly compressed and heavily in superposition. As a concrete example of this: let's say we've got a zero-layer Transformer. This is the dumbest possible Transformer: it just embeds the input tokens as vectors in space, does nothing to them, then immediately applies a linear map to convert them to output tokens, well, output logits. Importantly, this can't move information between positions, because there's no attention, so it is essentially just mapping the current token to guesses for the next token; it's just a big memorized table of bigram statistics. If we just went through the data we could generate a big table of bigram statistics; it's not very hard. But the way the model has been forced to implement it is that it's had to map the 50,000 input tokens to this tiny, say 500-dimensional, bottleneck space and then back up to 50,000, and it's presumably learned to compress this enormous table of stuff into something that can be done via this pretty narrow linear map.
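As a toy illustration of this picture: in a genuinely zero-layer model the logits are exactly embedding followed by unembedding, so the effective bigram table is W_E times W_U. Using GPT-2's matrices here is only a loose analogue (its embedding does more work than a zero-layer model's), but it shows the shape of the computation:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# one row of the effective (rank-constrained) bigram table W_E @ W_U;
# the full 50,000 x 50,000 table never needs to be materialized
tok = model.to_single_token(" Barack")
row = model.W_E[tok] @ model.W_U          # [d_vocab] logits from the direct path
print(model.to_str_tokens(row.topk(5).indices))
```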
And one of the claims in the paper is that the correct way to interpret some circuit like this is that we don't try to interpret what the dimensions of the 500-dimensional bottleneck mean; we try to interpret the start and the end, and we assume that the stuff in the middle is some heavily compressed nonsense. Yeah. So the moral of a lot of the stuff about paths is saying: cool, we want to focus on paths that start and end at things that have meaning, where the things that have meaning are input tokens, attention patterns, MLP activations, and output logits, and we want to mostly focus on paths between these. The goal is not to study every path; the goal is to find things we want to understand and look for the paths leading to those that matter. Yeah. So this actually gets to another kind of worry I had about this paper. Take the zero-layer Transformer case, where you have this embedding matrix that's, you know, 50,000 inputs to 500 outputs or something like that, and then an unembedding matrix which is 500 inputs to 50,000 outputs, and the paper basically says: okay, we're just going to multiply them, think of this as a 50,000-input, 50,000-output function, and just think about that function. It seems like this is going to forget the fact that it's a very special type of 50,000-input, 50,000-output function, namely the type that's called low rank in mathematics, the type that can be decomposed this way into this intermediate 500-component representation. How scary is it that we're forgetting this low-rank structure in these matrices? This is a good question. The honest answer is I don't really know, but my guess is it's not that big a deal. The reason I guess that is: one of the things you need to do, if you're trying to get anywhere in reverse engineering a network, is distinguishing the things that matter, the things it's trying to do in some sense, from the things that kind of don't matter, that are just noise or small errors and don't really affect the underlying cognition. And to me, a lot of the losses that happen when you smush things through a bottleneck are more of the flavor of noise, or of things that aren't really important to the underlying computation, than of things that are key to it. Obviously this is kind of dangerous reasoning, especially if you're worried about a system that is adversarially trying to defeat our tools, and which might think through things like: ah, people are going to look at all of my multiplied-out matrices, but by carefully choosing which bits I leave out of the low-rank decomposition, I can smuggle in subtle flaws. And it's like: yes, in that world we'd totally miss that, but my prediction is just that that isn't a thing that matters that much, at least for the kinds of networks we're dealing with at the moment. Okay. The second point would be: one important caveat to the bigrams case is that the model is not learning a bigram table; it's not learning the matrix that is the bigram table. It's learning a matrix that, once fed through a softmax, will be a good approximation to a bigram table. Yep. Where, if we haven't said this already, bigram means the probability of the next thing given the current thing. Yes, like how 'Nanda' is more likely to follow 'Neel' than 'floorball' is to follow 'Neel'. Sure. And so it's a matrix that, when you apply a softmax,
gives a bigram table. Yes. And this adds a ton more degrees of freedom, especially when you appreciate that there are lots of tokens that are so unlikely to occur that the model just doesn't really care about them. One way to think about it is that the model is trying to compress information about each of the 50,000 tokens into the bottleneck and then decompress it; there's going to be a bunch of noise, the model needs to do something non-linear to clean up the noise, and a softmax is just a really powerful way of cleaning up a bunch of noise if you know what you're doing. All right, cool. So, getting into the mathematical framework: actually, the first question is, the paper mentions that it really only tries to understand the attention heads, and basically ignores the multi-layer perceptron layers for now. There's this paper that I read once whose title was 'Transformer Feed-Forward Layers Are Key-Value Memories', which basically argued: look, we know what the MLP layers are doing; you take some token representation, match it against various keys you have stored, and then say 'ah, for this key, here's the value vector I'm going to produce', basically implying that this was done in a way that was relatively human-understandable, and that that was the answer. I'm curious to what extent you think that's true, and if it's true, why leave it out of this paper? Sure. So first, just to de-confuse an annoying notation clash: when that paper says key and value, this is nothing to do with keys and values in attention heads. Yes, it's in the computer science sense: you've got a dictionary, you look up an entry in the dictionary, and you return whatever is stored at that entry. Yep. So, is this true? I haven't engaged deeply with the results of that paper. I've anecdotally heard claims that some of the results replicate and some don't seem to replicate very well, but obviously this is not particularly reliable data. My guess, my personal intuition, would be that this is part of the story but a pretty limited part, and it's very easy for this to be a small part of what a neuron is doing while seeming like it's the whole of what it is doing. One thing to emphasize is that a lot of this is stuff that can be done with a lookup table. In some sense, if you want to do an if statement, like 'if this vector is present in the residual stream, put this other vector into the residual stream', then when you had the first vector there, why didn't you just put in the second vector at the same time? It needs to be using the non-linearity in some interesting way, or possibly some other dumb, inside-baseball thing models need to care about, like: the residual stream gets bigger over time, and there's this thing called a LayerNorm that scales it down to uniform size, so this means everything decays over time, and so maybe you want signal-boosting neurons that boost important things, or something. Yeah. But other than stuff like that, that very plausibly is a good fraction of what Transformer neurons do.
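For concreteness, here is roughly the reading that paper proposes, sketched with toy shapes (this is an illustration of the framing, not that paper's code): each neuron's input weights act as a key matched against the residual stream, and its output weights as the value written back:

```python
import torch

d_model, d_mlp = 64, 256
W_in = torch.randn(d_mlp, d_model)    # row i: the "key" pattern neuron i matches
W_out = torch.randn(d_mlp, d_model)   # row i: the "value" neuron i writes if it fires
x = torch.randn(d_model)              # residual stream at one position

acts = torch.relu(W_in @ x)           # how strongly each key matched
out = acts @ W_out                    # activation-weighted sum of the values
```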
But I expect quite a lot of what Transformer neurons are doing is internal processing that just isn't really captured within the framework we had earlier, where the way you reverse engineer a thing is you find interpretable things at the start and at the end, and use those as the sandwich to figure out the middle. This is specifically saying the inputs to the model and the outputs of the model are the interesting, interpretable things, and everything else is only interesting insofar as it connects one to the other; this is probably true of some neurons, but I don't think it's true of most neurons. Okay, so let's head back to what was in the paper. Yes, though I do want to re-emphasize that these are my off-the-cuff hot takes about that paper. I think the actual reason we did not discuss it much is that it just wasn't the point of A Mathematical Framework; the goal was to understand attention. Plausibly this paper has some insights about MLPs, but the focus of the paper was not on understanding MLPs. Okay, fair enough. Though also, MLPs are pretty hard. So, speaking of understanding attention: one thing that is in this paper is this idea of there being three kinds of composition of attention heads, where they can Q-compose, K-compose and V-compose. To be honest, the explanation in the paper left me a little mystified, so can you tell us: what are Q, K and V composition? Sure. Yes, I'm very sad that that was misleading, because this is one of the bits of the paper I feel I can claim some credit for, giving those names. All right, so what's going on here? First, I should actually explain how an attention head works. The way I think about attention is: the model is full of attention layers, and each layer consists of several attention heads, each with their own parameters, acting independently and in parallel, and the output of the layer is just the sum of the output of each head. And the way things work is that the model is doing two tasks. One of them is: you've got this destination token, which is the current position, and the model is figuring out which sources to pull information from. Then, once it's figured out where to move things from, it's trying to figure out what information to move from those to the current position. The 'where to move things from and to' is determined by two sets of weight matrices, called the query weights WQ and the key weights WK. What happens is: the model maps the destination residual stream to a query with one linear map, maps every source residual stream to a key with the second linear map, and then takes the dot product of every pair of source key and destination query. It wants the positions where the key aligns most with the query to have high attention, but it also wants the attention pattern to be all positive reals that add up to one, so for each destination it does a softmax over those scores to get an attention pattern. The key thing to take away from all this is the contrast with simpler sequence-modeling architectures, where the model needs to solve this fundamental problem of which sequence positions are most relevant to the current position. Convolutional networks just hard-code 'nearby things relevant, far away things not relevant' (in image conv nets it's the same principle in 2D), and in RNNs, recurrent neural networks, the thing that gets baked in is that it recursively goes through the sequence, so nearby stuff is obviously more relevant. Yep. The way Transformers work is we let them spend some parameters figuring out which information is most relevant.
And heads tend to specialize in looking for different kinds of previous tokens: there are previous-token heads, there are attend-to-the-most-recent-full-stop heads, there are attend-to-the-subject-of-the-sentence heads; there's just a bunch of heads that do a bunch of different things, and Transformers spend about a twelfth of their parameters doing this 'where to move information from' computation. The intuition for thinking about queries and keys is: a query is 'what am I looking for?', within the context of this head, and a key is 'what do I have to offer?'. These are both directions in this internal space, and if they're aligned, the head attends from the destination to that source. Though the right way to think about neural networks, as always, is: it has parameters that let it do a thing, it figures out a reasonable way to make that happen, and it does not need to conform to my expectations of how this is reasonable. Sure. All things I'm saying are just-so stories. Okay. So once it's figured out this attention pattern, it then needs to figure out what information to move, and the way it does this is it's got another set of parameters, WV, that map the source residual stream to this value vector, in a small internal head dimension. It then averages all of the values, using the attention pattern, to get a kind of mixed value, so values have gone from having a source-position axis to having a destination-position axis, and then we apply a final set of weights, WO, to get the output of the head, and we add up every head's output and add it to the residual stream. Okay. Some important things to note about this decomposition I just outlined; also, you should really go stare at some of the algebra in the paper, because this is not good audio content, sorry. Yeah, people might be listening and reading at the same time, possibly. Great. Yes, and I also have a mechanistic interpretability explainer with a section on this paper that might be helpful. Okay. So, some things to draw out. The first is that most operations here are linear; the only thing that is not linear is the softmax, where we take the query-key dot products and convert them to a distribution. In particular, if you think of the keys as a big tensor with a source-position axis, and the queries as a big tensor with a destination-position axis, the dot product is actually just a matrix multiply between these two tensors, and it turns out you can think of this as taking the source residual stream, multiplying it by the matrix WK-transpose WQ, and then dotting that with the destination residual stream; or, equivalently, thinking of this as a bilinear form that takes in the two residual streams. The key takeaway from this is that the matrix WK-transpose WQ is the main thing that matters: we could, you know, double WK and halve WQ and it wouldn't make a difference, or apply some really weird transformation to the internal space that keeps the product of the two the same, and this wouldn't matter. You could rotate one and de-rotate the other, for instance. Yeah, exactly. And so one of the takeaways for me from this work is that the way we think about attention is just that there is a parameterized matrix that determines how attention works for this head, and keys and queries are not intrinsically important things of their own.
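Here is the head computation just described, as a self-contained sketch (toy shapes, no LayerNorm or biases), including a check that only the products of the weight matrices matter, which is the rotation-invariance point above:

```python
import torch

torch.manual_seed(0)
d_model, d_head, seq = 512, 64, 10
W_Q, W_K, W_V = (torch.randn(d_model, d_head) for _ in range(3))
W_O = torch.randn(d_head, d_model)
x = torch.randn(seq, d_model)                      # residual stream at each position

q, k, v = x @ W_Q, x @ W_K, x @ W_V
scores = (q @ k.T) / d_head**0.5                   # every destination query . every source key
mask = torch.tril(torch.ones(seq, seq)).bool()
scores = scores.masked_fill(~mask, float("-inf"))  # causal: can't attend forwards
pattern = scores.softmax(dim=-1)                   # the only nonlinearity in the head
out = (pattern @ v) @ W_O                          # mix values by attention, project back

# only the products matter: W_Q and W_K (and W_V and W_O) are not individually
# meaningful, since any invertible map can be moved between the two factors
W_QK = W_Q @ W_K.T                                 # [d_model, d_model], rank d_head
W_OV = W_V @ W_O                                   # [d_model, d_model], rank d_head
assert torch.allclose(x @ W_QK @ x.T, q @ k.T, rtol=1e-4)
assert torch.allclose(pattern @ (x @ W_OV), out, rtol=1e-4)
```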
The next thing to draw out is that the value step is interesting. What's going on is: you start with a tensor with a source-residual-stream dimension and a d_model dimension, which is just the actual content of the residual stream. Then WV acts on the content-of-the-residual-stream dimension, the attention pattern acts on the position dimension, and WO acts on the residual-stream-content dimension, and matrix multiplies on different axes of a tensor commute: it doesn't matter which order they go in. So there are two consequences of this. The first is that we've got another low-rank factorized matrix, WV WO, which we call W_OV. A really annoying thing is that in maths matrices multiply on the left, and in code they multiply on the right, and I always get confused over which is the right convention to use when talking about this, so I apologize in advance. And so this means that values are not intrinsically meaningful, because we could rotate WV and de-rotate WO and it would be exactly the same. The second important thing is that, because these are different axes, it doesn't matter which destination positions are getting this information: every destination position gets the exact same information from a source position, and it can only choose how much to weight it. And finally, this just reinforces the idea that attention is about routing information, because we're multiplying by the attention pattern on the source-position axis. Cool. And one thing that's sometimes helpful to do is to imagine freezing the attention patterns: you just save and cache them and no longer compute them, and then think about editing the network or changing the inputs. You can think of the attention patterns as this dynamic wiring that gets set up for how to move information around, and if you freeze it, then every attention head is actually just linear, and it just lets information flow through the network. Okay, long tangent on how to think about attention heads. So: Q, K and V composition. There are kind of three things the model is doing: where should I move information from, where should I move it to, and what information should I move from the source once I've figured out where the source is. 'From' is key, 'to' is query, and 'what' is value, and these correspond to Q, K and V composition, where Q comes from the destination residual stream and K and V come from the source residual stream. Okay. And the input to the attention head is the residual stream, which is the sum of the outputs of all previous layers, and one component in here is the token embedding: what is the actual input to the model at this position, what's the input text. Your null hypothesis should be that that's the main thing that matters; that's the thing that obviously distinguishes this position from all other positions. But it's also the case that the model just has access to all of the other bits of the residual stream that are not the token embeddings, and presumably sometimes heads are significantly using those, and sometimes they're basically just using the token embeddings. And one of the surprising things that turns out to often be true about models, though it's far from universally true, is that this computation tends to be kind of sparse: there are some bits of the input that matter a lot and many bits of the input that completely don't matter, specifically for that head's computation. Okay, for a given head's computation? Yes, for a given head's computation; the head kind of picks up on those and not on others.
Okay. And so, for example, Q composition is when the Q input, the 'where should we move information to', is significantly influenced by something that is not the token embedding. Okay. Composition here means usefully using the output of a previous component: because that's the output of a previous function, being able to use it in a new function means these are composed. Okay. So a non-obvious thing here is: why am I separating K and V composition, given that they both come from the source residual stream? The key thing is that these are low-rank matrices, which means they're essentially identifying a low-rank subspace of the residual stream. Or a small-dimensional subspace? Yes, small-dimensional. You can kind of think of this as: the residual stream is, say, a thousand-dimensional vector, and we just cut off the first 50 dimensions and ignore everything else, except the network can choose whatever coordinate system it wants. This means the head can choose different subspaces of the destination or source residual streams to pick up on, if it so chooses, and this lets the model choose different bits to compose with. This matters because one of the things mechanistic interpretability tries hard to exploit is the fact that model computation often seems kind of sparse: it just isn't using a lot of the available information, because only some of it matters. And so the hope is that often heads will be using some specific information from the residual stream for the query, some for the key, and some for the value. So let's try to build intuition with an example of what each of these might actually look like. The easier situation is thinking about things in terms of using contextual information about the current token. Let's take the sentence 'The Eiffel Tower is located in the city of', and let's say the model has computed, on the 'Tower' token, the fact that it's the Eiffel Tower: it's computed a 'this is a famous tourist destination and a European capital' feature, and it's computed an 'is in Paris' feature. And at the end, the 'is located in the city of' bit has computed a feature saying 'I am looking for a famous city', 'I'm looking for a city'. Then you could imagine a head which uses Q composition to pick up on the 'I am looking for the city something is in' feature, which, importantly, you can't get just from 'of'; it's pretty important to have the full context. It uses K composition to look for the 'I am a famous tourist destination and a European capital' thing, because that's clearly relevant, and so it learns to look at the 'Tower' token. And then it uses V composition to specifically move the 'is located in the city of Paris' information, rather than, like, 'Tower'. Sure. So am I right to think that Q composition is when your choice of what questions to ask of previous tokens is influenced by previous attention heads... It can be MLPs as well, though in this case we hate MLPs and pretend they don't exist. Yeah, pretending MLPs don't exist: that K composition is saying that what information I'm going to try to get out of previous tokens or previous token streams, to tell me which one to care about, is influenced by the output of previous layers; and V composition is when what information I'm going to extract, rather than
how I choose which token to attend to, what information I'm going to extract from previous tokens, is influenced by the output of previous layers. Yes. Yep. I should also say, in the example I gave, the head uses Q, K and V composition, which maybe made it not the best example. In general, I expect that in practice every head is probably going to be using all of them to some degree, but it's firstly just useful to conceptually disentangle them for yourself, and secondly it's useful to be able to reason through examples where a head would rely on some of these. A final point: in a lot of the path analysis stuff covered in the paper, they do things like have these ridiculous sums of tensor products that multiply with each other. Oh yeah, I should say: when you read the paper, you can just totally skip all of the tensor product notation. I like it, but I think the confusion per unit effort is really bad on that, and on the eigenvalue score stuff. Oh, I like that too. I mean, they're very fun if you have a maths background, but when I've mentored people learning mech interp, they've all been like 'ah, I understood everything in the paper, but then I spent many hours on tensor products and eigenvalues'. And there's also this bit where they talk about virtual weights in the introduction, where they have, like, WI times WO, and it's kind of weird notation. People just seem to spend more effort on those three bits than the rest of the paper put together, and they're not essential. Okay, rant over. So, in the path analysis stuff, everything being described there is V composition, because the way to think about a head is that it's got these three inputs, Q, K and V, but Q and K both terminate in producing the attention pattern, which is the intrinsically meaningful thing in the head, and then V goes onwards and can be composed with future things. The idea of path analysis is: let's ignore the attention patterns for now, and how they're computed, and just look at how information is routed through the network, and how it can go through any head in each layer. We could totally have a path that goes through head 5 in layer 0, head 7 in layer 1 and head 8 in layer 2, and that's three levels of V composition. So, now that we've got that under control: one thing the paper says is that small two-layer models seem to often, though not always, have a very simple structure of composition, where the only type of composition is K composition between a single first-layer head and some of the second-layer heads. There's something still kind of surprising to me about the fact that, it turns out, only one of these types of composition mattered. I should give the caveat that I think that statement is probably just wrong. Oh, you think that's wrong? So, I have yet to fully explore and understand this, but Redwood Research have been doing this work with causal scrubbing, trying to build this rigorous approach to really understanding what a circuit is doing, and they found... okay, so I should explain what an induction head is before we jump into this. Sure, yeah. So an induction head is this circuit that appears in two-layer attention-only models, that looks at the current token and says: has this token appeared in the past? If yes, then let's assume the thing that came after it is
going to come next. Yeah. So you could imagine that if you've got a token 'James' and you want to figure out what comes next, you're like: is this a piece about James Bond? Is this about James Cameron? Is it a piece about just a random dude called James? And you look in the past and you're like: okay, last time 'James' happened, 'Bond' came after, so I'll assume 'Bond' comes next in this context. Sure. And the way it's implemented really needs two heads. The intuition for why is that attention is fundamentally about looking at all pairs of source and destination tokens, finding the pairs that are most aligned, and moving information from one of those to the other, but each token in isolation has no access to its surrounding context, except via just knowing its position. We want the induction head to attend from 'James' to the first occurrence of 'Bond', because 'Bond' is preceded by 'James', which is a copy of the current token, but this is fundamentally contextual info. And so the way it's implemented is: there's a previous-token head in layer 0, which for each token writes to some hidden subspace saying 'this is what the previous token was'. Yep. So now the 'Bond' position has an 'I am the Bond token' feature and an 'I am preceded by James' feature, and then for the induction head, its key is whatever was written in that hidden subspace, so 'my previous token is James' is the key, and the query is just 'I am James', and they align, and the head realizes it should look at the 'Bond' token, because it is preceded by 'James', and then it predicts that 'Bond' comes next. Yeah. So information is moving from the token before: each token's residual stream gets a little bit that says 'hey, the thing before me was this', and then the attention head is looking for tokens whose previous thing was the thing you're on right now, and then it passes, for the value, 'okay, what actually is this token that I'm attending to in the past'. Is that roughly right? Yep. And this is an example of K composition and not Q or V composition, because the query is just 'James', the value is just 'Bond'; the hard part is saying 'attend to the thing that is immediately preceded by a copy of myself'. Yep. It's also worth briefly staring at this as a central example of how Transformers differ from networks without attention, or with really baked-in ways of moving information between positions: this was some genuinely non-trivial computation about which previous positions were most relevant to the current position. Anyway, if you actually look at the thing I just explained, with James and James Bond, it's actually kind of a bad use case for induction, because what if 'James' is followed by a comma, or by 'James said'? I don't really want to predict that 'Bond' comes next. Yeah. So one thing you want to check is how consistent the thing that comes after is. And actually, a probably better use case of an induction head is just repeated strings: you've got a forum post, someone wrote something, and someone further down the thread quotes the above comment. That text on its own is kind of weird and hard to predict, but if you know it appeared previously, it's really easy. Yep. But knowing just that a single word like 'the' was repeated is much less useful than knowing that the previous five words are repeated.
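A toy sketch of that two-head story in plain tensors, with random vectors standing in for trained embeddings (so the match is only reliable in expectation, which is why it's seeded): the layer-0 head supplies 'what my previous token was' as the key, and the current token is the query:

```python
import torch

torch.manual_seed(0)
V, d = 50, 16
emb = torch.randn(V, d)                      # random stand-in token embeddings
first = torch.randperm(V)[:30]               # distinct tokens, to keep the demo clean
tokens = torch.cat([first, first])           # repeated text: the induction setting
x = emb[tokens]

# layer-0 previous-token head: write each position's previous token into a subspace
prev = torch.roll(x, 1, dims=0)
prev[0] = 0

# layer-1 induction head via K composition: query = my token, key = source's previous token
scores = x @ prev.T
mask = torch.tril(torch.ones_like(scores), diagonal=-1).bool()
attended = scores.masked_fill(~mask, float("-inf")).argmax(-1)

t = len(tokens) - 2                          # a late position in the repeated half
# the head attends to the token *after* the earlier copy of the current token
print("predict:", tokens[attended[t]].item(), "actual next:", tokens[t + 1].item())
```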
And so what Redwood found was that, even in a two-layer attention-only model, the induction head kind of didn't actually do amazingly if you literally only let it use strict induction, 'strict' being 'attend to the token immediately after a copy of the current token', but it did significantly better if you allowed it a kind of fuzzier window: take the several tokens nearest the current one, use that for Q composition, and then expand your K composition to use the contextual info of what the recent tokens at the source are. Okay. But note that you don't need V composition. Okay, sorry: the Q composition is using previously computed information to figure out what it should be asking for... sorry, how is that Q composition? So let's say you've got a newspaper headline which then gets repeated just before the main text, like 'Red cat found in Berkeley'. The word 'in' being repeated later is not actually that interesting, because 'in' is a common word, but 'red cat found in' being repeated is interesting. So the 'in' token is going to have early heads which copy in, like, 'red cat found', and then the head does something mechanistically to look at the headline, where the 'Berkeley' token is preceded by 'red cat found in'. Okay, so is the K composition that, at the 'in' token, some previous heads told me that the last few words were 'red cat found', and that tells me that I'm looking for things in the past that were preceded by 'red cat found'? Yes, exactly. Okay. And the high-level intuition is just: Q composition is when you use context on the destination, so using 'red cat found' as well as 'in', and K composition is when you're using it on the source. Yep. And to be clear, I have not personally verified these results or played around with them that much, but I find this, on the face of it, pretty plausible. And I think the takeaway from all this is that the strict induction circuit I described is correct and is important, but that's not necessarily the entirety of all that is going on. Sure. So I had two questions of a similar kind, one of which was 'why do we only see K composition?', the other being 'why is it just these strict induction heads that are popping up?', and it sounds like the answer to both is: that's not true, there are other things going on. Yes. Also, obviously, the models we studied in that paper aren't the same as the ones Redwood studied, which aren't the same as the attention-only models that I've trained and open-sourced, and it could totally be different between all of these; the exact way you embed positional information can matter, and everything's annoying. Okay. But my guess is that induction is just such an important thing for the model to be doing that it is worth its while to figure this out. Yeah. And I guess a related question is: why do you think induction was the first thing that you found? Do you think that's just because it's a very basic ability, or something else? Like, why was it the first thing a two-layer model... why is it the only interesting thing a two-layer model does, or why did we find it before we found other things? Let's ask why we found it before other things. Sure. So I joined Anthropic after they discovered induction heads, so I can't really speak to the historiography here, but my guess would basically just be that induction is really useful if you want to predict the next token.
So there's this great game from Redwood, the 'predict the next token' game, where you can actually take some natural language and predict the next token yourself, and it's really hard; highly recommend the experience. One of the things that's really hard about it is that there are lots of words where it's inherently pretty uncertain what should come next. Like, there's a full stop: what comes next? I can make guesses (it begins with a capital, it's probably a common word), but it's pretty hard to do better than that even if I really know what I'm doing, and it's just not that often the case that you can very confidently say what should come next. And if you're a language model, finding a case where you can confidently say what comes next is really useful, and it just is the case that text often has repeated strings, and if you can identify, say, a 30-token repeated string after the first five tokens, you've got 25 perfect scores for free. This is just actually really useful, and so models learn to spend a lot of resources on this. Okay. It's also a pretty distinctive thing to spot if you're looking at an attention pattern, because if, say, token 0 matches token 100, et cetera, until token 30 and token 130, then the attention pattern is going to have this stripe, because every token is attending 99 tokens back, and this constant offset means that if you visualize the attention pattern as a heat map, there's this notable diagonal stripe. This is very distinctive when you're looking at heads on a bunch of data. So yeah: a combination of being visually distinctive, and it's just so useful to the model that it devotes a bunch of parameters to it, and it's pretty clean and sharp. Okay. Yeah. Now that we're talking about induction heads, do you want to move to the next paper, In-context Learning and Induction Heads? So this paper again has many authors; the ones designated as core research or infrastructure contributors are Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan and Ben Mann, and Chris Olah is the corresponding author. I think a while ago you gave a brief summary of what's in this paper, but that was a while ago, so can you remind us what's going on in it? Sure. Busy paper, lots of things. The core claim is that the induction head circuit recurs in models at all scales we've studied, up to about 13 billion parameters, and induction heads seem to be core to this capability of in-context learning, using words far back in the text to usefully predict the next token; as far as we can tell, they are the main mechanism by which this happens. It is also the case that they all seem to occur during a fairly narrow band of training, in this phenomenon we call a phase transition (not superposition), and this is such a big deal that there's a visible bump in the loss curve when it happens. And there's a bunch of suggestive evidence of interesting things, like these underlying more complex model behaviors: we found a few-shot learning head that was also an induction head, which, when given a bunch of examples in the text, looked at the example most relevant to the current problem. Huh. That was like: interesting, but what is this thing? Yeah. I have a bunch of questions about this.
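Since that diagonal stripe is so distinctive, the standard detection metric is easy to sketch: feed in repeated random tokens and average each head's attention at the offset where the stripe sits. The model and threshold below are illustrative assumptions:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
seq_len = 50
half = torch.randint(1000, 20000, (1, seq_len))
tokens = torch.cat([half, half], dim=1)          # repeated random tokens

_, cache = model.run_with_cache(tokens)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]         # [n_heads, dest, src]
    # induction attends from each repeat to the token *after* the first
    # occurrence, i.e. a diagonal at offset seq_len - 1
    stripe = pattern.diagonal(-(seq_len - 1), dim1=-2, dim2=-1)
    for head, score in enumerate(stripe.mean(-1).tolist()):
        if score > 0.4:                          # arbitrary threshold
            print(f"L{layer}H{head} induction score {score:.2f}")
```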
Maybe related to my first question: the bare-bones definition of an induction head is something like 'look at the last time you saw this thing and say what the next thing is', or something that helps you do that task. I think in this paper you define an induction head by something which does that on a random string of tokens that's been repeated: things which just look back to the last occurrence of this random token and show you the next one. And if you just think of that, it's not clear why you need more than one of these, right? It feels like one task. So why are there more than one induction head? This is a great question; if anyone listening to this figures it out, please let me know. So yeah, I don't really know. Here are some guesses. Guess one, one hypothesis, is just that it's kind of useful to the model to have a higher-rank OV circuit. That is, we force the model to squash all of the information it moves into this 64-dimensional head dimension, but if you've got two heads that have exactly the same attention pattern, then it is de facto compressing the residual stream into a 128-dimensional thing, which is just higher fidelity, and maybe that's useful. To do the strict induction behavior of literally predicting that the thing you attend to is going to come next, you don't even need 64 dimensions; in theory you can do it with two dimensions, which is a very cute result I found recently. How can you? Because you need to copy over the data of the last word, right? Ah, so the construction is: you take each element of the 50,000-word vocab, map them to points evenly spaced around the unit circle, and then, to get the output for the kth token, you project onto the kth direction. And you can multiply by something large, and then the softmax can basically act as an argmax, and it's biggest in the right direction. Yeah. I noticed this when I was writing a review comment on Anthropic's new toy models and memorization paper, and they have this fun notebook they link, where they go through this more rigorously and look at how, asymptotically, the weight norm you need to do this grows a lot. In practice, when I tried training a model to do this with a thousand data points, it kind of struggled, but it could do 100 really easily; my prediction is just that when you add more dimensions, it can get a ton more mileage out of them, by making the points evenly spaced on the unit sphere or something. Anyway, yes, interestingly, this is very different from Anthropic's work on superposition, where it seemed really hard to compress features. The key difference is that if you're trying to memorize, the range of activations is just a set of discrete points rather than a full range, and it's much easier to squash lots of points into the same space. Okay, anyway, total tangent. But if the model is doing things that aren't literally copying, or if it just needs to deal with a bunch of noise and other garbage, the extra rank helps. Secondly, the model probably wants to do a lot of subtle variants on induction. I mentioned the one that's like 'check if the several most recent tokens match; if so, do this'. You could totally imagine a thing that's like 'check if the most recent token matches; if so, do X', and then a second thing which checks strictly whether the most recent ten tokens match; if yes, be way more confident in our answer.
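A toy demonstration of that two-dimensional construction, with a smaller vocab; note how large the scale factor needs to be before the softmax sharpens, matching the point about the required weight norm growing:

```python
import torch

V = 100                                           # toy vocab on the unit circle
angles = 2 * torch.pi * torch.arange(V) / V
embed = torch.stack([angles.cos(), angles.sin()], dim=1)   # [V, 2]

tok = 37
logits = 5000.0 * (embed @ embed[tok])            # project onto token 37's direction
probs = logits.softmax(dim=-1)
print(probs.argmax().item(), f"{probs.max().item():.3f}")  # 37, with prob near 1
```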
Then there's a deeper question, which is: so, in my opinion, induction heads are actually a deep and fundamental mechanism of how models do things. This is kind of skipping to the bit of the paper that asks why induction heads are relevant to in-context learning. Sure. My guess is that, fundamentally, the way you're going to do in-context learning is: you're going to distill out the bits in each section of the text into "what exactly is going on here", and then do some kind of search to find which distilled things from earlier in the text are relevant to where you are now. And there are kind of two ways you could implement an algorithm like this. One of them is asymmetric, where you learn a lookup table: if A is happening right now and B occurred in the past, B is relevant to A. One example might be: if you're reading a paper, the relevant thing in the past would be the abstract of the paper, or the title, and you could just distill out some features from that, and then most things in the text care about those, but not vice versa. But this can get kind of complicated, because you need to learn a pretty big lookup table. The other kind of thing you can do is symmetric, where you say: if A is relevant to B, then B is also relevant to A. And this is just a much more efficient algorithm, because if you want to learn ten symmetric relations, you can just map everything to the same hidden latent space and... sorry, that was a bit convoluted. Let's say you think that all names are relevant to all other names. Rather than learning a Neel-to-Daniel, Daniel-to-Neel, Daniel-to-Jane, Jane-to-Daniel, etc. massive lookup table of links, you can just map every name to an "is a name" feature and then just look for matches between the "is a name" features. Sure. And to me this just seems like a fundamental algorithm for how you would do search when you are storing things as directions in space and looking for matches. And importantly, this scales linearly with the number of features, rather than quadratically like lookup tables do. So this is much more efficient. And the kind of induction we've been discussing so far is the thing where the features you're matching are just literally token values, but you can totally imagine this being done more abstractly. For example, we found these translation heads, which could attend from a word in a sentence of English to the thing immediately after the relevant word in the French sentence. And the natural guess for what's going on here is that the model has learned to distill what's going on, saying okay, "this is French": just the semantic content of the text rather than the actual tokens; and then French and English get mapped to the same space, and it's easy to match them. Note that what I'm describing is actually more like a duplicate token head, which is attending from a token to copies of itself, but if you have that, extending it to an induction head seems pretty easy. And I believe we also found heads that attended from a word in English to the same word in French. An interesting thing about these translation heads is that not only are they kind of English-to-French, they're also French-to-English, German-to-English, English-to-German, French-to-German, etc., and they also seem to work as induction heads. Sorry, is it that the same head does French-to-English, English-to-French, German-to-English? Yes.
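A hedged toy illustration of that symmetric matching trick (the dimensions, noise level, and token setup are invented for illustration):

```python
import numpy as np

# Lookup-table matching needs one entry per ordered pair of items: O(k^2).
# The symmetric trick maps every name to the same "is a name" direction and
# scores matches with a dot product: O(k) in the number of features.
d_model = 16
rng = np.random.default_rng(0)
is_name = rng.standard_normal(d_model)
is_name /= np.linalg.norm(is_name)

def token_vector(token_is_name: bool) -> np.ndarray:
    # Each token's residual-stream vector: the shared feature plus noise.
    noise = 0.1 * rng.standard_normal(d_model)
    return (is_name if token_is_name else 0.0) + noise

query = token_vector(True)                       # current token is a name
keys = np.stack([token_vector(t) for t in (True, False, True, False)])
scores = keys @ query                            # high only at name positions
print(np.round(scores, 2))                       # roughly [1, 0, 1, 0]
```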
There's a fun thing in the paper where you can go play around with the attention pattern. And I found a couple of heads like that in GPT-J, and figuring out what's up with those is on my long-term to-do list. Okay. It's also one of the problems in my concrete open problems sequence, so people can go try that. I'm very excited to see what you find. Anyway, yeah, the translation head is also an induction head, and my guess is it's just the same fundamental algorithm: map things to latent space, look for matches, look at the thing immediately after the match; and the model has just learned how to do something sensible here. And yeah, going back to the original question, my guess would be that one of the reasons it has many induction heads is that they're doing subtly different things, at varying degrees of fanciness, along this algorithm. Yeah, like, how subtle do you think the differences are? You might imagine that there's just one kind of thing, which is induction heads, and there are small variations between them; but in this paper they're sort of defined functionally, as "on this type of data they do this thing", and you could imagine that there are several kinds of things that have this ability, and you're sort of grouping this heterogeneous collection. So yeah, how homogeneous or heterogeneous do you think induction heads are? Ah, yes. So, another important fact about induction heads is that you get anti-induction heads. Oh, really? An anti-induction head is a head which has the induction attention pattern, but it suppresses the token it attends to. Huh. Like, it says: okay, this is James, I'm going to look at what comes after James in the past; it's Bond; Bond is now less likely. Yeah, and why does this exist? I guess you could sort of imagine it for patterns, right? So imagine I use a bunch of phrases like "get the salt and get the pepper", or "get the pepper, get the salt", and I say that all the time. Well, "get the..." could be pepper or salt, and if you know that the last time it was pepper, it's definitely not pepper this time, and so that leaves salt. That kind of thing strikes me as one possibility. I like that, yeah, it's an interesting idea. One idea: so, in this indirect object identification paper, they found this interesting phenomenon where there were name mover heads, which attended to the name that was the correct answer, and negative name mover heads, which I think also attended to the correct name but suppressed it. And then when you ablated the name mover heads, some of the negative name movers kind of acted as backups and significantly reduced their negative behavior. And my guess is that that was a result of dropout, which GPT-2 was trained with. I have no idea whether the models in our paper were trained with dropout or not, but I can imagine this being a thing that matters. I could also imagine what you said being relevant. I could also imagine... so, one of the strengths and problems of induction is that induction is one of the things where you can be really confident that what you're doing is correct, if you're a model, and just get very high probabilities. Yep. But this means you often drown out everything else, because for most things you can't be incredibly confident, and it's just incremental shifts on the margin. So if you want to turn off induction, that's also hard. And yeah, going back a bit, there's this question: are induction heads all basically the same kind of thing, or are there lots of different types of things that are induction heads?
So, you brought up anti-induction heads that can sometimes act as induction heads. Yes. Yeah, I don't really know if I've got a better answer; things are messy. Another form of messiness is that if a bunch of heads have the same induction-y attention patterns, then in some sense the model doesn't care about the fact that the OV matrices are different between the two heads; it only cares about the sum, the combined twice-the-rank OV matrix. Yeah. And if the attention patterns are purely fixed, it will just optimize that. Yeah. And this is weird: it's possible the model just decided to have a bunch and then distributes some more sophisticated computation among them; it's also plausible the model just wants to do different kinds of induction, like wanting different attention patterns. Yeah. One cute side project: I made this thing I call an attention mosaic, with a heatmap of the induction heads in like 40 open-source models, and this is a fun thing that makes it visceral: there are so many induction heads, man. And larger models have way more. More like a constant fraction of their capacity, or is it sort of superlinear or sublinear in how many heads they have? That is a great question, and I have not checked. Let me just go eyeball the heatmap. Yeah, looking at these, I would broadly argue that the fraction of heads that are induction-y goes down a bit as you go up, but it's kind of sketchy and I don't trust my ability to eyeball this properly. Also, different model architectures seem to potentially have different rates of induction heads, even at the same size. Oh, that's very cute. If I want tons of induction heads, what architecture should I use? GPT-Neo seems to have a lot. Oh, GPT-2 XL has tons; go use the big GPT-2 architecture, that's my advice. Okay. Yeah, my next big question about the big picture of the paper: the argument is that there's just this particular phase during training where induction heads start occurring, and to me it seems sudden and not gradual, and that strikes me as kind of weird. Why do you think they appear suddenly, and not just gradually throughout training, more and more? This is a great question. One thing to point out is that there just hasn't been that much investigation of different circuits and their training dynamics, and there's the kind of intuitive null hypothesis, the intuition of "obviously things are gradual, we're doing gradient descent, it's kind of smooth and continuous". But I kind of assert that I don't think people have checked that hard, and I have the toy hypothesis that actually most circuits happen in phase changes, and induction heads were just a sufficiently obvious big deal that we noticed. Yeah. So if things do suddenly occur, to me that implies that something has to happen before they occur, right? If they occurred gradually over training, you'd think, ah, there's just always some pressure for them to occur; but if they spend a while not happening and then suddenly appear out of nowhere, presumably there's a trigger, right? This may be better left to discuss in my grokking work, since I think it fits more naturally there. That's fair enough, yeah, we can talk more about this during the grokking paper. Yeah, I think my answers for why grokking happens overlap a bunch with my answers for why induction heads might happen as a phase transition.
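A sketch of how one might score heads for induction, in the spirit of the attention mosaic mentioned above; the `patterns` tensor is an assumed input (post-softmax attention cached from some model on a repeated random sequence), not a specific library's API:

```python
import numpy as np

T = 50                 # length of the random token block, repeated twice
# patterns: assumed array of shape (n_heads, 2*T, 2*T) of attention weights
def induction_score(patterns: np.ndarray) -> np.ndarray:
    # For a query at position t in the second repeat, an induction head
    # attends to t - T + 1: the token right after the previous occurrence.
    qs = np.arange(T, 2 * T)
    ks = qs - T + 1
    return patterns[:, qs, ks].mean(axis=-1)     # one score per head

# Scores near 1 mark strongly induction-y heads; near 0, not at all.
```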
Yeah, I did group these papers for a reason. So, more on the in-context learning and induction heads paper, on a methodological note: the structure of the paper is that there are these, how many, six arguments? Six arguments, though argument six is kind of cheesy. Oh, okay. Yeah, okay, so let's say five. Argument six is "arguments one to five are a thing, and they were way stronger for smaller models than large models, but come on, man, this obviously generalizes", which I think is a correct argument, but doesn't deserve a number. Yeah. Or maybe let's say five different kinds of evidence. Yes. And sometimes I'll read these papers from Anthropic where they're like, yeah, we have these K lines of evidence for this thing being true, and I rarely see papers where it's like, there were three lines of evidence that this thing was true, but the fourth ruled it out. So here I guess my question is: how much is it in fact the case that the fourth line of evidence often rules a thing out? Or, when I see these publications, was it just, okay, we need to confirm this five different ways to satisfy all the reviewers? I don't know if I've got enough data that I can really give an answer about what tends to happen in practice. My guess, from a more sociological perspective, is that what happens is people do exploratory research. As you're doing research, you're accumulating a bunch of evidence for or against; the evidence is often murky and confusing, and it's easy to trick yourself and shoot yourself in the foot. And as you go, you're trying to ask: what am I missing? How could this beautiful, elegant hypothesis be completely wrong, or missing a really simple explanation that's less pretty? As has happened to multiple projects that I've mentored. And yeah, at this point you aren't really thinking in terms of polished lines of evidence; it's more like, here is this collection of evidence, it's a bit janky and a bit weird, but it kind of gets at the thing that I care about. And then you kind of smoothly-ish transition to a point where you're like, okay, I've got enough circumstantial evidence, I've tried to red-team it a bunch and I've mostly failed to break it, I'm pretty confident this is what's happening; let's go and nail this down. Or, normally, it would be: okay, I'm pretty sure this particular bit is what's happening, but I'm so confused about all of this other stuff; and then eventually reaching a point where you're like, okay, I think I've got a sufficiently clear picture of what's going on that I'm convinced; let's go and do a bunch of confirmatory work, really try to nail things down, get a bunch of evidence, etc. Which makes it hard to fit into the lines-of-evidence ontology, because to me each of these is a kind of polished investigation in its own right; including multiple times when we had major bugs in the code that invalidated the results and we had to rerun a bunch of things, and things like that. Hmm. Okay, so part of where I'm coming from with this question is wondering whether, the more of these I read in the paper, the more convinced I should be by the results. Should I think of these arguments as sort of earlier versions of them that could tell you that you were wrong about earlier-stage results, and therefore they're
all kind of adding evidentiary weight, because they could have killed the thing in its infancy? I don't quite understand the question, sorry. Yeah, so the question is: suppose I'm reading a paper, and it's offering different lines of evidence or different arguments for, say, "this circuit does this thing". I'm wondering, should I be impressed by the fact that there are more than three, or should I be like, look, they're just piling it on at this point; to the extent that this is wrong, or could have been wrong, it's not going to be caught by arguments four through six. Hmm, this is a good question. So I feel like the thing you want to track here is the correlation between the types of evidence. Yeah. So, one example: there's the question of exactly what you mean by induction heads. As I outlined, there's this kind of fuzzier induction that uses a bit of Q-composition as well as K-composition to check for longer prefix matches. There's also a totally different way of implementing an induction head, where you look at the positional embedding at copies of the current token, and then you look at the token with the positional embedding one after those; and this is equivalent to induction, but it's a totally different mechanism. And I would say that the results of the paper hold equally well if the things we're calling induction heads are implemented with either of those two mechanisms, or fuzzy variants of either of those, and I think that this is an important thing to bear in mind. Okay, but... I think that... okay, no, I actually take that back slightly. There's the argument, argument two, where we had an architecture shift that made strict induction heads doable in one-layer models, which made a big difference. Okay, and would that have only helped one of the implementations? Uh, yes. Specifically, the idea was: rather than the key being just the value of the source token, the key is now a linear combination of the source residual stream and the one before, with a learned parameter weighting the two. Yeah, and that helps with the "copy the previous value" implementation, but it doesn't help with the "shift the positional embedding" one. Yeah, and it also doesn't help with fuzzy induction, where you have a longer prefix match. Yeah, but I think you were about to get to another point? Yeah, I think I've just disproved the point I was going to make, which is great. But there are some unpleasant assumptions and correlations that go into things, where what you mean by "induction heads" is kind of an implicit shared assumption behind a bunch of the evidence in the paper, and the thing you want to track is how correlated the arguments are, and what shared assumptions they make. Piling on a bunch of things with the same shared assumptions is boring, but piling on things with very different angles is much more interesting. Sure. And in this case, the smeared key stuff is just quite different, because it's not relying on some intuition around what exactly an induction head is; it's not relying on the fact that we can label heads in a model as induction heads or not and then analyze how important they are. It's instead relying on our mechanistic understanding. The concrete result is they found that smeared-key models do not have an induction bump, and do not have a phase transition where they form induction heads or get way better at in-context learning.
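A hedged sketch of that smeared-key tweak (the exact parametrization, sigmoid-gated per head, is an assumption; the paper's details may differ):

```python
import torch

def smear_keys(k: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    # k: keys of shape (batch, seq, n_heads, d_head); alpha: (n_heads,).
    # Each key becomes a learned interpolation between this position's key
    # input and the previous position's, so a single layer can match on
    # "previous token" without needing K-composition across layers.
    a = torch.sigmoid(alpha).view(1, 1, -1, 1)
    k_prev = torch.roll(k, shifts=1, dims=1)
    k_prev[:, 0] = k[:, 0]          # no predecessor for the first token
    return a * k + (1 - a) * k_prev
```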
And the one-layer models are still pretty good at in-context learning if they have smeared keys, and are much better than if they don't. Okay. And yeah, the interesting thing there is that this gives pretty strong credence to our mechanistic claim about the circuit being a pretty important part; though it does not prove that all induction heads in large models are because of this circuit, or that this is actually the way it is used in large models, just that it's an important part. Anyway, the key thing here is just: look at the lines of evidence and ask, do these feel diverse to me? Sure, I mean, looking at them, they're pretty diverse. All right, cool. Next I have a methodological question about one particular part of the paper that I found kind of interesting. Sure. I should add the caveat that I did not do the methods in this paper, but I'll do my best. All right, so you might not know. So there's this part where you take models and represent them by the loss they get on various tokens in various parts of text, right? Yep. And then there's this step where you do principal component analysis, where you say: here's one axis along which models can vary in loss space, here's another one, here are the two most important principal components; and you plot how models vary during training along these two axes. Yep. And you show that when induction heads start to form, it turns a corner in this space. Yep. Such a cute result. Yeah. Okay, I had a bunch of questions about this. I guess my first question was: there's this principal component analysis on the losses, and it struck me as strange that the principal component analysis was on the space of losses of the model, rather than just the outputs of the transformer. Do you know why that choice was made? As in, on the log probabilities rather than the logits? Uh, well, it was on the losses; it's not the log probabilities. So that's the same: cross-entropy loss is the log prob paid to the correct next token, pretty sure. Oh, okay. So per-token loss is just the log prob of the correct next token. Yeah, but the losses depend on what the actual next token was. Yes, yes: it's not all the log probs, it's specifically the correct next token's log prob. Yep. Oh, I haven't thought that deeply about this; to me it just seems more intuitive. The thing optimization pressure is applied to is the log prob of the correct next token, and so if I'm thinking about things during training, where training pushes things, I want to think about the losses. Is that roughly it? Yes. So, one of the important intuitions to me here is that there are things the model doesn't need to care about, and it has a lot of freedom to move things like: what is the average value of my logits? It could add 50 to every logit and it's completely fine; it doesn't change the log probs, and I do not care if the model adds 50 to all the logits, this is just an arbitrary degree of freedom. Sure. It can also fiddle around a lot with the incorrect log probs, where it subtracts 50 from a bunch of logits that just didn't matter; this doesn't really change the logsumexp, so it doesn't really change the output of the log softmax on the correct token, but it changes a bunch on the incorrect ones, and I just don't care about those, because they aren't things the model is trained to achieve.
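A quick check of the degrees of freedom just mentioned (illustrative values; the index 123 is an arbitrary losing token):

```python
import torch

# Adding a constant to every logit changes nothing about the log probs,
# and hence nothing about the per-token loss.
logits = torch.randn(50_000)
assert torch.allclose(torch.log_softmax(logits, -1),
                      torch.log_softmax(logits + 50.0, -1), atol=1e-5)

# And hammering an already-losing logit barely moves the correct token's
# log prob, since it barely changes the logsumexp.
correct = 7
logits2 = logits.clone()
logits2[123] -= 50.0
diff = (torch.log_softmax(logits, -1)[correct]
        - torch.log_softmax(logits2, -1)[correct])
print(float(diff))    # ~0
```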
A caveat to this is that specific circuits may make claims about what these should be; this is actually one of the things we leveraged a lot in my grokking work. But okay, it wouldn't surprise me if both give similar results; by taking the losses, you're just definitely being invariant to stuff that doesn't matter. Yes. Okay, so another question I have about this PCA is that I want to know what the principal axes were. And first, I have a guess about what the principal axes were, and I want to know whether I'm right. My guess would be that there's one principal axis which is just how much loss overall the model has, and then, maybe this is just inspired by the paper, but maybe there's another principal axis which is like, do you improve your loss over the course of training. So were those the principal axes, and what were they? I completely don't know. I will say it can't just be that the model generally has good loss, because PCA is about variance, so it automatically normalizes out the mean. But this is different; it's the PCA over different models, right? And so some models could have better overall loss. No, I believe it's PCA for the same model across checkpoints, across training checkpoints. Right, yeah. So there could be an axis where, if you're low on the axis, you're at the start of training, and if you're high on the axis, it's late during training, when it's lower loss. Yeah, okay, yeah, I agree with that. So I guess in that case you'd predict that one of these is just generally positive for all tokens or something. Yeah. Yes, I have absolutely no idea. One prediction I would make is that it's probably not actually the case that each principal component is interpretable, but rather that there is an interpretable basis that kind of spans the first couple of principal components, mostly. Yeah. Can I give you my argument for why that's the first principal component? Go for it. If you look at these plots of how the model is moving along in this space, the model is always moving to the left along the first principal component; it's always moving in one direction over training. So that's my brief argument that this might be what the first principal component is. I like it. But I have no idea what's going on with the second principal component. Yeah, I agree. It's also just, why does it go down, and then up, and then basically be horizontal? What? Yeah, life is pain. Yeah, this is another thing: if I know that it turns around, then I want to know, okay, what's going on with this new direction, right? Which is closely tied to what the second principal component is. Yep. I don't know. So, if it were the case that the second principal component was improvement in loss over time... you mean improvement in loss over the context window? Over the context window, yeah. But then I don't think you would see change in the second principal component before the induction heads show up, so I think that's proven me wrong on that guess. Yes; if anything, you'd see a slight improvement in the same direction, because skip trigrams, which are a thing one-layer models can do that we didn't get around to talking about in A Mathematical Framework, are things that improve in-context learning. Hmm, right: things that you and I did not get around to talking about, but are definitely in the paper.
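A hedged sketch of the analysis being discussed; the file name and array shape are assumptions about how you would store the per-token losses:

```python
import numpy as np
from sklearn.decomposition import PCA

# losses[i, j] = -log p(correct token j) under checkpoint i, on a fixed
# evaluation set: one row per training checkpoint.
losses = np.load("per_token_losses.npy")   # hypothetical file, (ckpts, tokens)

coords = PCA(n_components=2).fit_transform(losses)   # (ckpts, 2)
# Plotting coords in checkpoint order traces the training trajectory;
# the paper's claim is that it turns a corner when induction heads form.
```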
People should enjoy reading about them, and there's also a video walkthrough I made of the paper, if you want me to personally talk you through them. All right, those are my questions about the PCA. I mean, yeah, the way I recommend thinking about the PCA thing is just: oh, that's really weird, why is there a kink? I don't know. I still want to know, right? What is this direction? That seems so interesting. I mean, I agree. Cool. Finally, I want to ask: at the end of this paper, there are some mysteries that you pose. So one of the mysteries is that there's this constant in-context learning score, right? Yeah, that still confuses me so much. So specifically, what's going on is: when you look at how much the loss improves between the 50th token and the 500th token, or the 80th token and the 1,000th token, or whatever; once you fix how much the loss decreases between this token and this later token, different models, which look different in general, tend to have the same amount of loss decrease. At least... or is this just at the end of training? This might just be at the end of training. I think it's just a true statement after the induction bump. Yeah, sure. So, with the benefit of time, since it's been a while since this paper was put out: are we any more clear about what's going on here? Not really. It's weird. I mean, the paper does make the point that if you compare the models' per-token loss at different points in the context to the smallest model, the big model has gotten most of the benefit that it's ever going to get by like token 10, and doesn't obviously get better. So we compare the 13-billion-parameter model to the 13-million-parameter model, and it's like 1.2 nats better per token; by token 100 it's pretty good; by token 8,000 it's still about as good, has actually regressed slightly; at token 10 it's like 1 nat better rather than 1.2, or like 1.18. Sure. And I was like, okay, maybe it's just a lot better at doing the kind of short-term stuff. This is just kind of weird and surprising; stuff like the translation heads, I feel, must be relevant to improving loss. One hypothesis is that a lot of the stuff models do is not actually that relevant to changing the loss, because there are just so many tasks that get increasingly niche, and so a bunch of the things (obviously a tiny model could not track long-range dependencies in this way) just aren't important enough to really matter in the context of the loss. But yeah, no, I would not have predicted this. All right, and then there's this other mystery, which is that you can look at the loss derivative. So these are derivatives where you have some architecture and you look at how the loss changes if you train on, like, one percent more tokens or something. And in this part you find that during the phase change, this derivative starts going up, and then once induction heads start appearing, this derivative... I guess the derivative goes down, but the loss derivative should be negative, so whatever: it starts going up, then it goes down suddenly while these induction heads are forming, then it goes up again, still while the induction heads are forming, and then it starts leveling off. And also, these derivatives: if you compare models of different sizes, before the induction heads start appearing, the small models improve faster, but after the induction heads start, the large models are improving faster.
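For concreteness, a sketch of the in-context learning score being discussed, computed from per-position losses (the positions 50 and 500 follow the description above; the file and bookkeeping are assumptions):

```python
import numpy as np

# token_losses[t] = loss at context position t, averaged over an eval set.
token_losses = np.load("avg_loss_by_position.npy")   # hypothetical file

icl_score = token_losses[50] - token_losses[500]     # late-context gain
# The mystery: after the induction bump, very different models land on
# roughly the same value of this difference.
print(icl_score)
```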
So it's a strange plot; something's going on around where these induction heads are popping up. At the time this paper was published, it was a mystery. Is there any light shed on this mystery in the time since? Nope, still a mystery. Okay, good to know. Yeah, I'm finding it kind of interesting looking back on this; I'm like, oh yeah, I'm still confused about that, and it just never felt like a priority to try to follow up on. There's a lot of work that I would really love someone to go do in mechanistic interpretability. Sure. Well, on that note, should we move to the work you did do on mechanistic interpretability: Progress Measures for Grokking via Mechanistic Interpretability? So this... we're actually recording while it is under review for ICLR, and I'm seeing the anonymized under-review version, so I don't actually know the author list. Can you tell me who it is? Yes, so the author list is me, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Okay, cool. And yeah, can you give us an overview of what's going on in this paper? Yeah. So grokking is a weird, mysterious phenomenon found in this OpenAI paper: you train small models on algorithmic tasks like modular addition, you give them access to a third of the data, and you keep training on that same third. It initially memorizes the data; then, if you keep training it on the same data for a really long time, it suddenly generalizes, even though it's seen nothing new. Yeah. And the key, when you say they memorize the data, is that they aren't at all accurate on data points that they haven't seen. Initially they are significantly worse than random. Worse than random, yes. So this is actually not that surprising. The reason is that in order to get random loss, you need to output a uniform distribution on data you haven't seen before, but this is a non-trivial task: you need to say, "I have not seen this before, so be uniform", and because you get perfect accuracy on all training data, you never care about outputting uniform. Right, right. And this isn't an issue with non-stupid tasks, because there are enough things you're confused about that you want to output uniform by default. Okay, anyway, tangent. So I was like, I have no idea what is happening here; also, this is a tiny model, I should be able to reverse-engineer this. And so we found that you could exhibit grokking in a one-layer transformer, and you could further simplify it to have no layer norm and no biases; credit to Tom Lieberum for discovering this. And then I reverse-engineered what it was doing, which was this wild discrete-Fourier-transform and trig-identity-based algorithm, which turns out to have deep links to representation theory and character theory and weird stuff. I mean, to me it makes a lot of sense once you see it; it's like, oh yeah, of course that's how you should do it. I agree. It took me a week and a half of staring at the weights until I figured out what it was doing, and after another week or two of refining it, now I'm like, oh yeah, obviously this is how you do modular addition if you're a neural network. I do want to emphasize that this was multiple weeks of effort and being confused, and was emphatically not obvious in foresight. Yeah, that seems totally correct. I do know of a DeepMind researcher who independently came up with the algorithm, so it's not like a kind of genius thing or something. I mean, I don't know, DeepMind researchers can be pretty smart.
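A minimal sketch of that grokking setup, under assumed hyperparameters; the MLP here is a stand-in for the paper's one-layer transformer, so whether this exact toy groks will depend on the details:

```python
import torch

n = 113
pairs = torch.cartesian_prod(torch.arange(n), torch.arange(n))   # all (a, b)
labels = (pairs[:, 0] + pairs[:, 1]) % n
perm = torch.randperm(len(pairs))
cut = len(pairs) * 3 // 10                     # train on ~30% of the data
train, test = perm[:cut], perm[cut:]

class Net(torch.nn.Module):
    def __init__(self, n: int, d: int = 128):
        super().__init__()
        self.emb = torch.nn.Embedding(n, d)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(2 * d, 512), torch.nn.ReLU(),
            torch.nn.Linear(512, n))
    def forward(self, ab: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.emb(ab).flatten(1))

model = Net(n)
# Full-batch AdamW with heavy weight decay, trained long past zero train loss.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
for epoch in range(40_000):
    loss = torch.nn.functional.cross_entropy(model(pairs[train]), labels[train])
    opt.zero_grad(); loss.backward(); opt.step()
    # Signature of grokking: train loss crashes early; test loss only
    # crashes thousands of epochs later.
```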
Sorry, let me... that came out wrong. What I mean is, it's not like a flash of inspiration that only one person could have had, that was incredibly idiosyncratic and lucky. Yeah, okay, it's not an unreasonable thing for people who engage with the problem to stumble upon. Yeah. Anyway, so having reverse-engineered the algorithm, I then devised some qualitative approaches, where I could just look at the model and inspect the circuits it had learned, but also some quantitative progress measures, to measure progress towards the solution, and then ran these on the model during training. And I figured out that during training, what seemed to be happening was that it first just purely memorized, and formed what I call the memorizing circuit, which didn't at all engage with the structure of the training data, but just kind of memorizes it, and does terribly on the test data. Importantly, it does way worse than random, because it adds a ton of noise. Then there's this long phase that looks like a plateau in the train loss, but there's actually this phase we call circuit formation, where it's slowly learning the generalizing algorithm and transitioning from a dependence on the memorizing algorithm to a dependence on the generalizing algorithm. I'm not fully confident about exactly what's going on at a weight-based level, like how much it's just kind of got a module for memorization and a module for generalization, and it's scaling up one and scaling down the other, versus literally cannibalizing the memorization weights for the generalization algorithm; but it's transitioning between the two while preserving train loss, and it's somewhat improving on test loss, but it's still much worse than random. Okay. And the reason for this is that the memorization component just adds so much noise and other garbage to the test performance that, even though there's a pretty good generalizing circuit, it's not sufficient to overcome the noise. And then, as it gets closer to the generalizing solution, things somewhat accelerate, and eventually it gets to a point where it just decides to clean up the memorizing solution; it gets rid of that component in the output, and suddenly figures out how to generalize. And this is when you see it grok. But it's not that it suddenly learns the correct solution; it's that the correct solution was there all along, and it just uncovered it. Okay. And I can give some interesting underlying intuitions. Yeah, I'm going to have a bunch of questions about underlying intuitions shortly, but first, just to get our stories straight: so you noticed this algorithm that basically involves taking these input numbers, imagining this unit circle and rotating some multiple of that many degrees, and then using some trigonometry identities to do what you wanted to end up doing. Can I give a better explanation of the algorithm? Yeah, sure. So the way to think about it is: modular addition is fundamentally about rotation around the unit circle, or at least it is equivalent to thinking about rotation around the unit circle by an angle, say, 2π/n. And you can think of the integer a as "rotate by the angle 2πa/n". Yep. And you can represent this with cos(2πa/n) and sin(2πa/n), which kind of parametrize that rotation. And you can take the two inputs a and b, rotate by 2πa/n and 2πb/n, and you can
compose them to get the rotation 2π(a+b)/n. Yeah. And this is now the sum, but it's also the sum mod n, because it wraps around the circle if you get too big. And you can compute this by just taking multiplications of pairs of the trig terms and adding them using trig identities. And then, to get the logits, you rotate backwards by 2πc/n, to get a rotation by (a + b − c) times 2π/n, and you look at what this does to the x-axis; you project this onto the axis of the circle. Yep. And this is 1 if you've done nothing, i.e. if c = a + b mod n, and it's less than 1 if you've done some rotation, so it's biggest at the correct answer. And that's the basis of this algorithm. Yes, but you do it at sort of different speeds of moving around the circle, right? Yes: the model spontaneously forms five to six sub-modules for different angles, where each angle is some multiple of 2π/n, and we call these the key frequencies. The neurons spontaneously cluster into a cluster per frequency, and the output logits are the sum of a cos(w(a + b − c)) term for each of the five frequencies. The frequencies chosen are basically arbitrary. The reason it wants multiple frequencies rather than just one is that it is technically true that softmax can act as an argmax, in that it just takes the index of the biggest element, but it's not very good at this if the gap between elements is very, very tight. Yep. Which it is for a single cosine. But if you take the average of a bunch of waves at different frequencies, they constructively interfere at zero, where they're all 1, but they destructively interfere everywhere else, so the gap between zero and everything else gets way bigger. Yeah, or a way I would think about this is: if you rotate around at different frequencies, the things that are in second place along these axes change, because when you multiply by a different frequency you're sort of taking the circle, stretching it out, and wrapping it around itself a few different times. I don't know how clear this is via audio, but you're basically changing which things are in second place. So yes: you can't have a thing that's in close second place at every frequency. Lawrence made a great diagram at the start of the paper; people should look at it. Cool. And one thing to highlight about this algorithm is that a and b are activations, c is a parameter, and the hard part of the algorithm is multiplying together the trig term for a and the trig term for b to get the (a + b) rotation; everything else is really easy. Okay.
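A small numeric check of that logit formula; the key frequencies below are made up (the model's choice is arbitrary anyway):

```python
import numpy as np

# logit(c) ~ sum over key frequencies k of cos(2*pi*k*(a + b - c)/n).
# The waves all equal 1 at c = (a + b) mod n and destructively interfere
# everywhere else, so summing frequencies sharpens the argmax.
n = 113
key_freqs = [14, 35, 41, 52, 73]     # hypothetical; any distinct set works
a, b = 29, 67
cs = np.arange(n)

logits = sum(np.cos(2 * np.pi * k * (a + b - cs) / n) for k in key_freqs)
print(int(logits.argmax()), (a + b) % n)     # 96 96
```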
Right. So I guess one question I have: in the story, you said that there was this phase early on when somehow it was gaining the ability to do the generalizing solution, even though it was also kind of memorizing. Can you remind us how you determined that, if it wasn't by looking at all the weights and analyzing them at every single checkpoint? Yeah, so okay, to answer this we need to dig a bit more into the details of how the algorithm is implemented. The structure of a transformer is roughly: there's the embedding matrix, which is a lookup table that maps the vocabulary of input tokens to vectors in the residual stream; there's the attention layer, which moves information between tokens (here we've just got three tokens, a, b, and the equals sign, and attention is just moving things from a and b to the equals sign, and doing a bit of computation, which isn't that important but is accounted for in one of the appendices); and then the MLPs happen, which do a bunch of processing on the combined bits from the two input tokens; and finally there are the output weights of the neurons, and then the unembedding, which is a linear map from the neurons to the logits. Yeah. And there are some other architectural details of the transformer, but they don't really matter for our purposes. And so the algorithm has three steps. One of them is mapping a and b to the sines and cosines of wa and wb, w being 2πk/n for a key frequency k. And people hear this and think it's really hard; it's actually really easy, because you don't need to learn the general function sine, you just need to learn sine on 113 memorized values. Note that I studied mod 113, though the same algorithm seemed to transfer to everything else we've checked. And it's 113 because you want the thing you're taking the modulus by to be prime, for various reasons? It does make things a bit cleaner; I don't think it's actually necessary. The exact same algorithm just works. Yeah, but you're learning the sine function on 113 data points, and it's just equally easy to memorize any function, because it's a lookup table, and this is done by the embedding. And you can just check whether the embedding has done this: there's a cute graph where you can just feed every possible sine and cosine function into the embedding, and it turns out there are just ten where it's big, which are the five frequencies we care about, and everything else is basically zero. Yep. And then the neurons multiply things together. There's a neuron cluster for each frequency w, and for each neuron in that cluster, the output is some linear combination of: a constant; terms like sin(wa) or cos(wb), trig terms in just one variable, constant across the other; and then trig terms that are things like sin(wa)·cos(wb), product trig terms. And so there are these nine possible terms, and every neuron in the cluster can be well approximated as a linear combination of these. If you delete everything that's not covered by these, the model does about as well. And interestingly, this means that each neuron cluster is rank nine, even though there are often a hundred or more neurons in the cluster, which is kind of wild. Sure. Note that I'm referring to directions over the batch dimension, the dimension of all inputs a and b. Sorry, what do you mean when you say you're referring to that dimension? Sorry. When I say you can approximate a neuron with, say, sin(wa)·cos(wb), that is a direction in the space of all pairs (a, b). Right, so you're saying that on each (a, b), the pattern of firing of this neuron is proportional to sin(wa)·cos(wb). Yes. Okay. And then it further turns out that the only directions that actually matter for the outputs are the sin(w(a+b)) and cos(w(a+b)) directions in neuron space, where sin(w(a+b)) is sin(wa)cos(wb) + cos(wa)sin(wb), by trig identities. These are just two of the nine directions the neurons represent; no other ones actually matter, and the neuron-logit weights pick up on those two for each cluster and map them to the cos(w(a+b−c)) outputs. Okay. And so there are a lot of claims this makes.
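A sketch of that embedding check; `W_E` is an assumed saved embedding of shape (113, d_model), and the claim being tested is the one above (ten big Fourier components, the rest near zero):

```python
import numpy as np

n = 113
W_E = np.load("embed.npy")                  # hypothetical file, (113, d_model)

xs = np.arange(n)
basis = [np.ones(n)]
for k in range(1, n // 2 + 1):
    basis.append(np.cos(2 * np.pi * k * xs / n))
    basis.append(np.sin(2 * np.pi * k * xs / n))
basis = np.stack([v / np.linalg.norm(v) for v in basis])   # (113, 113)

norms = np.linalg.norm(basis @ W_E, axis=1)  # size of each Fourier component
print(np.argsort(norms)[-10:])   # should be sin/cos of the 5 key frequencies
```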
Yeah. And the particular crux we use to distinguish between the generalizing and memorizing solutions is the fact that we can make this claim: across the 12,000-odd input pairs, there are exactly ten directions in the input space that matter for producing the neuron activations; sin(w(a+b)) and cos(w(a+b)) for the five frequencies. Yeah. If you delete these directions, this will completely kill the generalizing algorithm, but it should basically not affect the memorizing algorithm, because the memorizing algorithm should be kind of diffuse across the frequencies, across all directions. This statement is not entirely obvious, because maybe the memorization is also using those directions for some reason. Yeah, in practice, probably fine. And if you delete everything that's not those directions, it will probably completely kill the memorization algorithm, but the generalization algorithm should be totally fine. Okay. And so we designed these two metrics. Restricted loss is where we evaluate the test loss after getting rid of everything but these ten directions in neuron space, and also, I think, snipping the residual connection around the neurons, so the output is literally just a combination of these ten dimensions plus a bias term, which is just the average of everything. And when we look at restricted loss, what we see is that it goes down significantly before test loss crashes, and it's going down throughout the whole circuit-formation period. It also goes down a bunch more during cleanup, which suggests that the circuit formation/cleanup distinction is not perfectly pure. But the interpretation, to my eyes, is that we're kind of cleaning up the memorization noise for the model, and it turns out that when we do this, the model is just better; the model can generalize much faster and much more cleanly. Yeah. And so essentially these phases are phases of dependence on the directions in input-token space that this trigonometric algorithm depends on? Yes. I will note that an important subtlety about everything I'm describing is that I am not ablating inputs. I am saying: let's feed the inputs into the model, let's produce the neuron activations, which are a 512-dimensional vector for every pair of inputs, and let's think of this as an enormous 113-squared by 512 tensor. I no longer care about how these were computed, but now I can start removing directions in 113-squared-dimensional space. I'm kind of running all inputs through the model and then doing fiddling; I'm not doing fiddling and then running things through the model. Okay. Which is a kind of weird thing to do, and reasonable to take issue with. Sure. Okay, so at this point, now that we understand the paper a bit better: there's also a second progress measure? Oh, sorry, yes. It's called excluded loss, which is where you delete the ten key directions on the training data, and then you look at how much this harms train performance. Yeah. And what we find there is that during memorization it mostly tracks train loss, but then, over the course of circuit formation, it diverges and gets worse and worse, and this tracks the claim that it's transitioning from memorization to generalization. Okay. So yeah, at a big-picture level, I think of the story of this paper as: hey, see this phenomenon that you thought was sudden? There's actually this relatively gradual underlying thing in the network that drives the sudden transition.
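A hedged sketch of that restricted-loss projection; `acts` is the assumed (113², 512) tensor of neuron activations over all input pairs, and the frequencies are placeholders:

```python
import numpy as np

n = 113
key_freqs = [14, 35, 41, 52, 73]              # hypothetical key frequencies
acts = np.load("neuron_acts.npy")             # assumed shape (n*n, d_mlp)

a = np.repeat(np.arange(n), n)                # a-value of each input pair
b = np.tile(np.arange(n), n)                  # b-value of each input pair

dirs = []
for k in key_freqs:
    w = 2 * np.pi * k / n
    dirs += [np.cos(w * (a + b)), np.sin(w * (a + b))]
dirs = np.stack([d / np.linalg.norm(d) for d in dirs])   # (10, n*n)

# These ten batch-space directions happen to be exactly orthogonal here,
# so this is a projection onto their span:
acts_restricted = dirs.T @ (dirs @ acts)
# Push acts_restricted through the output weights to get logits and the
# restricted loss; it falling while test loss is flat is the key signature.
```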
So I guess my first question is: how is it the case that this gradual development, and this gradual dying down of the memorization, results in, at the end of it, the test loss suddenly going down? Do you have some story for where that kind of discontinuity is coming from? Sorry, I didn't quite follow the question. So, if the underlying driver of successful generalization is basically these circuits, and we're saying that these circuits are gradually appearing, and the memorization parts of the network are gradually dying away: if the underlying factors are gradual, why is there a sudden decrease in test loss at some point? Sure. So, first off, this is not a thing I consider us to have rigorously answered in the paper; the following is a bunch of my guesses, which I think are probably correct, but definitely nowhere near the point where I'm like, you should believe this because my paper is so cool. So I guess there are several things going on. The first is just this whole cleanup-versus-generalization thing, where it seems kind of plausible to me that there's some kind of threshold, and there are kind of feedback loops where, once the memorization stuff starts to no longer be necessary, it gets suppressed rapidly, and this uncovers the existing circuit. And it seems completely not crazy to me that these are things that just happen on different timescales, so it looks kind of sudden. So, sorry, the suddenness is just that one of the things happens quicker than the other? Yeah, pretty much. Once it reaches the point where the memorization's usefulness is outweighed by the weight decay, it just cleans it up. Okay. And maybe this happens rapidly, or there's kind of a snowballing effect, where there's a force to keep it and a force to get rid of it; the force to keep it is initially really, really strong, but then gets weaker and weaker until it slightly crosses the boundary, and now it's starting to clean it up, and then that accelerates, because the more it's removed, the weaker the force to keep it. Okay. Or something. So, hypothesis two:
there's just something weird about the optimizer we're using that changes our results a bunch, because optimizers are really, really weird. Specifically, we're using AdamW, which is the variant of Adam that uses weight decay in a principled way. The details of how it works are annoying, but very roughly: Adam has momentum, so it tracks all of the recent gradients and points towards an average of those. It also normalizes, which means it tracks the kind of noise of the gradients, the average squared gradient, and scales down by that; so if the gradient's been really noisy recently, it lowers things. But then it does weight decay without momentum, meaning that it's just reducing the weights at each time step, independent of what the weights were at previous time steps, and in a way that's kind of out of sync with everything else Adam is doing. And this is a really weird optimizer; it's also basically the state-of-the-art optimizer that everyone uses, and there are a lot of weird things about grokking where it's plausible the optimizer matters. Yeah, there was this comment on the LessWrong post that accompanied this paper that basically had some toy model for why you should expect this to happen with Adam-based optimizers, but you shouldn't expect it to happen with just vanilla stochastic gradient descent. Yes. I'm wondering what you made of that. So, I believe it does happen with SGD, but this is an experiment that one of my collaborators ran that I have not run, so I can't claim that with confidence, and I think they were having issues. My recollection of that comment is that the stuff it was saying was pretty reliant on the basis: the fact that the relevant numbers that were moving around were in the basis in which you were applying Adam. It wasn't that there was a meaningful direction in space that was spread across many coordinates; it was aligned with a specific basis element. Yeah. And that's not true in the modular addition case. But I also haven't read the comment in a while; I could just be completely wrong. The third reason for what I think is going on is much more closely linked to phase transitions. Hmm. So I have this hypothesis that basically all circuits that form in a model should form as phase transitions, and my intuition behind this is... sorry, before you say your intuition, what do you mean by "all circuits form as phase transitions"? Sorry, yeah. What I mean is: if there is some circuit in a model that does some task, this hypothesis predicts that if you look at whether that circuit exists, or at the components of that circuit, these form in a sudden shift, where you go from "does not have" to "has" pretty rapidly, rather than as a kind of smooth, gradual thing. Yeah. I don't particularly claim this is universal, and exactly what you mean by a circuit is going to vary, but the induction heads are a canonical example where this happens, and it's really weird and unexpected. So, going back, the thing that I think is interesting is: if you think about an induction head, it's kind of wild that this happens at all, because you have these two bits of the model, the previous token heads and the induction heads, and each of these kind of only really makes sense in the context of the other one being there.
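For reference, a rough sketch of the AdamW update described above (standard hyperparameters assumed; `wd` is set high in the spirit of grokking setups):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=1.0):
    m = b1 * m + (1 - b1) * g          # momentum: average of recent gradients
    v = b2 * v + (1 - b2) * g ** 2     # tracks gradient scale / noisiness
    m_hat = m / (1 - b1 ** t)          # bias corrections
    v_hat = v / (1 - b2 ** t)
    # The gradient step is normalized by the noise estimate; weight decay
    # is applied to the weights directly, with no momentum, out of sync
    # with everything else Adam is doing.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```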
In natural language, this is slightly untrue, because looking at the previous token can be helpful; but you also get induction heads if you just train on a synthetic task with random tokens with some random repeated substrings, and in this task, knowing the previous token is literally independent of the current token. Yeah, there's no incentive. Yet these form induction heads, with a sharp phase change. And there's this kind of interesting thing where each part is only very useful if the other part is there; why does this form at all? And my hypothesis is there's some kind of lottery-ticket-style thing going on here, where there are some directions in weight space that kind of contain the previous-token bits, and some directions in weight space that kind of contain the induction bits, and these reinforce each other, because if each of them is stronger, then the other is more effective at reducing the loss; and there are lots and lots of other directions in weight space that are doing stuff that's either actively harmful or irrelevant, and just not reinforced. And this hypothesis predicts that there is going to be a gradient signal that slowly reinforces the previous-token bit and the induction bit, but that it's kind of spread out and initially very slow, and the further the progress gets, the stronger the gradient acting on either is. Hmm. If you just jump to the end: if you imagine that you insert a fully fledged induction head into the model at the start, it would be really easy to form the previous token head, and vice versa. Yeah. Adam Jermyn and Buck Shlegeris have a good post on s-shaped curves and phase transitions, where they build some toy models of this; that might be worth checking out. I've always been a bit skeptical of this understanding of the lottery ticket hypothesis, just because, for one, it's actually a bit more detailed than what they actually claim in the lottery ticket hypothesis paper. Oh, like, if you read the paper, it's a much weaker claim than this, or the experiment shows something much weaker than this? What exactly do they claim in the paper? So, in the paper, the thing they show (this gets modified a little bit, and there are future papers which elaborate on this, and I should say this is from memory, so it could be slightly wrong) is: here's a thing you can do. You can randomly initialize a neural network, then you train it, and once you've trained it, you look at the weights that had the highest magnitude. And you say, okay, these weights form a sub-network of the whole neural network; I think this is the sub-network that matters. And so you select this sub-network: you basically mask out all the other weights at initialization. So you say, the places in the neural network that ended up being important at the end, I'm going to include only those at initialization, and mask out everything else, keeping the random initialization fixed, right? And then they show that that sub-network could be trained to completion faster than the original network did.
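Going back to the mutual-reinforcement story above, a toy model of it (invented for illustration, in the spirit of the s-shaped-curves post): two scalar components that only reduce loss through their product show a long plateau and then a rapid rise, even though gradient descent is perfectly smooth:

```python
# u stands in for the previous-token bit, v for the induction bit; loss is
# only reduced when both are present, here via loss = (1 - u*v)**2 / 2.
u, v, lr = 0.01, 0.01, 0.05
for step in range(201):
    err = 1.0 - u * v
    du, dv = -err * v, -err * u        # d(loss)/du, d(loss)/dv
    u, v = u - lr * du, v - lr * dv
    if step % 20 == 0:
        print(step, round(u * v, 4))   # flat for ages, then shoots up to 1
```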
It doesn't show that, for instance... I think something that a lot of people get from this is, "oh, all training was really doing was locating the sub-network", or "the sub-network that trained is somehow the same as the thing that the network actually gets", and none of those are shown in the paper. And I think the rest of the network is totally going to influence the gradients, and so the idea that it's just finding this sub-network at initialization, this thing that was already there at initialization... well, I'm pretty sure that the weights there evolve over time, and I'm pretty sure that the rest of the weights are going to matter and change the gradients. So, I don't know; maybe what's happened is I've just developed this general reaction to people claiming things about lottery tickets where I'm skeptical. Yes, I should probably clarify that when I wrote the initial version of my grokking paper, I had not actually read the lottery ticket hypothesis paper properly, and just had a vague vibe and was like, this is kind of lottery-ticket-like. Yeah. And one important disanalogy I found when I actually read the paper was that they specifically focus on the kind of privileged basis of weights, where you've got neurons on the input and neurons on the output, and each weight is between a pair of neurons, which is kind of meaningful; and this just does not work in a transformer. Yep. You're saying, look, the weights just have some inner product with this algorithm, and the closer the weights get to inner-producting with this algorithm, that sort of feeds on itself, because this bit reinforces that bit, and that bit reinforces this bit. Yeah, which maybe is more plausible. There's some cute circumstantial evidence for this. So let's say at the end of training there's a direction that corresponds to sin(15a) in the embedding. If you look at that direction at the start of training, on the randomly initialized embedding, and then you look at which trig components it most represents, there's a spike on sin(15a). So this direction definitely had something to do with sin(15a) at the start of training; it just randomly happened to be that it was closest to sin(15a). Yeah. Okay. I can't remember which way around I did it: it could have been "what does the sin(15a) direction point to at the start, and how does that compare to what it points to at the end", or "whatever it points to at the end, how much does that point to sin(15a) at the start"; but there's more alignment than I would have expected by chance. And Eric Michaud, who's an MIT PhD student who did some work of his own on grokking, has a great animation where he takes the principal components of the embedding at the end of training and then visualizes the embedding on those principal components throughout all of training, and it's kind of circular from the start. Yeah, I did see that animation. Anyway, another reason why this phase transition stuff might be going on is this idea of symmetries. So let's imagine that you're trying to optimize the function x² to be as close to 1 as possible. If you start at, like, a half, it's pretty easy; you just go to 1 pretty rapidly. But if you start at, like, 0.0001, it's really hard, and intuitively, the reason it's hard is that you're so close to the two equally good minima, -1 and 1, that the gradient signal pushing you towards the one you're closer to is really, really weak.
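A quick numeric version of that intuition, minimizing f(x) = (x^2 - 1)^2 by gradient descent (the step size is an arbitrary choice):

```python
# Near x = 0 you sit between the two symmetric minima at -1 and +1, and
# the gradient 4*x*(x**2 - 1) is proportional to x, so progress crawls.
for x0 in (0.5, 1e-4):
    x, lr, steps = x0, 0.05, 0
    while abs(abs(x) - 1) > 1e-3:
        x -= lr * 4 * x * (x ** 2 - 1)
        steps += 1
    print(f"start {x0}: {steps} steps")   # the tiny start takes far longer
```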
Anyway, another reason why this phase transition stuff might be going on is this idea of symmetries. So let's imagine that you're trying to optimize the function x² to be as close to 1 as possible. If you start at, like, 0.5, it's pretty easy - you just go to 1 pretty rapidly. But if you start at, like, 0.0001, it's really hard, and intuitively the reason it's hard is that you're so close to the two equally good minima, -1 and +1, that the gradient signal pushing you towards the one you're closer to is really, really weak. This means that initially you make progress very, very slowly, and then the further you get towards 1 rather than -1, the faster you make progress. And it seems kind of plausible to me that when there are multiple different solutions that are kind of symmetric with each other, there's some direction in model space where the model is stuck between two minima and it's really slow at first. Credit to Buck Shlegeris and some MLAB participants for this hypothesis, by the way.
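To see the symmetry story numerically, here is a toy run of that example: minimizing L(x) = (x² - 1)² by gradient descent. The learning rate and tolerances are arbitrary choices of mine; the point is just that starting near the symmetric point x = 0 stalls, then accelerates.

```python
def grad(x):
    # L(x) = (x**2 - 1)**2, so dL/dx = 4 * x * (x**2 - 1)
    return 4 * x * (x**2 - 1)

for x0 in (0.5, 1e-4, 1e-8):
    x, lr, steps = x0, 0.01, 0
    while abs(x * x - 1) > 1e-3 and steps < 10**6:
        x -= lr * grad(x)
        steps += 1
    # Near 0 the update is roughly x <- x * (1 + 4 * lr): a slow exponential
    # escape, so the closer x0 is to the symmetric point, the longer the stall.
    print(f"x0 = {x0:g}: reached {x:+.4f} after {steps} steps")
```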
Yeah. It's also reminiscent of this follow-up paper to the lottery ticket hypothesis - I think it's called 'Linear Mode Connectivity and the Lottery Ticket Hypothesis' - where they basically say that in general, what actually happens, instead of this thing where you can just take the subset of weights at initialization, is that you have to train for a bit, and then it converges such that you can pick the subset of weights. And their explanation is that initially there's some noise in stochastic gradient descent - like which examples you happen to see first, or something - but once you're robust to the noise, then you can take the ticket. It seems very reminiscent of this: there's a little while where you're like, 'which side of zero am I on?', and then you eventually get to the phase where the signal is strong.

Makes sense. Interesting. Cool. So one way you can read this paper is: suppose you're worried about your model rapidly developing some capability - I guess 'sharp left turn' is the new hot phrase for this...

'Sharp left turn' is a specific phrase for a really specific emergence phenomenon.

Right, like it develops intelligence or something - but I think it's an example of this. Anyway, on the one hand you could read your paper as saying: oh, we can understand the underlying mechanisms, and there's some roughly gradual thing. But if it's the case that this restricted loss just keeps going down gradually, or these circuits at some point appear - maybe you don't know which circuits you should be looking for, and maybe some circuits only appear after other circuits. In general, I'm wondering how you think this bears on the question of: can we foresee rapid changes in advance, before they happen?

Yeah. So I think this is a pretty valid criticism - it's not obviously a criticism, it's a question, but you can take it as a criticism. To me, the criticism here is: you're saying you built these progress measures, but what's the threshold at which it matters? Without a threshold, how good a progress measure is it?

Yeah, that's kind of the criticism.

Which I think is just a correct criticism. One of the annoying things about trying to distill the paper and submit it is that this meant we needed to think hard about crafting a narrative, and in my opinion this is just a less compelling narrative than other ones, but it fits academic publishing norms better. Anyway, I basically think this is just correct. I think the progress measures are much more exciting when (a) there's some kind of threshold at which interesting stuff happens, and (b) you can extrapolate outwards how long it will take until you get there. I think weak versions of this are interesting, and having a progress measure seems a lot better than having nothing and being totally blindsided, but it's definitely a lot less satisfying. Other major weaknesses are that we needed to understand the final circuit before we could build these progress measures - and, you know, if you need to reverse-engineer the model after the emergence has happened, not great. I am pretty excited about things like: fine-tune a smaller model to do this and reverse-engineer that, or make a toy task that's a synthetic replication of this and reverse-engineer a model trained to do that simpler one. One thing I'm really interested in is how much addition in large language models uses some version of my grokking work's modular addition algorithm, because I found some suggestive evidence. If you look at, say, base-10 addition in a toy one-layer transformer, I found some suggestive evidence that it was sparse in the Fourier basis. And it seems non-crazy to me that the right way to do it is to think about things as addition mod 10, but maybe with a carry digit thrown into the mix. And in normal language models, tokenizers make it so it's approximately addition mod 100, which makes it even more relevant - because 10 is kind of easy - although tokenization sucks, and the number of digits per token is not constant, which is horrifying. Anyway, I am very interested in whether this algorithm is relevant there, because that would feel like strong evidence for my 'toy models are good models' hypothesis. I think the fact that we needed to reverse-engineer the circuit first, and that we don't have a clear notion of criticality, is definitely a weakness. I think it's a pretty good proof of concept, and having any notion of a metric towards this is just pretty great, but there's a ton of work to do if you want this to be something that can be made really precise or used for real forecasting. It also wouldn't surprise me if you could refine these metrics into something which actually has significant predictive power. I did some playing around where I did some dumb things like: take the logits as an enormous flattened vector, stack this across all checkpoints, and then do PCA on that. Then I took the first two principal components and tried to extrapolate from where I was to how long it would take to grok. It was not terrible: if you were at epoch 8,000, it predicted you'd get there at 11,000, and it got there at, like, ten and a half thousand. This method was a totally, incredibly hacky mess, and I might be misremembering it. Also, it only worked when the principal components were taken over all of training - if you take principal components only up to that point, it doesn't work, and I don't know why, which is not great. But it wouldn't surprise me if this could be turned into a thing with criticality.
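As a rough sketch of that checkpoint-PCA experiment (with a random drifting trajectory standing in for the real logits, and my own guess at the extrapolation step, since the details above are explicitly half-remembered):

```python
import numpy as np

rng = np.random.default_rng(0)
n_ckpt, n_logits = 200, 4096
# Stand-in for "flatten the logits at each checkpoint into one huge vector".
logits = rng.normal(size=(n_ckpt, n_logits)).cumsum(axis=0)

# PCA over *all* checkpoints (per the transcript, fitting the components only
# on checkpoints seen so far did not work, for unclear reasons).
X = logits - logits.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
traj = X @ Vt[:2].T  # each checkpoint as a point in the top-two-PC plane

# Hypothetical extrapolation step: fit a line to PC1 over the checkpoints seen
# "so far" and predict when it reaches the grokked model's PC1 coordinate.
seen = 120
t = np.arange(n_ckpt)
slope, intercept = np.polyfit(t[:seen], traj[:seen, 0], 1)
eta = (traj[-1, 0] - intercept) / slope
print(f"predicted arrival ~checkpoint {eta:.0f} (actual end: {n_ckpt - 1})")
```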
Okay. Now we've talked about these papers, is there any future follow-up to them that you're particularly excited by?

So, my research interests have mostly moved on from grokking and from looking at these super-toy models that are so far removed from language models, which are my main research interest. But I am currently mentoring a project where we found that you can generalize the algorithm to arbitrary group composition, which is pretty exciting, and I'm currently trying to write that one up. So that's a cool direction. More generally, I feel like the biggest thing missing in our whole understanding of transformers is: what is up with the MLP layers? How do we engage with them? What is superposition? What do the layers even mean? How well do our conceptual frameworks work with them? One direction I'm personally very excited about is reverse-engineering neurons in just a one-layer transformer with MLP layers, because we're just not very good at this, and I feel like if we were a lot better, that would probably teach us something. That's probably the big category of things I'm interested in. The other big category is: I want at least 20 examples of well-understood circuits in real models - like, real language models. We have, like, three. Redwood is currently running this REMIX sprint where they're trying to get a bunch more, and this seems awesome, but I would love as many circuits as possible, ideally studied to the level of detail and universality of induction heads. The reason this feels important to me is that a lot of the vision of mechanistic interpretability is that if we had some grounded understanding, a lot of things would become less confusing. If we're trying to do things like say what the principles underlying networks are, or which techniques actually scale, or work, or are reliable, just having a test set of, like, 20 circuits to try them on would feel like such a massive upgrade over the current state of the art. So that's a category of stuff I'm pretty pumped about. The direction following my grokking work that feels most interesting to me would either be work building further on phase transitions in real models - somewhere at the intersection of this and the induction heads work, looking for more of them, trying to test these ideas that phase transitions are a core part of life in real models - or the other direction: my personal vision for the philosophy outlined in my grokking work is mechanistic interpretability as a line of attack on science-of-deep-learning mysteries. The claim that we can actually reverse-engineer networks is a very bold and ambitious claim, and if it is true and we're getting real insights, we should be in a much better position to understand deep learning mysteries than people who are forced to mostly view networks as black boxes. The philosophy underlying my grokking work was: I was confused, I built an example model organism, reverse-engineered it exhaustively, distilled this into an actual understanding, and used this to figure out what was going on. To me, this feels like a pretty general model that can be applied to many other questions that confuse us in deep learning. Anthropic's 'Toy Models of Superposition' feels like a good example of a similar philosophy applied to superposition, this weird phenomenon in transformer MLPs, and I just feel like there should be a lot of traction there. I have written a sequence called '200 Concrete Open Problems in Mechanistic Interpretability', which you can find on the Alignment Forum - that is 200 open problems I would like to see people work on - so check that out for many more Neel takes on things that should happen.
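As a pointer to what the 'Toy Models of Superposition' philosophy mentioned above looks like in practice, here is a minimal sketch of that paper's basic setup: n sparse features compressed into fewer dimensions through a tied linear map with a ReLU readout. Hyperparameters here are illustrative choices of mine, not the paper's.

```python
import torch

torch.manual_seed(0)
n_feat, d_hidden, p_active = 20, 5, 0.05  # 20 sparse features, 5 dimensions
W = torch.nn.Parameter(0.1 * torch.randn(d_hidden, n_feat))
b = torch.nn.Parameter(torch.zeros(n_feat))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5_000):
    # Each feature is active independently with probability p_active.
    x = torch.rand(1024, n_feat) * (torch.rand(1024, n_feat) < p_active)
    x_hat = torch.relu(x @ W.T @ W + b)  # compress to d_hidden, reconstruct
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# With enough sparsity, the model typically stores more than d_hidden features
# by packing them into non-orthogonal (often antipodal) directions - the
# signature of superposition.
print("per-feature embedding norms:", W.norm(dim=0).detach().round(decimals=2))
```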
All right. So that's how we should think of the relationship of this work to other questions in mechanistic interpretability. One thing I wonder is: there are various subfields of AI alignment, but I think they're not totally isolated - there are relations and potential complementarities between them. Do you see any complementarities between either this work in particular, or mechanistic interpretability in general, and other things people are pursuing in alignment?

Yeah, so I think there are quite a few complementarities, though not obviously dramatically more than there are between mechanistic interpretability and any other field of AI, since, again, if we're really understanding networks, that should just be relevant everywhere. But some things do seem particularly salient. Fundamentally, what we're trying to do in alignment is make claims about model internals, using fuzzy - what Richard Ngo calls 'pre-formal' - ideas about how to reason about networks: goals and agency and planning, the idea that a goal can be internally represented, the idea that a model can be aware of itself and have situational awareness. It's not even obvious to me that these are coherent concepts to talk about in models, it's not obvious to me how they map onto actual model cognition, and it's really easy to anthropomorphize. I feel like if we just had good examples of reverse-engineered agents on reinforcement learning problems - the thing I'd really love is reverse-engineering work on an RLHF model, reinforcement learning from human feedback - if we had examples we understood there, that would teach me a lot and really help ground a lot of the field's ideas. Another idea would be going in the opposite direction: there are all of these ideas around, say, capability robustness versus objective robustness that give interesting case studies of agents doing things we think are models for potential alignment failures. Reverse-engineering models is hard - there's only so much researcher time and effort - so it would be valuable to focus on problems that are particularly interesting to alignment and see if we can shed insight on what is actually going on there: how much is this a real thing versus some kind of correlational fuzziness? Unclear. A third thing would be trying to give better feedback loops to alignment research. If you're trying to find an alignment approach that actually results in systems that do what we want, it can be hard to judge this by just looking at the system's actions. You can plausibly do this with techniques like red-teaming or adversarial training, but having the ability to look into the system and ask 'is this thinking deceptive thoughts?' just seems like it should be a massive help. I think the field is moderately far from the point where we can do that, but this seems like something that should make alignment research much easier - though I have not actually spoken enough to alignment people about how concretely relevant this is to their work for me to be highly confident in this.
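The simplest version of 'look into the system and check' is a linear probe on cached activations. Everything below is hypothetical scaffolding - random arrays standing in for residual-stream activations and for labels from some trusted source - and nothing here establishes that a concept like deception is actually linearly decodable; it only shows the shape of the experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 768))      # stand-in: [n_examples, d_model]
labels = rng.integers(0, 2, size=2000)   # stand-in: honest vs. not, per example

# Fit a linear probe on one split, test on a held-out split. With real data,
# held-out accuracy far above chance would suggest the concept is at least
# linearly represented at this layer - though not that we understand it.
probe = LogisticRegression(max_iter=1000).fit(acts[:1500], labels[:1500])
print("held-out accuracy:", probe.score(acts[1500:], labels[1500:]))
```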
Then there are various varieties of more ambitious ideas about integrating interpretability into actual alignment schemes or training schemes. Like, in RLHF, if you have human feedback givers who have interpretability tools to notice weird things the models are doing, then maybe this is helpful. Obviously, this is an extremely scary and suspicious thing to do, because if you start training on a tool, it's easy to break the tool, and we're nowhere near the point where I think we've got interpretability tools that are actually robust enough that I'm comfortable applying any amount of gradient descent against them. But that's an ambitious complementarity. One thing I will emphasize, kind of dovetailing with that, for people who are listening to this and wondering 'should I go work in mechanistic interpretability?': in my opinion, mechanistic interpretability just has a really unique vibe as a field that is nothing like what I experienced in other bits of ML. It's very much like: partially I'm a mathematician dealing with this formal mathematical object, and partially a natural scientist dealing with this confusing, complex organism, running experiments to try to form true beliefs - but also writing code so my experiments run really fast and I understand them well, and partially just running a bunch of code and probing these objects. This is a really fun and fascinating workflow, and my prediction is there's a bunch of people who'd be good fits for this who'd be bad fits for other bits of alignment - and people for whom it's the reverse - and it's worth checking.

Cool. So before I end the interview, is there anything that you wish I'd asked but haven't yet?

This is a great question. I did take opportunities to shoehorn some of those in earlier, like the science-of-deep-learning narrative of my grokking work. But I feel like there's a big theme we never really got into, which is: is any of this remotely a reasonable thing to be working on? Isn't this just ludicrously ambitious and never going to work? Aren't networks just fundamentally not interpretable? Which is a pretty common and strong line of criticism.

Huh, yeah. I guess this is a thing that people other than me think - I forgot that that was a thing. So yeah: aren't neural networks just fundamentally inscrutable? How dare you challenge the gods?

Nah, seems completely doable - that's my brief answer. I think a pretty legitimate criticism I often hear leveled against the field is that there's a lot of cherry-picking going on: induction heads are a really nice algorithmic task that occurred in tiny models; indirect object identification is a more complex task, but still pretty nice and algorithmic; my grokking work is obviously incredibly toy and neat and algorithmic; small models are obviously easier than big models, so there's a kind of lack of scale; etc. And I think it is totally consistent with all available evidence that there's some threshold difficulty, and we're going to be good at things up to there and then completely fail beyond that point - and this is just an extremely hard claim to falsify, for obvious reasons. But my internal experience of getting into this field was something like: it seems like neural networks are just kind of an inscrutable mass of linear algebra, and I don't see why they should be interpretable. Then I saw some of Chris Olah and collaborators' work at OpenAI on image circuits, and I was like: wait, what, you can interpret neurons? And then they found some circuits, and I was like: what? This is weird. And then they found that you could hand-code some curve-detecting neurons and substitute them in, and it kind of worked, and I was like: what?!
But then I was like: ah, it's never going to scale, it's never going to transfer to language, it's going to be really specific to each model. And then it turns out transformers are pretty doable - in some ways much easier and in other ways much harder. It turns out that induction heads are an interesting, non-trivial circuit that occurs at all model scales. And I feel like I just keep being surprised in a positive direction - and, you know, by induction... But yeah, I think this is just an open scientific question that we don't have enough data to bear on either way. And I'm pro there being bets made of the form 'let's assume mech interp is completely doomed - what would we do then?'. There are bets people make that are like, 'mech interp is the wrong way to do interpretability, but interpretability is really important, so let's go do these other things'. I think mech interp is great and really promising and a really good bet to make - but I'm biased, and I accept there's a decent chance I'm wrong and it's just completely doomed.

Sure.

And then the final problem is just the problem of scale. Even if we're actually very good at reverse-engineering models in principle, what do we do if we can reverse-engineer each neuron in a GPT-3 in, like, a day each, and there are millions of neurons? I low-key hold the position that this seems like a really nice problem to have - we'll solve it when we get there. More seriously, I do think that just having any solid baseline of even knowing in principle how we might reverse-engineer GPT-3, given a ton of time and effort, seems like an enormous win. I also think that a world where we're near having genuinely dangerous transformative AI is a world where AI might just be a lot better at automating things, and also where the mechanistic interpretability researchers are hopefully very close to the people making the big scary APIs and can figure out how to leverage them. And finally, it's plausible to me that the main thing we need to get done is noticing specific circuits to do with deception and specific dangerous capabilities like that - situational awareness, internally represented goals - and that getting good at that is a much easier problem. But I don't know. I think these are just actually valid criticisms, we don't have enough data to answer them, and it's completely plausible that the field is just doomed - though I think this is true of all research agendas - but it is worth knowing.

Seems fair. So, speaking of it being worth knowing: if people want to know more about your research and the work you do, how can they follow you?

Yeah, so following me on Twitter is probably the best way to get all of my random musings.

Okay, what's your handle?

I think it's NeelNanda5 - I spend far too much time on Twitter. I also generally post things on the Alignment Forum and on the mechanistic interpretability section of my blog, and I should release an open mailing list for that at some point - I haven't got around to it yet. So you should check out my work. Particularly notable things people might be interested in: I made this library called TransformerLens, for just having good open-source infrastructure for doing mechanistic interpretability of language models like GPT-2, which is designed to make research fun and not a massive headache.
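Since TransformerLens and induction heads both came up, here is a sketch of the standard repeated-random-tokens diagnostic using that library. The library calls are its public API as of this writing; the score definition follows the usual prefix-matching idea, but the details (offsets, thresholds, sequence length) are illustrative choices of mine.

```python
# pip install transformer_lens
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# An induction head attends from each token back to the token *after* the
# previous occurrence of that token. On a random sequence repeated twice,
# that means attending from position t to position t - (L - 1).
L = 50
half = torch.randint(1000, 10000, (1, L))
tokens = torch.cat([half, half], dim=1)
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]  # [n_heads, query_pos, key_pos]
    # Mean attention mass on the induction offset, per head.
    score = pattern.diagonal(offset=-(L - 1), dim1=-2, dim2=-1).mean(dim=-1)
    best = score.argmax().item()
    print(f"layer {layer}: head {best} induction score {score[best].item():.2f}")
```

On GPT-2 small, a few heads in the early-middle layers typically stand out with much higher scores than the rest, though which heads you see will depend on the model.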
I have this post called 'Concrete Steps to Get Started in Mechanistic Interpretability', which is my attempt to create a guide that can take someone who is excited but doesn't know anything about the field to the point where they're actually trying to do original research. I have a YouTube channel with a bunch of paper walkthroughs and other interpretability tutorials - if you enjoyed hearing me ramble about a bunch of papers here, you might enjoy my paper walkthroughs. I have a sequence called '200 Concrete Open Problems in Mechanistic Interpretability', where I try to lay out a map of the field and what I think the big areas of open problems are, with a bunch of takes on why they matter, followed by the open problems themselves and my takes on the right way to orient to them as a researcher, pitfalls, and how to do good work. And finally, I have a comprehensive mechanistic interpretability explainer, where I try to give a lot of the concepts, ideas, and terms that I think people should understand if they're trying to engage with the field - which you can either read through in full, to get a concentrated dump of a lot of my intuitions and ideas, or use as a reference; it includes sections on most of the papers we discussed here.

Yeah, this was really fun - thanks for having me on.

Links to all of those will be in the description. Thanks for talking, and to the listener: I hope this was a valuable episode for you. This episode was edited by Jack Garrett, and Amber Dawn Ace helped with the transcription. The opening and closing themes were also by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund. To read a transcript of this episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.

Related conversations

AXRP

28 Mar 2025

Jason Gross on Compact Proofs and Interpretability

This conversation examines technical alignment through Jason Gross on Compact Proofs and Interpretability, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Med 0 · avg -1 · 139 segs

AXRP

1 Mar 2025

David Duvenaud on Sabotage Evaluations and the Post-AGI Future

This conversation examines technical alignment through David Duvenaud on Sabotage Evaluations and the Post-AGI Future, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Med -9 · avg -7 · 21 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Med -6 · avg -7 · 120 segs

AXRP

27 Jul 2023

Superalignment with Jan Leike

This conversation examines technical alignment through Superalignment with Jan Leike, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Med -10 · avg -7 · 112 segs

Counterbalance on this topic

Ranked with the mirror rule from the methodology: picks sit closer to the opposite side of this page's score on the same axis (lens alignment preferred). Each card plots this page and the pick together.

Mirror pick 1

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -14.44 · This pick -10.64 · Δ +3.80

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Med 0 · avg -0 · 108 segs

Mirror pick 2

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -14.44 · This pick -10.64 · Δ +3.80

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Med 0 · avg -5 · 133 segs

Mirror pick 3

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -14.44 · This pick -10.64 · Δ +3.80

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Med 0 · avg -4 · 72 segs