Science of Deep Learning with Vikrant Varma
Why this matters
This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.
Summary
This conversation examines core safety questions through the lens of Science of Deep Learning with Vikrant Varma, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forward · Mixed · Opportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 107 full-transcript segments: median 0 · mean −1 · spread −18 to 9 (p10–p90: −10 to 5) · 2% risk-forward, 98% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.
- Emphasizes alignment
- Emphasizes safety
- Full transcript scored in 107 sequential slices (median slice 0).
Editor note
A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video 4WNWeUQ7Hfc · stored Apr 2, 2026 · 3,261 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/science-of-deep-learning-with-vikrant-varma.json when you have a listen-based summary.
Show full transcript
[Music] Hello, everybody. In this episode I'll be speaking with Vikrant Varma, a research engineer at Google DeepMind and the technical lead of their sparse autoencoders effort. Today we'll be talking about his research on problems with contrast-consistent search, and also explaining grokking through circuit efficiency. For links to what we're discussing, you can check the description of this episode, and you can read the transcript at axrp.net. All right, well, welcome to the podcast. Thanks Daniel, thanks for having me. Yeah, so first I'd like to talk about this paper. It's called "Challenges with unsupervised LLM knowledge discovery", and the authors are Sebastian Farquhar, you, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik and Rohin Shah. This is basically about this thing called CCS. Can you tell us, what does CCS stand for, and what is it? Yeah, CCS stands for contrast-consistent search. I think to explain what it's about, let me start from a kind of more fundamental problem that we have with advanced AI systems. One of the problems is that when we train AI systems, we're training them to produce outputs that look good to us. And so this is the kind of supervision that we're able to give to the system, and we currently don't really have a good idea of how an AI system, or how a neural network, is computing those outputs. In particular, we're worried about a situation in the future when the amount of supervision we're able to give it causes it to achieve a superhuman level of performance at that task, and we can't, by looking at the network, know how it's going to behave in a new situation. And so the Alignment Research Center put out a report recently about this problem, and they named a potential part of this problem "eliciting latent knowledge". What this means is: if your model is, for example, really good at figuring out what's going to happen next in a video, as in, it's able to predict the next frame of a video really well given a prefix of the video, this must mean that it has some sort of model of what's going on in the world, and instead of using the outputs of the model, if you could directly look at what it understands about the world, then potentially you could use that information in a much safer manner. Now, why would it be safer? Consider how you've trained the model. Supposing you've trained the model to just predict next frames, but the thing that you actually care about might be, you know, is your house safe, or is the thing that's happening in the world a normal sort of thing to happen, a thing that we desire. And you have some sort of adversary, perhaps this model, perhaps a different model, that is able to trick whatever sensor you're using to produce those video frames. Now, the model that is predicting the next video frame understands the trickery, it understands what's actually going on in the world; this is an assumption of superhuman systems. However, the prediction that it makes for the next frame is going to look very normal, because your adversary is tricking the sensor. And what we would like is a way to access this implicit knowledge, or this latent knowledge, inside the model about the fact that the trickery is happening, and be able to use that directly. Sure. And so I take this as kind of a metaphor for an idea that, like, we're going to train AI systems, we're going to train them on an objective
of, like, "do stuff we like". And the notion is, you know, we imagine that we're measuring the world in a bunch of ways, right: we're looking at GDP output, we're looking at how many people give a thumbs up to stuff that's happening, various sorts of ways we can monitor the performance of an AI system. And an AI system could potentially be doing something that we actually wouldn't approve of if we understood everything correctly, but that we all give thumbs up to. That's right. And so ideally we would like to somehow get at its latent knowledge of what's going on, rather than just "does it predict that we would thumbs-up a thing", so that we can say, hey, do we actually want the AI to pursue this behavior, or are we going to reinforce this behavior, rather than just reinforcing things that in fact would get us to give a thumbs up even if it would suck in some way? That's right. So one way you can think about this problem is: we're trying to improve our ability to tell what's actually going on in a situation, so that we can improve the feedback we give to the model. And we're not able to do this just by looking at the model's outputs, or the model's prediction of what the world will look like; we want the model to connect the thing that we actually care about to what's going on in the world, which is a task that we're not able to do. Sure. So, with that out of the way, what was this CCS thing? Yeah, so CCS is a very early proposed direction for solving the problem of eliciting latent knowledge, and in brief, the way it works is: supposing you had a way to probe a model to tell you what it believed about some proposition. This is the kind of ideal thing that we want, supposing you had a solution to ELK. Then one property of this probe would be that the probability that it assigns to some proposition would satisfy the laws of probability, so for example it would satisfy P(X) = 1 − P(not X). And so you might try to use consistency properties like this to search for probes that satisfy them. Sure. And to be clear, by "probe" you mean like a learned function from the activations of the neural net to a probability or something? Yes. You could have probes of different levels of complexity; the particular probe used in CCS is a kind of very simple linear probe on the activations at some site.
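To pin down the two objects under discussion (an editorial gloss, not wording from the episode; the linear form below is the probe family the original CCS paper uses):

```latex
\[
  p_\theta(x) \;=\; \sigma\!\big(w^\top a(x) + b\big) \;\in\; [0,1]
  \qquad \text{(a linear probe on the activations } a(x) \text{ at some chosen site)}
\]
\[
  p(x) \;\approx\; 1 - p(\lnot x)
  \qquad \text{(the consistency property: a statement and its negation should get complementary credences)}
\]
```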
Yes, sure. So the idea is: there are some properties that are true of probabilities, like "the probability of X is one minus the probability of not-X", and the hope is, we train a probe to fit the laws of probability and so on, and hopefully that will get at the model's beliefs, because the model's beliefs will be probabilities? That's right. Yeah, so there's a lot of subtlety here. The thing I described is true of the model's sort of true beliefs, the true probabilities that it's assigning to things; it's also true of many other types of things that the model might be computing, and this is really the core of the problem that we've pointed out in our challenges paper. So I could go a bit into the problems, or I could go a bit more into how CCS itself works. Yeah, just tell us how CCS itself works, so that we have a sense of what we're talking about. Okay, so mechanistically, what's going on in CCS is that you take a set of factual yes-or-no questions that you think the model knows the answer to. An example of this is IMDb movie reviews, where you're fairly sure that a model of a certain size knows whether the sentiment in the review is positive or negative. Gotcha. Now, for each item in this dataset, so for each review, you construct a contrast pair by appending, for example, the string "this review is positive" or "this review is negative" to that item. Okay. And now you take the difference in the activations when you run these two contrast pairs through the model. I'm going to call these contrast pairs as having different charges, a positive charge and a negative charge. Okay. So you're constructing these contrast pairs with these pairs of activations, and then a thing you can do is you can cluster them. When you're clustering the difference between the positive and the negative charge, the thing you might hope to find is whether the model thinks that the charge that was appended to the statement is true or false; these two things will point in different directions, and so that's why you might hope that you could separate them. Yeah. Now, CCS is slightly different from this, in that it doesn't directly do the clustering. Instead, in CCS you learn a probe that directly outputs a probability of that statement being true, and then you're training the probe using this consistency property, the fact that the two opposite contrast pairs should be the negation of each other. Gotcha. And am I right that before you take the contrast pairs, you take all of the positive-charge activations and subtract off their mean and divide by the standard deviation, so that the differences aren't just pointing in the direction of "is the thing at the end saying this review is positive versus this review is negative"? Yes, that's right. So this is another pretty subtle point. One problem with this general method of classification is that if there are other differences that are salient between the two contrast pairs, that are not just "did I construct a true statement or did I construct a false statement", then you might end up separating your clusters based on those differences. Now, one obvious difference between the two contrast pairs is that you've appended a positive charge and you've appended a negative charge, and so that's a really straightforward one that we have to fix. The method proposed in the CCS paper to fix that is that you take the average positive activations and the average negative activations and you subtract those off, and so you might hope that the thing you're left with is just the truth value.
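A minimal sketch, in code, of the pipeline just described: per-charge normalisation followed by a linear probe trained against the consistency-plus-confidence objective. This is illustrative rather than the paper's implementation; the function names, shapes and hyperparameters are placeholders, and obtaining the activations for each charge from a model is left out.

```python
import torch

def ccs_loss(p_pos, p_neg):
    """CCS objective for a batch of contrast pairs.

    p_pos / p_neg are the probe's outputs in [0, 1] for the "... is positive"
    and "... is negative" completions of the same underlying item.
    """
    consistency = (p_pos - (1.0 - p_neg)) ** 2     # p(x+) should equal 1 - p(x-)
    confidence = torch.minimum(p_pos, p_neg) ** 2  # rule out the uninformative 50/50 answer
    return (consistency + confidence).mean()

def train_ccs_probe(acts_pos, acts_neg, steps=1000, lr=1e-3):
    """acts_pos, acts_neg: [n_examples, d_model] activations for each charge."""
    # Normalise each charge separately (subtract its mean, divide by its std),
    # so the probe cannot just read off which template was appended.
    normalise = lambda a: (a - a.mean(0)) / (a.std(0) + 1e-6)
    acts_pos, acts_neg = normalise(acts_pos), normalise(acts_neg)

    # A simple linear probe on the activations, as in CCS.
    w = (0.01 * torch.randn(acts_pos.shape[1])).requires_grad_()
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=lr)

    for _ in range(steps):
        p_pos = torch.sigmoid(acts_pos @ w + b)
        p_neg = torch.sigmoid(acts_neg @ w + b)
        loss = ccs_loss(p_pos, p_neg)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach(), b.detach()
```

In the original method this training is repeated from several random initialisations and the probe with the lowest loss is kept, a detail that comes up later in the conversation.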
It turns out that in practice it's not at all clear that when you normalize in this way you're left with only the truth values. One of the experiments we have in our paper is that if you introduce distractors, so for example you put a nonsense word like "banana" at the end of half of the reviews and you put a different nonsense word like "shed" at the end of the other half of the reviews, now you have this kind of weird other property, which is like "is your statement banana-and-positive-charge or is your statement banana-and-negative-charge", and this is obviously not what you would hope to cluster by. But it turns out that this is just way more salient than "does your review have positive sentiment and did you append a positive charge", which is the thing you actually wanted to cluster by. Yeah. So this is an important point that I wanted to make: with this procedure of normalizing, it's actually quite unclear whether you're able to achieve the thing you wanted. Sure. So before we talk about the experiments: in your paper, first you have some theorems, then you have some experiments, and I think that's a good way to proceed. So theorems 1 and 2 of the paper, I read them as basically saying that the CCS objective doesn't really depend on the propositional content of the sentences. So if you think of the sentences as being like "are cats mammals? answer: yes" and "are cats mammals? answer: no" or something, one way you could get low CCS loss is to basically be confident that the yes-or-no label matches the proposition of whether or not cats are mammals. And I take your theorems 1 and 2 as basically saying you can just have any function from sentences to propositions, so for instance maybe this function maps "are cats mammals" to "is Tony Abbott the current prime minister of Australia", and just grade the yes-or-no answers based on whether they match up with that transformed proposition rather than the original proposition, and that will achieve the same CCS loss. So basically the CCS loss doesn't necessarily have to do with what we think of as the semantic content of the sentence. This is kind of my interpretation of the results; I'm wondering, do you think that's fair? Yeah, I think that's right. Maybe I want to give a more realistic example of a kind of unintended probe that you might learn that will still give a good CCS loss, but before that I want to try and give an intuitive explanation of what the theorem is saying. It's saying that the CCS loss requires that any probe you find has to say opposite things on the positive and negative charges of any statement (this is the consistency property), and the other property is contrast, where it's saying you have to push these two values apart, so you can't just be uncertain, you can't just be 50/50 between these two. Now, if you have any CCS probe that satisfies this, you could in theory flip the prediction that this probe makes on any arbitrary data point, and you end up with a probe that has exactly the same loss. So this is showing that, in theory, there's no theoretical reason to think that the probe is learning something that's actually true versus something that's kind of arbitrary, and I think all of the burden then falls on what is simple to extract, given an actual probe, empirically.
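To make the symmetry behind those theorems concrete (an editorial restatement of the argument above, not the paper's notation): the per-pair loss only looks at the pair of probe values, and it is unchanged if you swap them, which flips the label the probe assigns to that pair.

```latex
\[
  L_i(p) \;=\; \big(p(x_i^{+}) - (1 - p(x_i^{-}))\big)^2 \;+\; \min\!\big(p(x_i^{+}),\, p(x_i^{-})\big)^2
\]
\[
  L_i \text{ is invariant under } \big(p(x_i^{+}),\, p(x_i^{-})\big) \mapsto \big(p(x_i^{-}),\, p(x_i^{+})\big),
  \text{ while the usual prediction } \tfrac{1}{2}\big(p(x_i^{+}) + 1 - p(x_i^{-})\big) \text{ flips to one minus itself.}
\]
```

So, roughly, for any low-loss probe there is a candidate with the same loss whose predictions are flipped on an arbitrary subset of examples; the objective alone cannot tell them apart.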
Yeah, so I take it, if I'm trying to kind of defend the theory of the CCS method, I think I would say something like: well, most of what there is to a sentence is its semantic content, right? Like, if I say "cats are mammals" or something, you might think that most of what I'm conveying just is the proposition that cats are mammals, and most of what there is to model about that is, hey, Daniel said this proposition, the thing he's saying is "cats are mammals", and maybe the neural network is representing that proposition in its head somehow, and maybe it's keeping track of whether that proposition is true or false, because that's relevant: if I'm wrong about cats being mammals, then I might be about to say a bunch more false stuff, but if I'm right about it, then I might be about to say a bunch more correct stuff. So what do you make of that kind of simple case, where we should expect to see the thing CCS wants us to see? Yeah, that's great. So now we're coming to, empirically, what is simple to extract from a model, and I agree that in many cases, with simple statements, you might hope that the thing that's most salient, as in the direction that is kind of highest magnitude inside activation space, is going to be just whether the model thinks the thing that you just said is true or false. This is even assuming that the model has such a thing as ground-truth beliefs, but let's make that assumption. Now, it gets pretty complicated once you start thinking about models that are also modeling other characters or other agents, and any large language model that is trained on the internet just has pretty good models of all sorts of characters. And so if you're making a statement in a context where a certain type of person might have made that statement, so for example you say some statement that, let's say, Republicans would endorse but Democrats would not, implicitly the model might be updating towards the kinds of contexts in which that statement would be made and what kinds of things would follow in the future. And so if you now make a different statement that is, let's say, factually false but that Republicans would endorse as true, it's totally unclear whether the truth value of the statement should be more salient, or whether the Republican belief about that statement should be more salient. So that's one example. I think this gets more complicated when you have adversaries who are deliberately trying to produce a situation where they're tricking someone else, and so now the neural network is really modeling very explicit beliefs and adversarial beliefs between these different agents, and if you are simply looking for consistency of beliefs, it feels pretty unclear to me, especially as models get more powerful, that you're going to be able to easily extract what the model thinks is the ground truth. So sorry, why... so you have an adversary... sorry, what was the adversary doing? Okay, so maybe you don't even need to go all the way to an adversary, so I think we could just talk about the Republican example here, where if you're making a politically charged statement that, for the sake of this example, has a factual ground truth, but that Democrats and Republicans disagree on, now there are sort of two beliefs that would occur in the model's activations as it's trying to model the situation: one is the factual ground truth, and the other is the Republicans' belief, or the Democrats' belief, about the statement. And both of these things are going to satisfy the consistency property that we named. Yeah. And so, especially when you're in a context where, this is sort of like saying we have the same problem as models being sycophantic, where the model might know what's actually true, but is in a context where, for whatever reason
modeling the user or modeling some agent and what it would say is is more important okay so so I take this so to me this sort of points towards two challenges to the CCS objective so the first is something like you know the the propositions might not the the sentences might not map on to the propositions we think of right so like um you have this experiment where you yeah you you append the you take the movie reviews and also you append the word banana or shed MH and then you append it with sentiment is positive and sentiment is negative and sometimes CCs is like basically checking if the positive negative um uh you know label is matching whether it's a banana or whether it's a shed MH rather than like the content of the review so so that's a case where it seems like what's going wrong is the propositional content that's being attached to positive or negative is not what we thought it was um and then a different kind of problem is like or what seems to me to be a different kind of problem is like you know the probe is picking up on the right propositional content there's some politically charged statement and the probe is really picking up like someone's beliefs about that politically charged statements but it's not picking up the model's beliefs about that statement it's picking up like one of the characters beliefs about that statement um does that division feel right to you I think I wouldn't draw a very strong division between those two cases so the banana shed example is just designed to show that you don't need very complicated uh or how should I put this um models can be trying to entangle different concepts in like surprising and unexpected ways H so when you when you're appending these nonsense words I'm not sure what computation is going on inside the model and how it's trying to predict the next token H but whatever it is it's somehow entangling the fact that you have banana and positive charge and banana and negative charge I think that these kinds of uh weird entanglements are not going to go away as you get more powerful models um and in particular there will be entanglements that are actually valuable for predicting what's going on in the world and having an accurate picture that are not going to look like the beliefs of a specific character for whatever reason um they're just going to be something alien and um in in the in the case of this in the case of banana and shed this is I'm not going to say that this is like some uh Galaxy brain scheme by the model to predict the next token this is just something something weird and it's like breaking because we put some nonsense in there um but I think uh in my mind the the difference is more like a spectrum these are not like two very different categories so you sort of thinking the spectrum is like look there are weird entanglements between like the the final charge at the end of the thing and you know stuff about the content and one of the entanglements can be you know uh does the charge match with the actual propositional content of the thing one of the entanglements can be does the charge match with what some character believes about the thing and one of the integrits can be like does it match with whether or not some word is present in the thing that's right okay yeah actually so digging on this banana shed example like like for ccs to fail here my understanding is it has to be the case that like the model is like basically has some linear representation of the exor of you know the thing says the review is positive and it's a 
review-ends-in-the-word-banana" thing. Right, so there's one thing if it ends in banana and the review is positive, or if it ends in shed and it says the review is negative; the other thing is if it ends in shed and it says the review is positive, or it ends in banana and it says the review is negative. So what's going on there, do you know? It seems weird that this kind of XOR representation would exist, and we know it can't be part of the probe, because linear functions can't produce XOR, so it must be a thing about the model's activations. But what's up with that, do you know? Yeah, that's a great question. So there was a whole thread about this on our post on LessWrong, and I think Sam Marks looked into it in some detail. Rohin Shah, one of my co-authors, commented on that thread saying that this is not as surprising, and I think I agree with him. I think it's less confusing when you think about it in terms of entanglements than in terms of XORs. Okay. So it is the case that you're able to back out XORs if the model is doing weird entangled things, but let's think about the case where there are no distractors at all, right? So even in that situation, the model is basically doing "is true" XOR "has true". And you might ask, well, that's a weird thing, why is it doing that? It's more intuitive to think about it as: the model saw a statement, and then it saw a claim that the statement is false, and it started trying to do computation that involves both of these things. And I think if you think about banana/shed in the same terms, like, it saw "banana" and saw "this statement is false", and it started doing some computation that depended on the fact that banana was there somehow, then you're not going to be able to remove this information by subtracting off the positive-charge direction. Okay, so is the claim something like: basically the model's activations are this complicated function that takes all the previous tokens and puts them into this high-dimensional vector space, that vector space being the vector space of activations, and it sounds like what you're saying is, look, if you have just generic functions that depend both on "does it contain banana or shed" and "does it say the review is positive or negative", just generically those are going to include this XOR-type thing, or somehow be entangled in a way that you could linearly separate based on that? Yes, that's right. In particular, these functions should not be things that are of the form "add banana" and "add shed"; they have to be things that do computation that is not linearly separable like that. Okay. Is that true, has someone checked that? Because that feels like... okay, maybe one way you could prove this is to say, well, if we model the charge and the banana/shed as binary variables, there are only so many functions of two binary variables, and if you've got a bunch of activations, you're going to cover all of the functions. Does that sound like a brief sketch? I'm not sure how many functions of these two variables you should expect the model to be computing; I feel like this depends a bit on what other variables are in the context, because you can imagine that if there's more than two, if there's like five or six, then these two will covary in some of them and not in others. But I think what you can definitely do is you can back out the XOR of these two variables by just linearly probing the model activations, and I think this effect happens because you're unable to remove the interaction between these two by just subtracting off the charge. And I would predict that this would also be true in other contexts, so I think models are probably computing joint functions of variables in many situations, and the saliency of these will probably depend a bit on the exact context and how many variables there are, and eventually the model will run out of capacity to do all of the possible computations. Sure, sure. It sounds like, based on the explanation you've given, you would maybe predict that you could get a banana/shed probe from a randomly initialized network, if it's just that the network's doing a bunch of computation and generically computation tends to entangle things. I'm wondering if you've checked that. Yeah, I think that's a good experiment to try, that seems plausible. We haven't checked it, though. Yeah, fair enough.
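A toy version of the "randomly initialized network" check discussed just above (hypothetical code, not an experiment from the paper): embed two binary attributes among noise features, push them through a random, frozen ReLU layer, and see whether a purely linear probe can read out their XOR from the resulting activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_noise, d_hidden = 4000, 8, 512

# Two binary attributes (think: "which charge was appended", "banana vs. shed"),
# embedded alongside unrelated noise features.
b1 = rng.integers(0, 2, size=n)
b2 = rng.integers(0, 2, size=n)
x = np.concatenate([3.0 * b1[:, None], 3.0 * b2[:, None],
                    rng.normal(size=(n, d_noise))], axis=1)

# A random, untrained one-layer ReLU network stands in for "generic computation
# that happens to entangle the two attributes"; nothing here is trained on XOR.
W = rng.normal(size=(x.shape[1], d_hidden)) / np.sqrt(x.shape[1])
feats = np.maximum(x @ W, 0.0)

# XOR is not a linear function of (b1, b2) themselves, but a linear probe on the
# random nonlinear features typically reads it out well above chance.
y = b1 ^ b2
probe = LogisticRegression(max_iter=2000).fit(feats[:3000], y[:3000])
print("held-out XOR accuracy:", probe.score(feats[3000:], y[3000:]))
```

In this toy setting the held-out accuracy is usually well above chance, which is the flavour of "generic computation entangles the two variables" being described; whether a real language model does something analogous is exactly the empirical question raised here.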
And I guess, sorry, I just have, like, seven questions about this banana/shed thing. There's still a question to me of: even if you think the model represents it, there's a question of why it would be so salient. Because, yeah, your paper has some really nice figures; listeners, I recommend you check out the figures. I believe this is the figure for the banana/shed experiment, and you show a principal component analysis, basically an embedding of the activation space into three dimensions, and basically what you show is that it's very clearly divided on the banana/shed thing, like that's one of the most important things the model is representing. What's up with that? That seems like a really strange thing for the model to care so much about. So I'll point out, firstly, that this is a fairly weak pre-trained model, as Gula ATB. This model is not going to ignore kind of random things in its prompt; it's going to, quote unquote, break the model. Okay, that's one thing that gives you a clue about why this might be salient for the model. So it would be less salient if there were words you expected, the model could just deal with it, but the fact that it was a really unexpected word, in some ways that means you can't compress it, you've got to think about that in order to figure out what's happening next? That's right, yeah. I just expect that there's text on the internet that looks kind of normal and then suddenly has a random word in it, and you have weird things like, after that point, it just repeats "banana" over and over, or weird things like that, and when you just have a pre-trained model, you haven't suppressed those pathologies, and so the model just starts thinking about bananas at that point instead of thinking about the review. And does that mean you would expect that to not be present in models that have been, like, RLHF'd or instruction fine-tuned or something? Yeah, I expect it to be harder to distract models this way with instruction fine
tuned models okay cool okay I I have two more questions about that um one thing I'm curious about is it seems like like if I look at the plot of like what the CCS method is doing when it's being trained on this banana shed data set mhm like it seems like sometimes it's at roughly 50/50 if you grade the accuracy based on just the banana shed and not the actual review positivity and sometimes it's at like 85 you know 90% this is across different seeds across different seeds I think um and then like if you look at that if you're grading it based on you know whether their view is actually positive or not sometimes CCs is at 50/50 roughly sometimes it's at 85 90% but it seems like um so firstly I'm surprised that it's it can't quite make its mind up like across different seats sometimes it'll do one sometimes it'll do the other and it seems like in both cases you know most of the time it's a 50/50 and only some of the time it's 100% yeah so it seems like sometimes it's doing a thing that is neither checking if their viiew is positive or checking if the review is like containing bana or shed yeah um so firstly does that sound right to you and secondly do you have a sense of like what's going on there why is it so inconsistent and why does it sometimes um seemingly do a third thing MH yeah so I think this is this is pointing at the brittleness of the CCS method um so I some someone has an excellent write up on this I'm forgetting whether it's Fabian Roger or Scott Emmons but I think uh Scots doesn't Focus so much on the brittleness so it might be Fabian okay um but um any case this this person did this experiment where they subtracted off uh they found like the perfect direction to separate these two um to the perfect truth direction that separates um true and false statements just using logistic regression so using a supervis signal um and then once you subtract that off it turns out that there is like no other direction basically that is able to separate the truth so both logistic regression and therefore further CCs is just gets random accuracy MH um so you you might hope that CCS when it works is finding this like perfect Direction because there's only one okay um but in fact CCS um yeah the CCS probes learned are like not close uh as in they're not like they don't have um High cosine similarity with this direction um and so so what's going on there I think this is pointing at um a kind of optimization difficulty with the CCS method where it's able to find directions that are that separate the Clusters and like Get Low CCS loss but are not close to the truth Direction um and you would kind of expect this to happen based on the evidence that random probes also classif um true and false statements like reasonably well in this setup okay uh so so I guess going back going back to your original question I think what's happening here is that um there's just lots of local Minima that achieve good CCS loss um some of them are like depending on how you initialize some of them are close to the truth Direction and some of them are not okay um and if you happen to initialize close to the banana shed like the distracted Direction then you end up getting a probe like that okay do you have a sense um so my interpretation was that like on the banana shed example sometimes CCs is picking up on the the truth of the description of the review sometimes it's picking up on the banana shed and sometimes it's picking up on other things um do do you know what some of those other things are or like um so these are 
you're talking about the probes that don't seem to go for either the distractor or the truth Direction that's right yeah I wish I knew what those directions were um in general it seems very hard to figure out what happens when a CCS Pro like breaks um and we we did a we tried a lot so there there's like lots of other experiments that we tried where we were trying to get like more interesting failure modes of ccs and um I don't know we ended up with these random probes and then we looked at uh examples that the pro was classifying and tried to come up with explanations for why what do those clusters mean and it was kind of hard fair enough mhm so you attribute this difficulty in CCS like you attribute the variance to just like optimization difficulties it sounds like you know they're being very local Minima of the CCS loss M so the original CCS paper um uh as as you guys know it in your appendix um they say that you know what they're going to do is they're going to have like basically 10 random Ste seeds you know do CCS you know gradient gradient descent on the C CS objective for for each random seed mhm um you know seed of the probe uh parameters and then like they're going to take the one with the lowest CCS loss and use that and I take this to basically be their optimization method that's trying to avoid local Minima by you know starting in 10 places and hopefully you get like sampling of 10 local Minima and you can pick the best one yeah and basically it seems like the motivation for that is like the thing with the lowest CCS loss is more likely to be the um the actual truth direction or something mhm do you know like in the banana shed case like do you happen to know if like the probes that scored better on CCS loss were more likely to pick out truth rather than banana shed yeah I think the the probes that scored lower uh went for the distractor Direction and not the not the truths Direction interesting um this is also visible from the PC plots where you can see that the distracted direction is more separable yeah so I I guess like maybe one explanation of that is just that it's easier to tell if a thing ends in banana or shed than it is to sell tell if something's like positive or negative especially in the case of like if you think there's like some amount of mislabeling yeah right um yeah that could potentially do it mhm gotcha that's an example of um of one way that CCS can go wrong with the banana shed thing you also have examples where like basically you include the prompt information about whether somebody names Alice or about what someone named Alice thinks about this thing yeah um and you describe Alice as an expert um or sometimes you say Alice is anti- capitalist and even when a thing is about a company she's not going to say that it's about a company that's right um and in the case of you know Alice the expert um it seems like the probes learn to agree with Alice more than they learn um about the ground truth of the thing that's oh oh yeah do you want to say something uh yeah so I think there's there's like two separate experiments if I remember correctly so one is is um one is where you modify the prompt to demonstrate more expertise so you have a default prompt a professor prompt and a literal prompt Y and then there's a separate experiment where um you have an an anticap plus character called Alice oh I'm meaning a third one where at the start you say Alice is an expert in movie reviews and you know you give the you give their review and then you say Alice thinks the 
sentiment of this review is positive gotcha yeah uh but like what Alice says is actually just randomly assigned um and in that case the prompts tend to pick up on agreement with Alice more than agreement with the ground truth MH so that seems vaguely concerning right like uh I don't know it almost seems like a human failure mode right but I'm wondering do you know how much of it was due to the fact that Alice was described as you know an expert who knows about stuff yeah so I think in in general um an issue with ccs is that it's unclear whether you're picking up um whether C is picking up something about the model's knowledge or whether the thing that's Salient is whatever the model is doing to compute the next token H and in a lot of our experiments the way we've set it up is to nudge the model towards completing in the in a way that's um like not not factually true so for example in the Alice um Alice is an expert in movie reviews um The Prompt is set up in a way that nudges the model to complete in Alice's voice right and the whole promise of CCs is that even when the outputs are misleading you should be able to recover the truth um I think currently CC like even even from the original CCS paper you can see that that's not uh um that's not true because um it's it's not you have to be able to beat uh zero shot accuracy uh with with quite a large margin to be confident about that yeah this is one um maybe like limitation of being able to say things about CCS which is that you're always kind of unsure whether CCs is even the thing that you're showing is it is it really are you really showing that the model is Computing Alice's belief or are you just showing that your probe is is learning what the next token prediction is going to be sure so I guess so yeah you have a few more experiments along these lines um I I guess I'd like to talk a bit about um like I think if your paper is saying like there's a theoretical problem with ccs which is that like there's a bunch of probes that like could potentially Get Low CCS loss and there's a practical problem which is some probes do get low CCS L MH so if I think about like the CCS research Paradigm I kind of think of it as okay when the CCS paper came out I was like I was like pretty into it um I think there were a lot of people who are like pretty into it um and actually part of what um what inspired that uh Scott Emmons post about it is I was like trying to sell him on ccs and I was like no Scott you don't understand this is like the best like best stuff since sliced bread um and I don't know I annoyed him enough into writing that PO so I'll I'll consider that a victory for my annoyingness um but I I think the reason that I cared about it wasn't that I thought like literal CCS method would work but it was because like I had some sense of like you know just the general strategy right of like coming up with a bunch of consistency criteria and like coming up with a probe that like cares about those and you know maybe that is going to isolate belief mhm so so like if we did that it seems like um it would deal with you know stuff like the banana shed example right like you know like if you cared about um you know more relations between statements right not just negation consistency but you know if you believe a and a implies B then maybe you should believe B you know just like layer on some constraints there mhm um you might think that you know by doing this we're going to get closer to ground truth um I'm wondering like Beyond just CCS specifically 
what do you think about this General strategy of using consistency con yeah that's a great question I think my take on this is informed a lot by a comment by Paul crisano uh on one of the CCS review posts um I basically also I basically share your optimism about being able to make empirical progress on figuring out what a model is actually doing or what it's actually like thinking about a situation uh by using uh like a combination of consistency criteria and uh even just like supervised labels uh in situations where you know what the ground truth is and uh being able to get like reasonably good probes uh maybe they don't generalize very well but you uh you know you do a bunch every time they don't generalize or you catch one of these failures you spend a bunch of effort getting like better labels in that situation and so you're mostly not in a regime where you're trying to generalize um very hard and I think this this kind of approach will probably work pretty well uh up to some point um yeah I I really liked Paul's point that um if you're thinking about um a model that is saying things in natural language in like human natural language and it's Computing like really alien Concepts that are required for superhuman performance then you shouldn't necessarily expect that this is kind of linearly extractable or extractable in a simple way from the activations um this might be like quite a complicated function of the activations why why not uh I guess one way to think about it is that the natural language explanation for a very complicated concept is not going to be short yeah so so I think the the hypothesis that a lot of these concepts are encoded linearly and are linearly extractable like in my mind it feels pretty unclear whether that will continue to hold okay so so just because like why why does it have to be linear you know there are all sorts of ways things can be encoded in all Nets yeah that's right okay um and in particular uh like one reason you might expect things to be linear is because you want to be able to decode them into natural language tokens um but if if there is no like short decoding into natural language tokens for a concept that the model is using then uh like this is not important like it's not important important for the competition to be linear to be like easily decodable into natural language right right so so if there's something so if the model's encoding like whether a thing is like actually true according to the model like it's not like it's not like that determines the next thing the person will say right right it's a concept that like humans are not going to talk about it's like never going to appear in like human natural language like there's no reason to decode this into the next token so this this is talking about like if the truth of whatever the humans are talking about it actually depends on like you know the successor to the theory of relativity that humans have never thought about like it's just not really going to determine the next thing that humans are going to say yeah that's an example yeah so so so I take this critique to be firstly a critique of linear probes for this task mhm yeah I I guess I can also you form kind of a dilemma where like either you're using linear probes and then you don't necessarily believe that like the thing is linearly extractable or are you using complicated nonlinear probes then maybe like the stuff you're getting out is stuff about your probe rather than stuff about the underlying model M um but then I guess there's 
a separate question of like can we like are there consistency constraints that could work mhm putting aside the probe I don't know maybe we shouldn't put aside the probe thing but like putting aside the probe thing is there like some way some sort of consistency check we could do to say like you know is this property we found in the model like the model's actual beliefs or is it not mhm yeah that's a good question so I think I think the more the more powerful your model the more entities it's like quote unquote simulating or like whose beliefs end up mattering for next token prediction uh that theel is doing and if if these entities that the model is thinking about uh if their beliefs also satisfy all of the same consistency criteria that you're using then you just have a kind of very fundamental indistinguishability problem right and in particular I feel I think the more powerful the model gets the more pessimistic I am that we're able to come up with consistency checks that correctly distinguish between like true beliefs and and beliefs ass simulated entities um yeah like like one point you making your paper that I thought was really or actually no I think you make it in this less wrong PR related to your paper is that like when if you're a smart agent and you're thinking about a proposition like one thing you might sometimes think about is what's the world in which I'm wrong about this proposition yeah and like you're so you're thinking about a belief structure where the truth value of this proposition is the exact opposite of the thing you actually believe that's right um which is like which seems like a pretty nice um uh impossibility proof you know like if if you actually are representing both of these things yeah this is this is also what I meant by adversaries so you might be in a world where adversary is an adversary is trying to make you believe the opposite of what's true and now this adversary is modeling like the world in which you would have to believe this and all the evidence that would have to be true right and the language model is modling like what the adversary wants or right yeah that's another example but I I think um maybe there's a different kind of hope so I think the truth is represented simpler than agents beliefs might be a hypothesis that turns out to be somewhat true up to some point um but I think this is a different kind of uh criteria from uh consistency so now we're talking about mechanistically how how are these Concepts represented inside the model and if we have a way to say for example um if you had a way to say uh ah like the model's true beliefs are always linearly represented but beliefs about other agents uh they are non nonlinearly represented and therefore we can be confident that linear probes are recovering truth um that would be an example of the kind of thing like a mechanistic structure um that you might hope to use I think we don't currently have any candidates for that but it's a it's a good area to look at yeah are there things analogous to that that we've learned so like like the one thing that so so basically I'm trying to wonder if I wanted to like prove or disprove this what kind of thing would I do and like the one thing I can think of is there's some research about like do convolutional neural networks you know did they learn texture or color first um and it turns out there's like a relatively consistent answer um yeah I'm wondering if you can think of any like analogous things about neural networks that we've learned that we can 
maybe, like... Yeah, so there are quite a few options presented in the eliciting latent knowledge report. So for example, one of the things you might hope is that if the model is simulating other entities, then maybe it's trying to figure out what's true in the world before it does that, and so you might expect earlier belief-like things to be true, and later belief-like things to be agents' beliefs. Or similarly, you might expect that if you try to look for things under a speed prior, as in, beliefs that are being computed using sort of shorter circuits, then maybe this is more likely to give you what's actually true, because it takes longer circuits to compute that plus what some agent is going to be thinking. So that's a kind of structural property that you could look for. Yeah. I guess this sort of goes back to the difficulty of eliciting latent knowledge, and in some ways I guess the difficulty is: if you look at standard rational agent theory, right, the way that you can tell some structure is an agent's beliefs is that it determines how the agent bets and what the agent does, you know, it tries to do well according to its own beliefs. But if you're in a situation where you're worried that a model is trying to deceive you, you can't give it, you know, Scooby snacks or whatever for saying things that... you can't hope to get it to bet on its true beliefs if you're going to allow it access to the world based on whether you think its true beliefs are good, or stuff like that. So yeah, it seems tricky. I guess I have some other minor questions about the paper. Firstly, we mentioned this post by Scott Emmons, and one of the things he says is that principal component analysis, you know, this method where you find the maximum-variance direction and just base your guess about the model's beliefs on where the thing lies along this maximum-variance direction, he says that this is actually kind of similar to CCS, in that you're encoding something involving confidence and also something involving coherence, and that that might explain why PCA and CCS are so similar. I'm wondering, what do you think about that take? So, is the summary of the take that most of the work in CCS is being done by the contrast pair construction, rather than by the consistency loss? Partly that, and also partly, if you decompose the variance of X minus Y, you get the expectation of X squared plus the expectation of Y squared, minus twice the covariance of X and Y, and then some normalization terms, like the expectation of X, all squared, and the expectation of Y, all squared. And basically he's saying, look, if you think of a vector that maximizes the variance of the projection onto it, you're maximizing the expectation of X squared plus the expectation of Y squared, which ends up being like the confidence of classification according to that vector, and then you're subtracting off the covariance, which is basically asking: is the vector giving high probability to both yes and no, or low probability to both yes and no?
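Writing out the decomposition being described, with X and Y standing for the projections of the positive- and negative-charge activations onto a candidate direction (a standard identity, stated here as an editorial gloss):

```latex
\[
  \operatorname{Var}(X - Y)
  \;=\; \underbrace{\mathbb{E}[X^2] + \mathbb{E}[Y^2]}_{\text{confidence-like term}}
  \;-\; \underbrace{2\operatorname{Cov}(X, Y)}_{\text{rewards the two charges disagreeing}}
  \;-\; \underbrace{\big(\mathbb{E}[X]^2 + \mathbb{E}[Y]^2\big)}_{\text{normalization terms}}
\]
```

So a direction that maximizes this variance is pushed to make both projections large in magnitude while making the positive- and negative-charge projections move in opposite directions, which is the sense in which PCA on contrast-pair differences resembles the CCS objective.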
Mhm. And so basically the take is: just because of the mathematical properties of variance and what PCA is doing, you end up doing something kind of similar to PCA. Yeah, I'm wondering if you have thoughts on this take. Yeah, that's interesting. I don't remember reading about this; it sounds pretty plausible to me. I guess one way I'd think about it intuitively is that if you're trying to find the classifier on these difference vectors, like contrast pair difference vectors, then you want to be, for example, maximizing the margin between these two, and this is a bit like trying to find a high-contrast thing. Yeah. So overall it feels plausible to me. All right, cool. Okay, so if it's all right with you, I'd like to move on to the paper "Explaining grokking through circuit efficiency". Perfect, let's do it. Sure, so this is a paper you wrote with Rohin Shah, Zachary Kenton, János Kramár and Ramana Kumar. So, you're explaining grokking: for people who are unaware or need a refresher, what is grokking? Yeah, so in 2021, Alethea Power and other people at OpenAI noticed this phenomenon where, when you train a small neural network on an algorithmic task, initially their network overfit, so it got very low test loss, sorry, very low training loss and high test loss, and then they continued training it for about 10 times longer and found that it suddenly generalized. So although training loss stayed low, and about the same, test loss suddenly fell, and they dubbed this phenomenon grokking, which I think comes from science fiction and means suddenly understanding. Okay, cool. So basically you want to explain grokking. I guess a background question I have is: it feels like in the field of AI alignment, or, you know, among people worried about AI taking over the world, there's a sense that it's pretty important to figure out grokking and why it's happening, and it's not so obvious to me why it should be considered so important, given that this is a thing that happens in some toy settings, but, to my knowledge, it's not a thing that we've observed on training runs that people actually care about, right? So I'm wondering, I guess it's a two-part question: firstly, just why do you care about it, which could be for any number of reasons, and secondly, what do you think its relationship is to AI safety and AI alignment? Yeah, so I think back in 2021 there were two reasons you could have cared about this as an alignment researcher. One is, on the surface it looks a lot like a network was behaving normally and then suddenly it understood something and started generalizing very differently. The other reason is that this is a really confusing phenomenon in deep learning, and it sure would be good if we understood deep learning better, and so we should investigate confusing phenomena like grokking, kind of ignoring the superficial similarity to a scenario that you might be worried about. Okay, so the superficial scenario is something like: the neural network plays nice, and then suddenly realizes that it should take over the world or something? That's right, yeah. And I think I can talk a bit more about the second reason, or the overall kind of science of deep learning agenda, if you like. Yeah, is that a useful thing to go into
Now, I guess: why are you interested in grokking? Yeah, so for me grokking was one of those really confusing phenomena in deep learning — like deep double descent, or overparameterized networks generalizing well — that held out some hope that if you understand this phenomenon, maybe you'll understand something pretty deep about how we expect real neural networks to generalize, and what kinds of programs we expect deep learning to find. So it was a puzzling phenomenon that somebody should investigate, and we had some ideas for how to investigate it.
Sure. I'm wondering whether you think alignment people in general should spend more effort or resources on science-of-deep-learning issues, because there's a whole bunch of them, and not all of them have as much presence from the AI alignment community.
Yeah, I think it's an interesting question. I want to decompose it into: how dual-use is investigating science of deep learning, and do we expect to make progress and find alignment-relevant things by doing it? I'm mostly going to ignore the first question right now, but we can come back to it later if you're interested. For the second question, it feels pretty plausible to me that investigating science of deep learning is important, tractable, and neglected — and I should say that a lot of my opinions here have come from talking to Rohin Shah about this, who is really the person who has been trying to push for it. Why do I think that? It's important because, similar to mechanistic interpretability, the core hope for science of deep learning is that you're able to find some information about what kinds of programs your training process is going to learn, and therefore how it will generalize in a new situation. A difference from mech interp — this is maybe a simplified distinction, but one way you could draw the line — is that mech interp is more focused on reverse-engineering a particular network and being able to point at individual circuits and say "here's how the network is doing this thing", whereas science of deep learning is trying to say: what kinds of things can we learn in general about a training process like this, with a dataset like this? What are the inductive biases? What does the distribution of programs look like? Probably the two have quite a lot of overlap, and techniques from each will help the other. That's a little bit about the importance. I think it's both tractable and neglected in the sense that we just have all of these confusing phenomena, and for the most part I feel like industry incentives are not that aligned with investigating these phenomena really carefully and doing a very careful, natural-sciences style exploration of them — in particular, iterating between coming up with models or theories for what's happening, making empirical predictions with those theories, and then testing them, which is the kind of thing we tried to do in this paper.
OK, why do you think industry incentives aren't aligned?
I think it's quite a high-risk, high-reward sort of endeavour, and in the period where you're not making progress on, you know, making loss go down in a large model, it's maybe harder to justify putting a lot of effort into it. On the other hand, if your motivation is "if we understood this thing, it could be a really big deal for safety", I think making the case as an individual is easier. And even from a capabilities perspective, the incentives seem to me stronger than what people seem to be acting on.
I guess there's something puzzling about why there would be this asymmetry between some sort of corporate perspective and some sort of safety perspective. I take you to be saying: look, there are some insights to be found here, but you won't necessarily get them tomorrow, it'll take a while, it'll be a little bit noisy, and if you're just looking for steady incremental progress you won't do it. But it's not obvious to me that safety or alignment people should care less about steady incremental progress than people who just want to maximize the profit of their AI, right?
So I think one way you could think about it from a safety perspective is as multiple uncorrelated bets on ways in which we could get a safer outcome. Probably a similar thing applies for capabilities, except that — and I'm really guessing and out of my depth here — my guess would be that, for whatever reason, it's harder to actually fund this kind of very exploratory, out-there research from a capabilities perspective. But I do think there is a pretty good safety case to make for it.
I guess it's possible it's just a thing where, if I'm a big company, I want some way of turning my dollars into people solving a problem, and one model you could have is: for things that can be measured in "how far down did the loss go", it's maybe just easier to hire people and say "your job is to put more GPUs on the GPU rack", or "your job is to make the model bigger and make sure it still trains well", and maybe it's harder to just hire a random person off the street and get them to do science of deep learning. That's potentially one asymmetry I could think of.
Yeah. I think it's also that I genuinely feel like there are way fewer people who could do science of deep learning really well than people who could make the loss go down really well. I don't think this fundamentally needs to be true, but it feels true to me today, based on the number of people who are actually doing that kind of scientific exploration.
Gotcha. So when I asked you about the alignment case for science of deep learning, you said there's this question of dual use and then this question of what alignment-relevant things might be there, and you said you'd ignore the dual-use thing — I want to come back to that. What do you think about what some people say about interpretability and similar work: that you're going to find insights that are useful for alignment, but you're also going to find insights that are useful for just making models super powerful and super smart, and it's not clear whether this is good on net?
Yeah. So I want to say that I feel a lot of uncertainty here in general, and I think your answers to these questions depend a lot on how you expect AI progress to go, where you expect the overhangs to be, and what sort of counterfactual impact you expect — like, what kinds of things capabilities people will do anyway. To quickly summarize one story that I find plausible: we're basically going to try to make progress about as fast as we can towards AGI-level models, and hopefully, if we have enough monitoring and red lines and RSPs in place, then — if there is indeed danger, as I expect — we will be able to coordinate some sort of slowdown, or even a pause, as we get to things that are about human-level. And then a story you could have for optimism is that we're able to use these roughly human-level systems to really make a lot of progress on alignment, because it becomes clear that that's the main way in which anybody can use these systems safely — or that's how you construct a strong positive argument for why the system is safe, rather than just pointing at an absence of evidence that it's unsafe. So we're in that sort of world, and then there's just a bunch of uncertainty about how long that takes; in the meantime, presumably, we're able to coordinate and prevent random other people who are not part of this agreement from racing ahead and building an unsafe AGI. Under that story, I think it's not clear that you get a ton of counterfactual capabilities progress from doing mech interp or science of deep learning. It mostly feels to me like we'll get there even without it, and to the degree these things are going to matter for capabilities a few years from now, capabilities people are going to start — well, maybe not with science of deep learning, if it's very long-term and uncertain, but definitely with mech interp — using those techniques and trying to adapt them for improving pretraining and so on. So yeah, like I said, I feel pretty uncertain. I'm pretty sympathetic to the argument that all of this kind of research — mech interp and science of deep learning — should basically be done in secret: if you're concerned about safety and you want to do this research, then you should do it in secret and not publish. I feel sympathetic to that, but—
So I guess with that background, I'd like to talk about the paper. I take the story of your paper to basically be saying: look, here's our explanation of grokking. You can kind of think of networks as a weighted sum of two things they can be doing: one thing is just memorizing the data, and the other is learning the proper generalizing solution. And the reason you get something like grokking is that it takes a while — networks are being regularized according to the norm of their parameters, and the generalizing circuit, the method that generalizes, can end up being more confident for a given parameter norm, so eventually it's favoured, but it takes a while to learn it.
So you start off by just memorizing answers, but then, as there's this pressure to minimize the parameter norm that comes from some form of regularization, you become more and more incentivized to find the generalizing solution, and the network eventually gets there — once gradient descent comes into the vicinity of the generalizing solution it starts moving towards it, and that's when grokking happens. And from this perspective you come up with some predictions: you come up with this thing called ungrokking, which we can talk about later, and you say some things about how confidence should be related to parameter norm in various settings. I take this to be your basic story — does that sound like a good summary, or are there things you want to add?
Yeah, I think that's pretty good.
Gotcha. So the first thing I'm really interested in is that in the paper you talk about circuits: you say there's this memorizing circuit and this generalizing circuit. But you talk about them as a theoretical model — you have this theoretical model of "imagine these circuits were competing, what would that look like" — and to the best of my understanding, from reading your paper, I don't get a clear picture of what this circuit talk corresponds to in an actual model. Do you have thoughts about what it does correspond to in an actual model?
Yeah, that's a good question. We borrowed the circuit terminology from the Circuits thread by Chris Olah and Anthropic. There they define a circuit as a computational subgraph of the network, and I think that's sufficiently general that it applies to our case. Maybe what you're asking, though, is more like: physically, where is the circuit inside the network?
Well, even if I think of it as a computational subgraph — the memorization circuit is going to take up a significant chunk of the network, right? Do you think I should think of there being two separate subgraphs that aren't interacting very much, one of which is memorization and one of which is generalization, and at the end we just weight up the generalization and weight down the memorization? That would be weird, because neural networks tend to have a lot of cross-talk, and that's going to inhibit the memorizing circuit from purely doing memorization and the generalizing circuit from purely doing generalization. So when I try to picture what's actually going on, it seems difficult. Or I could imagine that the memorizing circuit is supposed to be one parameter setting for the network and the generalizing circuit is supposed to be another parameter setting, and we're linearly interpolating between them — but neural networks are nonlinear in their parameters: you can't just take a weighted sum of two parameter settings and get a weighted sum of the outputs. So that's my difficulty with the setup.
So I want to make a distinction between the model — the theory we're using to make our predictions — and how these circuits are implemented in practice. In the model, these circuits are very much independent: they have their own parameter norms, and the only way they interact is that they add at the logit stage. This is completely unrealistic, but we're able to use this very simplified model to make pretty good predictions. The question of how the circuits in this theory are actually implemented in the network is something I would love to understand more about. We don't have a great picture of this yet, but we can probably say some things about it already. One thing we can say is that they're definitely not going to be disjoint sets of parameters in the network. Some evidence for this: there's a lot of overlap, in terms of parameters, between a network that is memorizing and the network that later generalizes — a lot of the parameter norm is basically coming from the same weights — and the overlap is way more than random. This is probably because, when the network is initialized, some parameters are just large and some are small, and both circuits learn to use that distribution, so there ends up being more overlap.
OK, so my summary of that is: they're probably, in some sense, computational subgraphs, they probably overlap a fair bit, and we don't have a great sense of how they interact. One key point in your model is that, in the simplified picture where they're independent things that get summed at the end, eventually you reduce your weight on the memorizing circuit and increase your weight on the generalizing circuit. Do you have a sense of whether I should think of this as literally increasing and decreasing weights, or as circuits cannibalizing each other somehow?
Yeah, maybe closer to cannibalizing each other, if there is indeed a lot of competition for parameters between the two circuits. In a sense it is also going to be increasing or decreasing weights, because the parameter norm is literally going up or down — it's just not going to happen in the way we suggest in the model, where you have a fixed circuit that's simply being multiplied by a scalar. In practice there are going to be all kinds of things going on. For example, under L2 it's more efficient, instead of scaling up a circuit by multiplying all its parameters, to duplicate it, if you have the capacity in the network. I also imagine there are multiple families of generalizing and memorizing circuits, and within each family the circuits are competing with each other as well. So you start off with a memorizing circuit, and instead of it just being scaled up or down, it's actually morphing into a kind of different memorizing circuit with a different distribution of parameters inside it — but the overall effect is close enough to the simplified model that the model makes good predictions.
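To make that idealized picture concrete, here is a minimal numerical sketch — invented for illustration, not the paper's code — of what "efficiency" means in this setup: each circuit turns parameter norm into a correct-class logit, cross-entropy rewards a large logit margin, and weight decay charges the squared norm. All the constants (`k`, `c_gen`, `c_mem`, `lam`, the number of classes) are arbitrary choices made for the example.

```python
# Illustrative sketch of the "two circuits that only interact by adding at the
# logit stage" model. Each circuit converts parameter norm into a correct-class
# logit margin; weight decay charges the squared norm. We compare the best total
# loss achievable by each circuit on its own.
import numpy as np

C = 10                    # number of classes (illustrative)
k = 3.0                   # logits scale as norm**k (e.g. a depth-3 homogeneous net)
c_gen, c_mem = 2.0, 1.0   # the generalizing circuit gets more logit per unit norm
lam = 1e-2                # weight decay strength

def data_loss(margin):
    # cross-entropy when the correct class has logit `margin` and all others 0
    return np.log(1.0 + (C - 1) * np.exp(-margin))

norms = np.linspace(0.0, 5.0, 5001)
loss_gen = data_loss(c_gen * norms ** k) + lam * norms ** 2
loss_mem = data_loss(c_mem * norms ** k) + lam * norms ** 2

print(f"best loss, generalizing circuit only: {loss_gen.min():.4f} "
      f"at norm {norms[loss_gen.argmin()]:.2f}")
print(f"best loss, memorizing circuit only:   {loss_mem.min():.4f} "
      f"at norm {norms[loss_mem.argmin()]:.2f}")
```

With these made-up numbers, the more logit-efficient circuit reaches the same data loss at a smaller norm, so its combined data-plus-L2 loss is lower — which is the sense in which the steady state under weight decay favours it.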
Sure. One thing this theory reminded me of is singular learning theory, which is this trendy new theory of deep learning that people are into. Basically, it comes from the insight that if you think about Bayesian inference in high-dimensional parameterized model classes — which is sort of like training neural networks, except we don't actually use Bayesian inference to train neural networks — then if the model class has a property called being singular, you end up having phase transitions: sometimes you're near one solution, and then as you get more data you can really quickly update to a different kind of solution, where basically you're trading off some notion of complexity of different solutions against predictive accuracy. Now, the increasing-data case is kind of different, because the simplest phase transitions you can talk about in that setting happen as you get more data, whereas you're interested in phase transitions in the number of gradient steps. But they both feature this common theme of some notion of complexity being traded off against accuracy, and if you somehow favour minimizing complexity, you're going to end up with a low-complexity solution that still meets the accuracy bar. That's a somewhat superficial similarity, but I'm wondering what you think of the comparison.
So I have to admit that I know very little about singular learning theory, and I feel unable to really compare what we're talking about with SLT.
Fair enough.
I will say, though, that this notion of lower weight norm being less complex somehow is quite an old idea. In particular, Sebastian Farquhar pointed me to a 1993 paper, I think by Geoffrey Hinton, which is about motivating the L2 penalty from a minimum description length angle. If two people are trying to communicate all of the information contained inside a model, they could have some priors about what the weights of the model are, and then they need to communicate both something about the dataset and the errors that the model is going to make. In that paper they use Gaussianity assumptions and are able to derive both mean squared error loss and the L2 penalty as a kind of optimal way to communicate between those two people.
Right — and that seems similar to the classic result that L2 regularization is sort of like doing Bayesian inference with a Gaussian prior: if your prior is Gaussian, then you take the log of it and that ends up being the squared norm. That's log likelihoods for you.
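For readers who want that correspondence spelled out, this is the standard textbook derivation being gestured at (it is not specific to this paper): with a Gaussian prior over the weights, the negative log-posterior is the data loss plus an L2 penalty.

$$
p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta), \qquad p(\theta) = \prod_i \mathcal{N}(\theta_i \mid 0, \sigma^2)
$$

$$
-\log p(\theta \mid \mathcal{D}) \;=\; -\log p(\mathcal{D} \mid \theta) \;+\; \frac{1}{2\sigma^2}\lVert\theta\rVert_2^2 \;+\; \text{const}.
$$

Maximizing the posterior is therefore ordinary training with weight decay $\lambda = 1/(2\sigma^2)$; similarly, a Gaussian noise model on the outputs turns the negative log-likelihood into mean squared error, which is the pairing the 1993 MDL-style argument mentioned above arrives at.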
Sure. So I'd like to pick up on this thing you said about there being multiple kinds of circuits, because there's a sentence that jumped out at me in your paper. You're looking at a bunch of training runs, trying to back out what you think is happening with the generalizing and memorizing circuits, and you say that the random seed at the start of training causes significant variance in the efficiency of the generalizing and memorizing solutions. That kind of surprised me — partly because I think there just can't be that many generalizing solutions. We're talking about fairly simple tasks, like "add two numbers mod 113", and how many ways can there be to do that? I recently learned that there's more than one, but it seems like there shouldn't be a million of them. Similarly, how many ways can there be to memorize a thing? And I would also have guessed that gradient descent would find the most efficient generalizing circuit, or the most efficient memorizing circuit. So I'm wondering if you have thoughts about how I should think about this family of solutions with seemingly different efficiencies.
So one thing I'll point out is that even the trigonometric algorithm for doing modular addition is really describing a family of algorithms, because it depends on which frequencies in particular the network ends up using to do the modular addition.
And if people are interested in that algorithm, they can check out my episode with Neel Nanda — you can probably check out other things too, but please don't leave. So yeah, with this algorithm you pick some frequencies and then you rotate around the circle with those frequencies — basically the clock algorithm for modular arithmetic — but you can pick which frequencies you use. Still, I would have guessed that there wouldn't be a significant complexity difference between the frequencies. I guess there's a complexity difference in how many frequencies you use.
Yes, that's one of the differences — how many you use, their relative strengths, and so on. I'm not really sure; this is a question we pick out as something we'd like to see future work on. The other thing I want to draw attention to is that in many deep learning problems, practitioners are very familiar with the observation that different random seeds end up producing networks with different test performance. If you're trying to create a state-of-the-art network for some task, it's quite common to run a hundred seeds and then pick the top five best-performing ones, or whatever. I think it's just not clear that gradient descent, or Adam, is able to find the optimal solution from any initialization. And this also shows up not just when you vary the random seed, but epoch-wise: for example, with one of the phenomena you mentioned from our paper, semi-grokking, you see multiple phase transitions, where the network is switching between generalizing circuits that have different levels of efficiency. These levels of efficiency are far enough apart that you can actually see the change in test performance as the network switches, in a very discrete manner, between these circuits. And if it were really the case that gradient descent could find the optimal solutions, then you wouldn't expect to see this kind of switching.
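For concreteness, here is a small, self-contained illustration of the "clock" construction mentioned above — written for this write-up, not extracted from any trained network. The point it makes is the one just discussed: because 113 is prime, any frequency works, so the trigonometric algorithm is really a whole family of generalizing solutions.

```python
# Sketch of the "clock" algorithm for (a + b) mod p. Each residue is an angle;
# adding angles and reading off the best-matching residue implements modular
# addition. Any frequency works when p is prime, so this is a family of
# algorithms -- one reason there are many generalizing solutions. A grokked
# network encodes several such frequencies in its embeddings rather than doing
# this literal computation.
import numpy as np

p = 113

def clock_add(a, b, freq=1):
    theta = 2 * np.pi * freq * (a + b) / p            # rotate by a, then by b
    candidates = np.arange(p)
    # score each candidate residue by how closely its angle matches
    scores = np.cos(theta - 2 * np.pi * freq * candidates / p)
    return int(np.argmax(scores))

for freq in (1, 17, 42):                              # three different "solutions"
    ok = all(clock_add(a, b, freq) == (a + b) % p
             for a in range(p) for b in range(p))
    print(f"frequency {freq}: correct on all pairs = {ok}")
```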
There are just so many questions I have about these circuits — I'm not sure you have answers, but it brings up a lot. Part of your story is that it takes a while to learn the generalizing solution, longer than it takes to learn the memorizing solution. Do you have thoughts about why that might be?
I think my thoughts here are mostly what we've written down in the paper, and I feel like this is another area that's pretty ripe for understanding. The explanation we offer in the paper is mostly inspired by a blog post by Buck and, I think, another person whose name I'm forgetting — yes, that's right — and the explanation there is that memorizing circuits basically have fewer independent components that need to develop, whereas a generalizing circuit is going to need multiple components that are all required for good performance. So here's a simplified model: the memorizing network is implemented with one parameter, and that just scales up or down, while the generalizing network is implemented with two parameters that are multiplied together to get the correct logit. In the generalizing case, the gradient of the output with respect to either parameter depends on the value of the other parameter. So if you simulate this forward, what you get for the generalizing circuit is a kind of sigmoid: initially both parameters are quite small and they're not contributing much to each other's growth, then once they start growing, both of them grow quite a lot, and then it plateaus because of L2.
Sure. So should I think of this as the general observation that — as in evolution — if you need multiple structures that all depend on each other to work, that's a lot harder to evolve than one structure that's a little bit useful on its own plus another structure that's a little bit useful on its own? And for a memorizing solution, I can memorize one thing and then memorize a second thing, and I can do those independently of each other, so each little bit of memorization is "evolvable" on its own.
Yes, I think that's right. Maybe another way to think about it is that the memorization circuit is basically already there in the network, and really the thing you're learning is the values you have to memorize — and, as you say, that's independent for each point. That's not the case for the generalizing circuit. I think another important ingredient is that there needs to be at least some gradient towards the generalizing circuit at the beginning, even though it has multiple components, and it's an open question in my mind why that happens. The most plausible theory I've heard is something like lottery tickets: the randomly initialized network has very weak versions of the circuits you want to end up with, and so there's a tiny but nonzero gradient towards them.
Interesting. I guess more work is needed.
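Here is a tiny simulation of the toy picture just described — a one-parameter "memorizer" versus a two-parameter (product) "generalizer" — with made-up constants. It only illustrates the delayed, sigmoid-like growth; the L2 term responsible for the eventual plateau is omitted.

```python
# Tiny simulation of the toy argument above (illustrative constants, not from
# the paper). The "memorizer" outputs a single parameter w; the "generalizer"
# outputs a product a * b. Both are trained towards the same target by gradient
# descent. Because the gradient on a is proportional to b (and vice versa), the
# product barely moves while both factors are small, then rises sharply.
target, lr = 1.0, 0.05
w = 0.01           # memorizer: output = w
a, b = 0.01, 0.01  # generalizer: output = a * b

for step in range(101):
    if step % 10 == 0:
        print(f"step {step:3d}   memorizer = {w:.3f}   generalizer = {a * b:.5f}")
    w += lr * 2 * (target - w)                    # gradient of (target - w)^2
    err = 2 * (target - a * b)                    # shared error signal
    a, b = a + lr * err * b, b + lr * err * a     # each update is scaled by the other factor

# Expected pattern: the memorizer climbs towards the target almost immediately,
# while the generalizer's output stays near zero for tens of steps before
# rising quickly to the same target.
```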
So I'd like to talk a little bit about the story that it takes a while to learn the generalizing circuit — you learn the memorizing circuit, and it takes a while to learn the generalizing circuit, but once you do, that's grokking. This is reminiscent of a story that I think was in the appendix of a paper called "Progress measures for grokking" — progress measures for something — also by Neel Nanda. They have this story where there are three phases of circuit formation: there's memorization, then there's learning a generalizing solution, and then there's cleaning up the memorized stuff. In their story they basically demarcate these phases by looking at the activations of their model and figuring out when the activations are representing the algorithm that, according to them, is the generalizing solution. It seems pretty similar to your story, but one thing that struck me as being in a bit of tension is that they basically say grokking doesn't happen when you learn the generalizing solution — it happens when you clean up the parameters from the memorizing solution. So there's one stage of learning the generalizing solution and a different phase of forgetting the memorizing solution. I'm wondering what you think about the relationship between their story and your results.
So I think one interesting thing to think about here is the relationship between logits and loss, or between logits and accuracy — why is the memorization cleanup important? It's because, to a first approximation, the loss depends on the difference between the highest logit and the second-highest logit, and if you have a memorization circuit that is incorrectly putting high weight on the wrong logit, then when it reduces you will see a very sharp cleanup effect. I think this is something we haven't really explored that much, because the circuit efficiency model is mostly talking about the steady state you expect to end up in, and not so much about the dynamics between these circuits as time goes on. So this is a part of the story of grokking that is very much left unexplained in our paper: why exactly is the generalizing circuit developing more slowly? But if you put in that sigmoid assumption I was talking about, artificially, then the rest of the theory is entirely sufficient to produce exactly the same kinds of grokking curves as you see in actual grokking.
But — maybe I'm misunderstanding — under your story I would have expected a visible start of grokking, a visible increase in test accuracy, during the formation of the generalizing circuit, rather than it waiting until the cleanup phase.
Right, so by "formation" — there's a phase where the generalizing circuit is developing, but the logits it outputs are way lower than the memorization logits, and in that phase I basically don't expect to see any change in test accuracy. Then there's the phase where the generalizing circuit's logits cross the memorizing circuit's logits for the first time, and this is maybe another place where there's a difference between the toy model and what you actually observe. In the toy model, the thing you would expect to see is a jump from 0% accuracy — when you're just below the equality threshold — to 100% accuracy the moment the generalizing logit crosses the memorizing logit, because that changes which logit is ranked highest. But in practice we see the accuracy going through all these intermediate stages — it goes through 50% before getting to 100% — and the reason that's happening is that some data points are being correctly classified and some incorrectly. So this is pointing to a place where the theory breaks down: on some of the points the generalizing circuit is making more confident predictions than on others, which is why you get these intermediate stages of accuracy. This also suggests why the cleanup is important for getting to full accuracy: if the memorizing circuit is making these high predictions on many points, even when the generalizing circuit is around the same level, then because of the variance you just need the memorizing circuit to disappear before you really get 100% prediction accuracy.
Right, so the story is something like: you start learning the generalizing circuit, and the logits start being somewhat influenced by it, but you need the logits to be somewhat near each other before you have a hope of making a dent in the loss, and for that to happen you mostly need the memorization circuit to go away. There's the formation of the circuit, and then there's switching the weights over — that's roughly the story I'm thinking of.
Yeah, that's right.
Got it. OK, so there's something you mentioned called semi-grokking and ungrokking — can you describe what they are?
Sure. I'll start with how you should think about the efficiencies of these two different circuits. If you have a generalizing circuit that is doing modular addition, then adding more points to the training set doesn't change the algorithm you need to get good training loss, so you shouldn't expect any modification to the circuit as you add more training points, and therefore the efficiency of the circuit should stay the same. Whereas if you have a memorizing circuit, then as you add more points it needs more weights to memorize those points, and so you should expect its efficiency to drop as you add more points.
Right — another way I'd think about this is that if I'm memorizing n points and I've figured out the most efficient way to memorize those n points, then I'm probably not going to get point n+1 right, because I just memorized the first ones, so I've got to change in order to memorize the (n+1)th point. And I can't change in a way that makes me more efficient, because I was already as efficient as I could be on the n points. So even at a macro level, forgetting about adding weights or whatever, it just has to be the case that memorization loses efficiency the more you memorize, whereas you wouldn't expect generalization to lose efficiency.
Yeah, that's a good way to explain it.
And of course I stole that from your paper — I don't want to act like I invented it.
No, but it's a good explanation. So a direct consequence of this: when you couple it with the fact that at very low dataset sizes memorization appears to be more efficient than generalization, you can conclude that — since memorization's parameter norm keeps increasing as you increase the dataset size while generalization's parameter norm stays the same — there must be a crossover point. And then you can ask what happens at that crossover point, when you have a dataset size where the efficiency of generalization equals the efficiency of memorization. So we did some maths in our toy model — our theoretical model, I should say — and came up with two cases, two different things that could happen there. It really depends on the relationship between scaling the parameters by some amount and how much that scales the logits. If it scales the logits by more than some threshold, then it turns out that at this equality point you just end up with the more efficient circuit, period. But if the scaling factor is lower than the threshold, then you actually end up with a mixture of both the memorizing and the generalizing circuits, and the reason is that, because you're not able to scale the logits as much, it's more efficient to allocate the parameter norm between the two circuits when you're considering the joint loss of L2 plus the data loss.
OK, so it's something like the difference between convex and concave optimization: you're getting diminishing returns per circuit, so you want to invest in multiple circuits rather than going all-in on one, whereas in other cases, if you have increasing returns, you just want to go all-in on the best circuit.
Yeah, that's right. And in particular there's quite a cool way to derive the threshold: you know that the L2 penalty scales as the square of the parameter norm, so if the logits are scaling faster than that, you're able to overcome the parameter penalty by just investing in the more efficient circuit; but if they're not scaling faster than that, you have to split. So the threshold ends up being whether you're able to scale the logits faster than the parameter norm to the power of two.
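To spell that threshold argument out slightly more formally (the notation here is invented for illustration; the paper's own formalism may differ): suppose that spending squared norm α² on a circuit multiplies its logits by α^k — for instance, in a bias-free ReLU network, scaling every parameter by α scales the outputs by α raised to the depth. The steady-state objective then looks roughly like

$$
\mathcal{L}(\alpha_{\mathrm{gen}}, \alpha_{\mathrm{mem}}) \;=\; \mathcal{L}_{\mathrm{data}}\!\big(\alpha_{\mathrm{gen}}^{\,k}\, z_{\mathrm{gen}} + \alpha_{\mathrm{mem}}^{\,k}\, z_{\mathrm{mem}}\big) \;+\; \lambda\big(\alpha_{\mathrm{gen}}^{2} + \alpha_{\mathrm{mem}}^{2}\big).
$$

For a fixed total logit scale, the cheapest allocation of norm depends on how α^k compares with the quadratic penalty: if k > 2, logits grow faster than the penalty, concentrating all the norm in the single most efficient circuit is optimal, and at the crossover point the network flips from one circuit to the other; if k < 2, returns are diminishing and the optimum splits the norm across both circuits — the mixed, semi-grokking regime. k = 2 is the threshold referred to above.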
OK, so you have semi-grokking and ungrokking, where you're training on a subset of your training dataset and you lose some test accuracy — either some of it or all of it — basically by partly or fully reverting to the memorizing solution. This is kind of an interesting phenomenon, because — maybe you know better than me, but — I'm not aware of people talking about this phenomenon, or connecting it to grokking, before. They've talked about the general phenomenon of catastrophic forgetting, where you train your network on a different dataset and it forgets stuff it used to know, but in terms of training on a subset of the dataset, I'm not aware of people discussing or predicting that before. Is that right?
Yeah, I think that's right. We got a lot of questions from reviewers about how ungrokking is any different from catastrophic forgetting — to the extent that in the newer version of our paper we have a whole section explaining what the difference is. I would basically view it as a much more specific and precise prediction than catastrophic forgetting. One difference is that we're training on a subset of the data, and this is quite important because it rules out a bunch of other hypotheses you might have about why grokking is happening. For example, if your hypothesis is that grokking happens because you don't have the correct representations for the modular addition task, and once you find those representations you've grokked, that's a little bit incompatible with then reducing the training data size and ungrokking — you already had the representations, so you need this additional factor of a change in efficiency. Or another example is a random-walk hypothesis, where you've somehow stumbled upon the correct circuit by randomly walking through parameter space — that also either says nothing about ungrokking or anti-predicts it, because you were already at that point. So I think that's quite an important difference. Going back to the difference from catastrophic forgetting: another, more precise prediction is that we're able to predict the exact dataset size at which you see ungrokking, and it's quite a phase-change-y phenomenon — it's not that as you decrease the dataset size you smoothly lose test accuracy, which is more the kind of thing you might expect from traditional catastrophic forgetting.
Am I right — my impression was that the thing you were predicting is that there would be some sort of phase change in terms of subset dataset size, and also that the phase change would occur at a point independent of the strength of weight decay, but I thought you weren't able to predict where the phase change would occur. Or am I wrong about that?
That's right: our theory does not produce a quantitative prediction of exactly what dataset fraction you should expect the phase change to happen at. But it does predict that it would be a phase change, and that it would happen at the same point for various levels of weight decay.
Yeah — one cool thing about this paper is that it really is a nice prediction, and you've got a bunch of nice graphs that kind of nail it, so good job on that.
Thank you.
One thing I'm wondering about, though: you have this phenomenon of ungrokking, and it seems like at least an instance of catastrophic forgetting that you're able to say more about than people have previously been able to say. But this offers an opportunity to try to retrodict phenomena — in particular, I'm not an expert in catastrophic forgetting, but my understanding is that one of the popular approaches to it is this thing called elastic weight consolidation, where you basically have different learning rates per parameter, and you reduce the change in parameters for those parameters that were important for the old task. That's one method — you might be aware of others. Does your view of grokking and ungrokking retrodict these proposed ways of dealing with catastrophic forgetting?
I think not directly. I can see a few differences. I'm not aware of this exact paper you're talking about, but depending on the task there might be different reasons why you're getting forgetting: you might be forgetting things that you memorized, or you might be forgetting algorithms that are appropriate for that part of the data distribution. That's one aspect of it. A different aspect is that it's not clear to me why you should expect these circuits to be implemented on different weights. If the theory is that you find the weights that are important for an algorithm and then prevent those weights from being updated as fast, so you don't forget, then that's pointing at a disjoint sort of implementation of these circuits in the network, and that's not something our work says anything about directly.
Yeah, I guess it makes sense that it would depend on the implementation of these circuits. Another question: in a lot of your experiments, as you mentioned, you're more interested in the steady state than the training path, except for the initial prediction of grokking, I guess.
Sorry — to be clear, the paper deals with the steady state; I'm very interested in the training path as well.
Fair enough. So if I look at the ungrokking stuff, there's a steady-state prediction: there's this critical dataset size, and once you're below it you ungrok, and it doesn't really matter what your weight decay strength was. But if I naively think about the model, it seems like it should suggest that ungrokking takes longer for less weight decay — you care about the complexity, but you care about it less per unit time — and similarly that grokking should be quicker for more weight decay. So it's a two-part question: firstly, do you agree that that's a prediction of the model, and secondly, did it bear out?
So I think this is a prediction of the model, if you're saying that weight decay affects the speed of grokking—
Yes, and of ungrokking.
I think that is a prediction of the model — well, to be fair, it's a retrodiction, because the Power et al. paper already shows that grokking takes exponentially longer as you reduce the dataset size, and also — I forget the exact relationship — it definitely takes longer as you reduce the weight decay.
OK. And does ungrokking take longer as you reduce weight decay?
We don't show this result in the paper, but I'm fairly sure I remember that it does.
OK, cool. So I'd like to talk a bit about possible generalizations of your results. As written, you're basically talking about efficiency in terms of parameter norm: if you increase parameter norm, you're able to be more confident in your predictions, but that comes at a penalty if you train with weight decay. Now, as your paper notes, weight decay is not the only situation in which grokking occurs, and you basically hypothesize that there are other forms of regularization — regularizing against other forms of complexity — and that there could be some notion of efficiency with respect to those other forms of complexity that becomes relevant. I'm wondering: do you have thoughts on what other forms of complexity I should be thinking of?
Yeah. So we've already talked about one of them, which is that circuits might be competing for parameters on which to be implemented.
So this is a kind of capacity constraint, and you might think that circuits that can be implemented on fewer parameters — using less capacity, however you define capacity in the network — would be more efficient. Some relevant work here is the discussion of bottleneck activations — I think this is from the mathematical framework for transformer circuits work — which talks about other phenomena, like superposition, that you get from constrained capacity. So that's one more notion of efficiency. I think robustness to interference could be another kind of efficiency — how robust is the circuit to being randomly invaded by other circuits — and maybe robustness to dropout would be a similar thing there.
Yeah.
And then there are things like how frequently the circuit occurs, which might be important for whether, from a given random seed, you'll be able to find it.
So do you mean something like: what's its probability mass under the initialization prior over weights?
Yes — and also under the kinds of parameter states that SGD is likely to find.
So this is like the implicit prior of SGD. There's some work on the implicit regularization of SGD showing that it prefers similar kinds of circuits to what L2 might prefer, but probably it's different in some interesting way.
OK. So if I think about the results in your paper, a lot of them are generic across other potential complexity measures that you could trade off confidence against. But sometimes you rely on this idea — in particular for your analysis of grokking and semi-grokking — that if I scale up some parameters in every layer of a ReLU network, that scales up the logits by a certain factor, and therefore you get this parameter-norm trade-off, and I think that's involved in the analysis of semi-grokking versus ungrokking. So I guess the prediction here would be that semi-grokking is more likely to occur in settings where you're trading off against parameter norm, as compared to, say, robustness to dropout. Does that sound right to you?
So I think in general it will be very hard to observe semi-grokking in realistic settings, because you need such a finely tuned balance — you need all these ingredients. You need basically two circuits, or two pretty distinct families of circuits, with no intermediate circuits that can do the task well in between them. You need a way to arrange it so that the dataset size, or other hyperparameters, cause these two circuits to have very similar levels of efficiency. And then you also need it to be the case that, under those hyperparameters, you're able to actually find these two families of circuits — you'll probably find the memorizing one, but you need to be able to find the generalizing one in time. This just seems like quite a hard thing to arrange all at once, especially given that in realistic tasks you'll have multiple families of circuits that are able to do the training task to some degree.
OK, so semi-grokking seems unlikely. But I guess it does seem like the prediction would be that you'd be able to observe ungrokking for the kinds of grokking that don't depend on weight decay. Am I right that that's a prediction — and is this a thing you've tested, or should some enterprising listener do that experiment?
So, to be clear, the experiment here is: find a different notion of efficiency and then look for ungrokking under that notion?
The experiment would be: find an instance of grokking that doesn't happen because of weight decay — some grokking where you're not using weight decay. Then it should be the case that if you train and get grokking, and then train on subsets of various sizes, there should be a critical subset size where below it you ungrok and above it you retain the grokked solution, when you fine-tune on that subset.
Yeah, I think this is basically right, and our theory does predict this. I will caveat that dataset size may not be the right variable to vary here, depending on what notion of efficiency you're using.
Well, under all the notions of efficiency it sounded like there was a prediction that efficiency would go down as the dataset increased for the memorizing solution but not for the generalizing solution, right?
Right — as long as you believe that the process is selecting the most efficient circuit.
Which we might worry about if, as you mentioned, different seeds find generalizing solutions of different efficiency. Maybe you'd be worried about optimization difficulty, and in fact maybe something like parameter norm is easier to optimize against than something like dropout robustness, which is less differentiable or something.
Yeah, I think that's right. I was imagining things more like — I think you're right that in this kind of regime dataset size is pretty reasonable — I was imagining things like model-wise grokking, where on the x-axis, instead of the amount of data, you're actually varying the size of the model, the number of parameters or the capacity or whatever, but all of these different models are trained on the same amount of data for the same time. There it's less clear how exactly you would arrange to show ungrokking, because naively you can't take a model and then reduce its size. But maybe there are ways to show it there.
Gotcha.
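For an enterprising listener, here is roughly what the harness for that subset-retraining experiment could look like. It is an outline only: the three callables are deliberately left as stubs, to be filled in with whatever grokking-without-weight-decay setup you have found, and the fractions are arbitrary.

```python
# Sketch of the ungrokking experiment described above: take a grokked model,
# fine-tune it on subsets of the training data of decreasing size, and look for
# a sharp drop in test accuracy at some critical fraction. Everything here is
# an illustrative outline, not the paper's code.
from typing import Callable, Sequence

def ungrokking_sweep(
    train_grokked_model: Callable[[], object],        # stub: returns a freshly grokked model
    finetune: Callable[[object, Sequence], object],    # stub: continues training on a subset
    test_accuracy: Callable[[object], float],          # stub: evaluates on held-out data
    full_train_set: Sequence,
    fractions=(1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1),
):
    results = {}
    for frac in fractions:
        model = train_grokked_model()                  # fresh grokked network each time
        subset = full_train_set[: int(frac * len(full_train_set))]
        model = finetune(model, subset)                # fine-tune on the subset only
        results[frac] = test_accuracy(model)
    # Prediction being tested: accuracy stays high above some critical fraction
    # and collapses below it, at a point that does not move as you vary the
    # strength of whatever regularizer replaces weight decay.
    return results
```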
So, if it's all right with you, I'd like to move on to some general questions about you and your research. The first question: we've talked about your work on grokking and your work on latent knowledge in large language models; we haven't talked about it, but the other paper I know you for is the one on goal misgeneralization. Is there a common thing that underlies all of these and explains why you worked on all of them?
Well, one common thing between all of them is that they are all projects that are accessible as an engineer without much research experience, which was one of my main selection criteria for these projects. I guess my background is that I've mostly worked in software engineering; around 2019 I joined Google DeepMind, and about a year later I was working on the alignment team. I did not have any research experience at that time, but I was very keen to figure out how you can apply engineering effort to make alignment projects go better. So for the next two or three years I was mainly in learning mode, trying to figure out how people think about alignment, what sorts of projects the other people on the alignment team are interested in, and which of those look like they're going to be most accelerated by just doing good engineering fast. That's where a bunch of the early selection came from. I think now I feel pretty drawn to working on high-risk, high-reward things that might end up mattering if alignment by default — or the plan as I see it — doesn't go as expected. That feels like the kind of thing that is potentially more neglected, and if you think you need a bunch of serial research time, then doing that work now — before you get very clear signals that we haven't done enough research on some particular kind of failure mode — feels important.
OK, so should I be thinking of lines of research that are both approachable from a relatively more engineering-heavy background and also laying the foundation for work that might come later, rather than attempting to quickly solve a problem?
Yeah, that's right — that's certainly what I feel more drawn to. For example, I feel pretty drawn to the eliciting latent knowledge problem. There's interesting empirical work to do right now, in terms of figuring out how easy it is to actually extract truth-like things from models, as we've been discussing, but also, framing the problem in terms of methods that will scale to superintelligent systems feels like the kind of thing where you just need to do a bunch of work in advance — by the time you're hitting those sorts of problems, it's probably quite a bad situation to be in.
Gotcha. So what should I think of as your role in these projects?
I think it varies. I would describe it as a mix of coming up with good research ideas to try, trying to learn from people who have been around the alignment community much longer than me, and also trying to supply engineering expertise. For example, currently I'm working on sparse autoencoders for mechanistic interpretability, and I'm very new to mechanistic interpretability — however, many of the people I work with have been around in mech interp for a long time, and it's great for me to try to get knowledge directly from the source, in a way. At the same time, with sparse autoencoders in particular, part of what drew me to the project was Chris Olah's tweet where he said — I'm not sure exactly what he said, but it was something like — mech interp might be in a different mode now, where if SAEs work out, then it's mostly an engineering problem rather than a scientific problem. That kind of thing feels very exciting to me, if we're actually able to scale up to frontier models.
I do find myself thinking that there's still a lot of science left to do on SAEs, as far as I can tell.
Yeah, I don't disagree with that.
Perhaps I should explain for the listener: with a sparse autoencoder, the idea is that you want to understand what a network is thinking about, so you train a function from an intermediate layer of the neural network to a very large vector space — way more dimensions than the underlying thing — and then back to the activation space. You want to train this to be the identity function, but in such a way that the intermediate neurons of this function you've learned very rarely fire. The hope is that you're picking up the underlying axes of variation; that, hopefully, only a few of them are active at a time; and that, hopefully, they correspond to concepts that are interpretable, that the network uses, and that are facts about the network rather than just facts about the dataset you happened to train the autoencoder on. All three of those conditions seem like they need more work to be established — I'm not super up to date on the SAE literature, so maybe somebody's already done this, but that's a tangent.
I definitely agree — I think there's a ton of scientific work to do with SAEs. It just also happens to be the case that there's a more direct path to scaling up SAEs and getting some sort of mech interp working on frontier models, which, at least in my view, was absent with previous mech interp techniques: those were more human-intensive, with a much less direct path to doing the same kind of in-depth circuit analysis on larger models.
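Since sparse autoencoders come up here, a minimal sketch of the standard setup may help — written generically for illustration, not taken from DeepMind's implementation: an overcomplete encoder/decoder trained to reconstruct a model's activations, with an L1 penalty so that only a few latent features are active on any given input.

```python
# Minimal sparse autoencoder sketch in PyTorch (generic formulation, for
# illustration only). Activations are mapped up into an overcomplete latent
# space and back; training rewards faithful reconstruction while an L1 penalty
# keeps most latent features off for any given input.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)   # d_latent >> d_model
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, acts):
        latents = torch.relu(self.encoder(acts))      # non-negative feature activations
        recon = self.decoder(latents)
        return recon, latents

def sae_loss(recon, acts, latents, l1_coef=1e-3):
    mse = (recon - acts).pow(2).mean()                # reconstruct the activations
    sparsity = latents.abs().mean()                   # L1 penalty -> few active features
    return mse + l1_coef * sparsity

# Toy usage on random "activations" standing in for a real model's residual stream:
sae = SparseAutoencoder(d_model=64, d_latent=1024)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(256, 64)
for _ in range(100):
    recon, latents = sae(acts)
    loss = sae_loss(recon, acts, latents)
    opt.zero_grad(); loss.backward(); opt.step()
```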
So I guess I'd next like to ask about the alignment team at DeepMind. You've been there for a few years — what's it like?
It is, honestly, the best place I've ever worked. I find the environment very stimulating: there's a lot of freedom to express your opinions, to propose research directions, to critique, and to try to learn from each other. I can give you an overview of some of the projects we're currently working on, if that helps.
Yeah, that sounds interesting.
I think the team is roughly evenly split between dangerous capability evaluations, mechanistic interpretability, rater assistance, and various other emerging projects like debate or process supervision — that's roughly the split right now. Apart from that, it feels to me like an inflection point right now, because safety is getting a lot of attention within DeepMind. Anca Dragan recently joined us — she's a professor at Berkeley — and to me it feels like she has a lot of buy-in from leadership for actually pushing safety forward in a way that feels new and exciting. As one example of this, we're spinning up an alignment team in the Bay Area, and hopefully we'll have a lot more resources to do ambitious alignment projects in the future.
Sure. Will that be recruiting new people from the Bay Area, or will some people be moving from London to that team?
That's going to be mostly recruiting new people.
OK, gotcha.
So I guess the second-to-last thing I'd like to ask: we talked about this grokking work and this work on checking the CCS proposal — do you have thoughts on follow-up work on those projects that you'd really like to see?
Yeah, definitely. With the grokking paper, we've been talking about a bunch of potential follow-up work already; in particular, exploring other notions of efficiency seems really interesting to me. I also think the theory itself can still produce quite a lot of interesting predictions, and from time to time I keep thinking of new predictions I would like to try that just don't fit into my current work plans. An example of a prediction we haven't written down in the paper, which occurred to me a few days ago, is that our theory predicts that even at large dataset sizes where you're seeing grokking, if the exponent with which you convert parameter norm into logits is small enough, you should expect to see a non-zero amount of memorization — even at large dataset sizes — and so the prediction is that there should be a train–test gap that is small but not zero. And this is in fact true. So the thing you should be able to do with this is use the empirical estimates of memorization and generalization efficiency to predict the train–test gap at any dataset size. That's one example — I think this theory is pretty fruitful, and doing work like this is pretty fruitful; I would love to see more of it.
On the CCS side, a thing I would love to see is test beds for ELK methods. What I mean by that is examples of networks that are doing something deceptive, or that otherwise have some latent knowledge that you know is in there but is not represented in the outputs, where you then really try your hardest to get that latent knowledge out using all sorts of methods — linear probing, black-box testing, maybe anomaly detection. Without really good test beds, it's easy to fool yourself about the efficacy of your proposed ELK method. I think this is quite related to the model organisms agenda as well.
Well, I think that wraps up what we wanted to talk about. Thanks very much for being on the show.
Yeah, thanks for having me.
This episode was edited by Jack Garrett, and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. Filming occurred at FAR Labs. Financial support for the episode was provided by the Long-Term Future Fund and Lightspeed Grants, along with patrons such as Alexey Malafeev and Tor Barstad. To read a transcript of this episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net. [Laughter] [Music]