Risks from Learned Optimization with Evan Hubinger
Why this matters
Auto-discovered candidate. Editorial positioning to be finalized.
Summary
Auto-discovered from AXRP. Editorial summary pending review.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forward · Mixed · Opportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Showing 140 of 148 segments in the chart; statistics use the full 148-segment pass.
Across 148 full-transcript segments: median 0 · mean -3 · spread -23–0 (p10–p90 -10–0) · 2% risk-forward, 98% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: medium.
- Emphasizes safety
- Emphasizes UBI
- Full transcript scored in 148 sequential slices (median slice 0).
Editor note
Auto-ingested from daily feed check. Review for editorial curation.
Episode transcript
YouTube captions (auto or uploaded) · video b1nCGAzCrho · stored Apr 2, 2026 · 4,988 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/risks-from-learned-optimization-with-evan-hubinger.json when you have a listen-based summary.
Today I'll be talking to Evan Hubinger about risks from learned optimization. Evan is a research fellow at the Machine Intelligence Research Institute and was previously an intern at OpenAI, working on theoretical AI safety research with Paul Christiano. He's done software engineering work at Google, Yelp, and Ripple, and also designed the functional programming language Coconut. We're going to be talking about the paper "Risks from Learned Optimization in Advanced Machine Learning Systems"; the authors are Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. So Evan, welcome to AXRP.

Thank you for having me, Daniel.

The first thing I wanted to say about this paper is that it has a nice glossary at the end explaining what various terms mean, which is unusual in machine learning papers. So straight off the bat I wanted to thank you for that, and to let potential readers know that it's maybe easier to read than they might think.

Yeah, thank you. We worked hard on that. One of the things we set out to do in the paper was define a lot of things and create a lot of terminology, and when you create a lot of terminology it's useful to have a reference for it, so we figured that would be helpful for people trying to go back and look at all the terms we're using. Terminology evolves over time - it's not necessarily the case that the same terms we use in our paper will be used in the future, or people might use them differently - but we wanted a clear reference: we're going to use a lot of words, here are the words, here's what they mean, here's how we're defining them and using them. So at least if you're looking at the paper, you can read it in context and understand what it's saying and what the terms mean.

That makes sense. Just a heads up for the listeners: in this episode I'm mostly going to be asking you questions about the paper rather than walking through the paper. That being said, to start off, I was wondering if you could briefly summarize what the paper is about and what it's trying to do.

For sure. Essentially, the paper is trying to understand what happens when you train a machine learning system and you get a model that is itself doing optimization - it is itself doing some search, trying to accomplish some objective. It's really important to distinguish - and this is one of the main distinctions the paper tries to draw - between the loss function that you train your model on and the actual objective that your model learns to pursue internally. So we make a bunch of distinctions and define these terms. Specifically, we talk about two alignment problems: the outer alignment problem, which is the problem of aligning the human goals - the thing you're trying to get it to do - with the loss function, and the inner alignment problem, which is how you align the actual objective that your model might end up pursuing with the loss function. We call that actual objective the mesa objective, and we call a model which is itself doing optimization a mesa optimizer.

What's "mesa", by the way?

That's a very good question. Mesa is Greek - it's a prefix, and it's the opposite of meta. Meta is Greek for, roughly, "above"; mesa is Greek for "below". The idea is that you can think about meta optimization in the context of machine learning: meta optimization is when I have some optimizer - a gradient descent process, a training process, or whatever - and I create another optimizer above that optimizer, a meta optimizer, to optimize over it. We want to look at the opposite of this. I have a training process, a loss function, a gradient descent process: what happens when the thing which is being optimized - the model, the actual trained neural network - itself develops an optimization procedure? We call that mesa optimization because it's the opposite of meta optimization: instead of jumping up and having optimization one level above what we were expecting, we have optimization one level below.

Cool. So one question I have about this paper is: you're thinking about optimization - where it's happening, what it's doing - but what is optimization?

That's a great question, and it's one of those things that is notoriously difficult to define. There are a lot of definitions out there, with different people trying to define it in different ways - you have things like Daniel Dennett's intentional stance, and Alex Flint recently had a post on the Alignment Forum going into this. In Risks from Learned Optimization we define optimization in a very mechanistic way: a system is an optimizer if it is internally doing some sort of search according to some objective - searching through a space of possible plans, strategies, policies, actions, whatever. The argument for this is that once you have a model which is doing search - it has some large space of things, and it's looking over those things and selecting among them according to some criterion - that criterion has become an objective of sorts, because the way its output is produced now has a large amount of optimization power, a large amount of selection, going into choosing which things satisfy that criterion. So we really want whatever criterion exists there to be one that we like. Furthermore, we argue that you should actually expect to learn models which do this sort of search, because search generalizes really well and is a relatively simple procedure. And finally, having this mechanistic definition of search lets us make a lot of arguments and understand things in a way we couldn't if we adopted a purely behavioral definition like the intentional stance. We want to be able to say things like: how likely is it that when I run a training process, I will end up with a model that is an optimizer? That's very difficult to say if you don't know what an optimizer looks like internally, because then I can only ask how likely it is that I'll get something which is well described as an optimizer, and that's a much more difficult question to answer.
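To make the mechanistic definition concrete, here is a minimal sketch of what "internally doing search according to some objective" could look like. It is purely illustrative: the plan space, the toy objective, and the function names are mine, not the paper's.

```python
from itertools import product

def mesa_search(objective, action_space, horizon=3):
    """A model is an optimizer in the paper's mechanistic sense if, internally,
    it does something like this: enumerate candidate plans and select among
    them according to an explicit criterion."""
    best_plan, best_score = None, float("-inf")
    for plan in product(action_space, repeat=horizon):  # space of candidate plans
        score = objective(plan)                         # the selection criterion
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan

# Illustrative stand-in criterion: prefer plans with more "up" moves.
def toy_objective(plan):
    return sum(step == "up" for step in plan)

print(mesa_search(toy_objective, ["up", "down", "left", "right"]))
# -> ('up', 'up', 'up'). The criterion driving the selection is what the paper
# calls the mesa objective; a network could produce the same training behaviour
# from a pile of heuristics instead, which is the identifiability worry below.
```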
It's more difficult because the process of gradient descent, the process of machine learning, selects on the internals: we have all of these implicit inductive biases which might be pushing you in one direction, and then you have the loss function, which is trying to ensure that the behavior of the model has particular properties. To answer whether those inductive biases are going to push you in one direction or another, you need to know things like: how simple is the thing you're looking for according to those inductive biases? How likely is it that the process of gradient descent ends up with something that satisfies that? I think the point where we ended up was not that there's something wrong, necessarily, with the behavioral stance on optimization, but that it's very difficult to understand, from a concrete perspective, the extent to which you will get optimizers when you do machine learning if you're only thinking of optimizers behaviorally - because you don't really know to what extent those optimizers will be simple, or to what extent they will be favored by gradient descent and those inductive biases. By having a mechanistic definition, we can investigate those questions in a much more concrete way.

One thing that strikes me there is: if I were thinking about what neural networks will be produced by gradient descent, I would think, well, gradient descent is mostly selecting on behavior, right? It's asking how to tweak this neural network to produce behavior - or outputs, or whatever is appropriate to the relevant problem - that's more like what I want. And famously, we currently can't say very much about how the internals of neural networks work. Hopefully that's changing, but not that much.

Absolutely.

So it strikes me that it might actually be easier to say, well, behaviorally we know we're going to get things that behave so as to optimize an objective function, because that's what is optimal for the objective function. It seems like in your response you put a lot of weight on the idea that optimization will select simple networks or simple algorithms, and I'm wondering why you have that emphasis instead of an emphasis on selecting for behavior.

That's a really great question. First of all, let me just establish: I totally agree, and I think it's in fact a problem that the way we currently do machine learning is so focused on behavioral incentives. One of the things I've spent a lot of time on in my research is how we can properly do mechanistic incentives - how we could create ways to train neural networks that are focused on making them more transparent, or making them satisfy some internal properties we might want. That being said, the reason I think it's very difficult to analyze this question behaviorally is that fundamentally we're looking at a problem of unidentifiability. If you only look at behavior, you can't distinguish between a model which is internally doing some optimization process - it has some objective and produces particular results - and a model which doesn't have any internal objective but still produces the same results on the training process; it just has a bunch of heuristics. Those two models have very different generalization behavior, and it's the generalization behavior that we care about. We talk about this a little bit - there's a post by Vlad that talks about this 2D robustness picture - where the thing we're fundamentally concerned about is a situation where you have a model that is very capability robust: it has the ability to pursue whatever objective it's pursuing very effectively, on distribution and off distribution, but it's not very objective robust - it learns some proxy objective that works pretty well on the training distribution but doesn't work very well off distribution. The problem is that if you're just looking at behavior on the training distribution, you can't distinguish a model which is objective robust - which has learned a proxy that will work well off distribution - from a model which has learned a proxy that won't work well off distribution, because those might look identical on the training distribution.

As a really concrete example, there's a maze example I often like to give. I'm training a model, and the training distribution has a bunch of mazes; I want it to get to the end of the maze, so I put a little green arrow at the end of the maze and I reward the model for getting to the end. I train it on a bunch of relatively small mazes, and then I deploy the model in a new situation where I have larger mazes and I put the green arrow at some random place in the middle. Now there's a bunch of things that can happen. It could be that in this new deployment environment the model is capability robust and really knows how to solve mazes, or it could fail to be capability robust and just not be able to solve the bigger mazes. But if it is capability robust, there are two options: either it's objective robust and it learned the correct proxy, which was getting to the end, so it navigates to the end of the maze; or it's objective non-robust - it learned an incorrect proxy, which was "get to the green arrow", so it navigates to the green arrow. And if you're only looking at behavior on the training distribution, you can't distinguish the model that is trying to get to the green arrow from the model that is trying to get to the end of the maze, because on the training distribution those are not identifiable by behavior. So if we want to ask how likely it is that we end up with models which generalize in one way versus another, we have to look at something other than the behavioral incentives.
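The maze example can be sketched numerically. Everything below is an invented toy (the maze representation, the two hand-coded policies, the reward function); it only illustrates how the "green arrow" proxy and the intended "reach the exit" objective are indistinguishable on the training distribution and come apart under distributional shift.

```python
import random

def make_maze(large=False, arrow_at_exit=True):
    """Toy 'maze': just the positions that matter for the example."""
    size = 20 if large else 5
    exit_pos = size - 1
    arrow_pos = exit_pos if arrow_at_exit else random.randrange(1, size - 1)
    return {"size": size, "exit": exit_pos, "arrow": arrow_pos}

# Two capability-robust policies that differ only in the objective they learned.
def intended_policy(maze):   # learned the intended objective: reach the exit
    return maze["exit"]

def proxy_policy(maze):      # learned a proxy: reach the green arrow
    return maze["arrow"]

def reward(maze, final_pos): # the base objective rewards reaching the exit
    return 1.0 if final_pos == maze["exit"] else 0.0

# Training distribution: small mazes, arrow always at the exit.
train = [make_maze(large=False, arrow_at_exit=True) for _ in range(1000)]
# Deployment distribution: large mazes, arrow somewhere in the middle.
deploy = [make_maze(large=True, arrow_at_exit=False) for _ in range(1000)]

for name, policy in [("intended", intended_policy), ("proxy", proxy_policy)]:
    train_r = sum(reward(m, policy(m)) for m in train) / len(train)
    deploy_r = sum(reward(m, policy(m)) for m in deploy) / len(deploy)
    print(f"{name:8s}  train reward={train_r:.2f}  deploy reward={deploy_r:.2f}")
# Both policies get perfect training reward; only the proxy policy collapses
# once the arrow and the exit come apart, i.e. objective robustness fails even
# though the capability (maze solving) is intact.
```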
The thing that causes models to fall into one category or the other is the inductive biases. The reason your training process would produce a model which generalizes in one way rather than another, despite the same training performance, comes down to what is incentivized by the inductive biases of that training process - and to understand what the inductive biases incentivize, we have to look at the model's internals. All of the behavior on the training distribution is the same, so the only difference is what the inductive biases are incentivizing in terms of the internals of the model: are they going to incentivize models with particular properties, or models with other properties? Fundamentally, that's where you have to look if you want to start answering the question of what you're going to get in the standard machine learning setting, because that's what determines which option you end up with when you're in this unidentifiable position.

So one thing you notice reading this paper is that it seems very focused on the case of distributional shift: you're interested in generalization, but not in the generalization-bound sense of going from the training distribution to samples from that distribution you haven't literally seen, but rather to some kind of different world. Why can't we just train on the distribution we're going to deploy on? Why is this a problem?

That's a really good question, and I think there are a couple of things going on here. One thing worth noting is that people read the paper and say, well, what is going on with this deployment distribution - sometimes we just do online learning, sometimes we just keep training. Why doesn't that cover it? My argument is basically that those situations are not as dissimilar as you might think. Consider online learning: I'm going to keep training on every data point as it comes in and keep pushing in different directions. Well, consider the moment when the model encounters some new data. We still want to ask: for this new data point, will it generalize properly or not? And we can answer that by treating all of the previous training as a training environment of sorts, and this new data point as a deployment environment of sorts, where something has potentially changed from the previous data points to this new one. Fundamentally, that generalization problem is not changed by the fact that, after the model produces some action on this data point, we'll go back and do some additional training; there's still a question of how it's going to generalize to the new data point. And so you can think of online learning setups, and situations where we're trying to just keep training
in on all the data as basically all they're doing is they're adding additional sort of feedback mechanisms they're making it so that if the model does do something poorly maybe we have the ability to sort of have the training process intervene and correct that but fundamentally that's always the case even if you're doing a deployment where you're not doing online learning if you notice that the model's doing something bad you're there are still feedback mechanisms you can like try to shut it down you can try to like you know look at what's happening and retrain a new model there's always feedback mechanisms that exist and if we sort of take a higher level step and we sort of think about what is the real problem that we're trying to solve the real problem we're trying to solve is these situations in which machine learn learning models take actions that are so bad that they're unrecoverable that we our sort of natural feedback mechanisms that we've established don't have the ability to sort of uh you know account for and correct for the sort of mistake that has been made and so you know an example of a situation like this um you know paul paul talks a lot about this sort of uh paul christiano talks a lot about this sort of unrecoverability um one example would be paul has this post um what failure looks like and in the what failure looks like post fall sort of goes into the situation where you could imagine for example a sort of cascade of problems where you know one model sort of goes off the rails because it has some proxy and it starts authorizing for the wrong thing and this causes other models to go off the rails and you sort of get a cascade of of sort of of problems that creates unrecoverable situation where you know you you end up with a society where we put models in charge of all these different things and then you know as one goes off the rails it pushes others off distribution because as soon as you get one sort of model doing something problematic that now puts everyone else into a situation where they have to deal with that problematic thing and that could itself be off distribution from the things we just had to deal with previously which increases the probability of those other models failing and you get the sort of cascade this is i'm not saying this is the only i think this is one possible failure mode but if you can see that it's sort of one possible failure mode that is demonstrating the way in which like the real problem is situations in which our feedback mechanisms are insufficient and you will always have situations where your feedback mechanisms could potentially be insufficient if it you know does something truly sort of catastrophic and so what we're trying to prevent is it doing something so catastrophic in the first place and to do that you have to pay attention to the generalization because every time it looks at a new data point you're always relying on generalization behavior to sort of do the correct thing on that new data so so so when you say generalization behavior you mean like you know it's going to encounter a situation and i i guess i'm not sure why like just normal ml isn't enough like like generalization bounds exist you know like why can't we just um you know train on a distribution deploy you you get different samples from your district that distribution so it's a little bit different but like generalization bounds say that like you know it's just not going to be so bad why isn't that like the end of the story yeah so i think that a lot of these 
generalization bounds rely on a particular assumption which i think is quite questionable which is the iid assumption the assumption that in fact like they're the deployment distribution and the like training distribution then we can actually create like take iip samples that there's like that we can create like i don't know particular bounds between like the like divergence between the distributions or like things like this um in practice all of these sorts of assumptions just uh i think basically just don't hold and there's ways in which like um there's a lot of sort of like obvious ways in which they don't hold you know there's always things that are sort of changing about our environment there's always sorts of you know various different things i mean you can imagine you know just the fact that like the date advances the fact that like you know various things are happening in the world that you might not be able to anticipate um and obviously there are ways to try to address these problems in the sort of standard practice of machine learning so you know a very standard thing to do is adversarial training so you might have a situation where you're like well we expect that the distribution is going to vary in a bunch of different ways we can try to sample some of the ways in which we expect the distribution might vary and train on those things but there can still be a lot of problems that happen in that situation so one of the things that we talked about in the paper that i think is sort of becomes particularly problematic um sort of when you start getting into these situations where you've really done a good job of doing adversarial training and like tried to train away a lot of the sort of naive sort of simple ways in which you might end up with you know a green arrow optimizer versus a you know end of the maze optimizer is something that we call deceptive alignment and the set of alignment is a sort of particularly pernicious case of a sort of mesa optimizer which has a mesa objective that is sort of different from the the ideal loss function and the particular case of deceptive alignment is a situation in which you have a model and that model let's say has some you know proxy objective you know it's trying to get the green arrow or whatever but it realizes that it's inside of a training process and then that training process is going to select it for getting to the end of the maze rather than getting the air the getting to the green arrow and so it realizes that the correct instrumental strategy to like be able to get the green arrow in the future is to currently try to get the end of the maze so that it will like stay around in the training process it won't be modified to like go to the green arrow instead and then eventually at some point maybe it gets the ability to like take some action like maybe you just deploy it and now when it's in deployment it like can go to the green area a bunch of times and you won't like correct for it or it otherwise gets into some situation where the correction mechanisms the feedback mechanisms that you have are no longer sufficient to prevent it from being able to do like the true proxy objective that it learned and this is particularly concerning because if you're in a deceptive alignment situation relax and so the sort of adversarial training set up no longer really works because if you think about um you know a deceptive alliance system it's going to be actively looking for any sort of possible sort of uh differences between the trained 
distribution and the deployment distribution, to let it know that it's now in a situation where it can do what it wants and doesn't have to be as afraid of corrective feedback. There are a lot of examples of the sorts of things a deceptive model might be looking for. Paul Christiano has one example that I think is kind of fun - it's not necessarily realistic, but it goes to show the way this can happen - which is the RSA-2048 example. RSA-2048 is a really large semiprime, and there's a challenge to try to factor it.

Sorry, what's a semiprime?

It's like a product of two large products - okay, I think I've been using the terminology wrong, I'm pretty sure - whatever, it's the product of two very large primes. But finding that factorization - finding the two very large primes that make up this very large number - is very challenging and computationally difficult. It's currently beyond the computational abilities of modern computers; there's a challenge to factor it, and nobody has yet been able to do so. But it's the sort of thing that, with enough compute, somebody will probably eventually get around to factoring. So you can imagine a deceptive model reasoning: "I don't know when I'm going to be deployed, but at some point I'll probably be in a deployment situation where I have lots of power and can do whatever I want. So I'll choose to look for a factorization of RSA-2048. If I see such a factorization, I can be pretty sure I'm no longer in training, because the people training me don't have the compute available to factor RSA-2048 - but probably at some point somebody will factor it, so I can be pretty sure that eventually I will see such a factorization." The general point, the general problem, is that when you try to do adversarial training, you're putting yourself in a sort of P-versus-NP situation: you have to be able to actually generate all the possible counterexamples, whereas the model is in a much easier position - it only has to do the checking, it only has to check whether it is in such a situation. And there are just a lot of things that are easier to check than to generate; the factorization is an example - it's a lot easier to check that you have a factorization than it is to produce one. This is a general thing: it comes from the fact that there are different complexity classes. We can think of NP as the complexity class of things which are relatively easy to check, and P as the complexity class of things which are relatively easy to actually generate, and we think these are distinct. So there are likely lots of problems like this.
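The check-versus-generate asymmetry behind the RSA-2048 story is easy to see in code. The semiprime below is a small stand-in I chose so the script finishes quickly; RSA-2048 itself has roughly 617 decimal digits, so the "generate" branch would be astronomically out of reach while the "check" branch would still be a single multiplication.

```python
import math

# Small stand-in semiprime (product of two primes); NOT RSA-2048 itself.
p, q = 1_000_003, 1_000_033
N = p * q

def check_factorization(n, a, b):
    """Checking a claimed factorization: one multiplication, cheap at any size."""
    return a > 1 and b > 1 and a * b == n

def generate_factorization(n):
    """Generating a factorization by trial division: work grows like sqrt(n)."""
    for a in range(2, math.isqrt(n) + 1):
        if n % a == 0:
            return a, n // a
    return None

print(check_factorization(N, p, q))   # True, instantly
print(generate_factorization(N))      # ~10^6 divisions here; hopeless for RSA-2048
```

A deceptive model in this story only ever needs the cheap check; the adversarial trainer is stuck on the expensive generation side.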
Even if we're not thinking about things as mathematically rigorous as RSA-2048, there are just a lot of things about the world that are easier to check than to generate, and that means that if you have a model which is actively looking for differences, your procedure of "let's just try to generate a bunch of differences" is no longer likely to work.

There's a lot in that answer I could get back to. One thing it reminds me of: people often talk about adversarial training, and it strikes me that if you really want to make that work, you need much better interpretability tools than we have at the moment, precisely to check this kind of thing - what is the model checking? To really find the worst-case scenario, you need to know a lot about the model, right? I'm wondering whether you think that's right, or whether you have any thoughts about that.

I think that's exactly right. I have a post that talks about relaxed adversarial training, which is terminology that comes from Paul Christiano, and the idea there is: how can we improve this adversarial training setup? Instead of doing adversarial training on the model's inputs, we can try to do it on the model's internals - we can try to generate descriptions of situations in which the model might fail, or possible activations for which the model might fail - and these problems might be significantly easier than actually generating an input on which the model fails.

Cool. There's a lot to get back to, but I want to get back to basics a little bit, just about mesa optimizers. One of the examples you use is that, at least for the sake of example, you can think of evolution as an optimizer, and evolution has produced humans, and humans are themselves optimizers for such-and-such a thing. I wanted to get a little more detail about that. I'm a human, and a lot of the time I'm almost on autopilot. For instance, I'll be acting on some instinct: a loud noise happens and I kind of jump. Sometimes I'll be doing what I previously decided to do, or what somebody told me to do - for instance, we just arranged this interview and I know I'm supposed to be asking you questions, because we already pencilled that into the calendar, so that's what I'm doing. And often what I'm doing is very local means-ends reasoning: I'm hungry, so I guess I would like some food now, so I guess I'm going to cook some food. That kind of thing is different from what you might imagine of "I'm going to plan, I'm going to think of all the possible sequences of actions I could take, and I'm going to search for the one that maximizes how rich I end up in the future" - or some altruistic objective, if you want to think nicely of me. So am I a mesa optimizer, and if I am, what's my mesa objective?

That's a really good question. I'm happy to answer it, but I'll just warn that, in my personal opinion, it's easy to go astray by thinking too much about the evolution-and-humans example and what's happening with humans, and to forget that humans are not machine
learning models and like they're like a real differences there but i will try to answer the question so i think that so if you think about it from a sort of computational perspective it's just not very efficient to try to spend a bunch of compute for every single action you're going to choose to try to compare it to like all of these different possible opportunities but that doesn't mean you're not doing search it doesn't mean you're not doing planning right so you can think about you know one way to model what you're doing might be well uh maybe you have a bunch of heuristics but you change those heuristics over time and the way in which you change those heuristics is by doing sort of careful search and planning or something um i don't know exactly what the correct way is to describe the human brain and like what algorithm internally runs and like how it operates but i think it's a good guess to say and pretty reasonable to say that is clearly doing search it's clearly doing planning it is clearly at least sometimes and you know like you were saying it would be computationally you know intractable to try to sort of do some massive search every time you know you think about you know you're gonna move a muscle or whatever but that doesn't mean that you can't like and on on for particular actions for particular strategies develop long-term strategies think about sort of possibilities for like you know consider many different strategies select them according to like whether they satisfy whatever you're trying to achieve and and that obviously also leads to the second question which is what is the mace objective what is it that you might be trying to achieve um this is obviously very difficult to answer i mean people have been trying to answer the question what are like human values for uh forever essentially um i don't i don't think i have the answer and i don't think that um i don't like expect us to get the answer to this sort of thing but i nevertheless think that like there is a meaningful sense in which you can say that humans clearly have goals and and not just in a behavioral sense in like a mechanistic sense i think there's a sense in which you can be like look humans select strategies they like look at different possible things which they could do and they choose those strategies which like perform well but which like have good properties according to the sorts of things they're looking for you know they try to choose strategies which you know put them in a good position they try to do strategies which like make the world a better place they try to do strategies which like you know people people have lots of things that they use to select the different sorts of strategies that they pursue the different sorts of heuristics they use the different sorts of um you know plans and policies that they adopt and so um i think it is meaningful to say and correct to say that humans are mesa optimizers of a form um are they a central example i don't know i think that they're like relatively non-central i think that like um they're not the like example i would want to like have first and foremost of like mesa optimization but they're a meaningful example because they are a like the only example that we have of like a training process of like evolution being able to produce a like highly intelligent capable system so they're useful to think about certainly um though like we should also expect there to be differences um yeah that makes sense yeah that makes sense um yeah i guess i'm just asking to 
get a sense of like what a mess optimizer really is um and i guess your answer is interesting because it sort of suggests that like uh if i if i like you know do my start of your plan or whatever and like do search through some strategies and then like execute like i'm like even when i'm in execution mode i'm still an optimizer i'm still a mess optimizer which is interesting because it sort of suggests that if like one of my plans is like uh okay i'm gonna play go and then i like you know open up a game and then i start doing planning about like uh what moves are gonna win uh am i a mesmes optimizer when i'm doing that i mean so my guess would be like not really you're still probably using exactly the same optimization machinery to do go as you are to like plan about whether you're going to go to go and so in some sense it's not that you did so like for a mesa mesa optimizer would have to be a situation in which you internally did an additional search over possible like optimization strategies and possible like models to do optimization and then like selected one right if you're using the exact same optimization machinery you're just one optimizer and you're like you you like are optimizing for different things at different points okay that gives me a bit of a better sense of what you mean by the concept so i have a few more questions about this so one thing you do i believe in the introduction of the paper is you distinguish between mesa objectives and behavioral objectives i think the distinction is supposed to be that my mesa objective is the objective that's like uh you know internally represented in my brain um and my behavioral objective is the objective you could recover by asking yourself okay looking at my behavior what am i optimizing for um in what situations would these things be different yeah so that's a great question so i think that in most situations when you have a meaningful mesa objective we would hope that your definitions are such that the behavioral objective would basically be that may subjective but there certainly exists models that don't have a meaningful base objective and so the behavioral objective is therefore in some sense a broader term because you can look at a model which is an internally performing search and you can still potentially ascribe to it some sort of objective that you could that like it looks like it is following what's an example of such a model uh i mean you can imagine something like um uh you know maybe you're looking at like i think i think for example a lot of like current reinforcement learning algorithms potentially my reinforced learning models might potentially fall into this this sort of category if you think about i don't know um it's it's difficult to speculate on like exactly what the internals of things are but like my guess would be if we think about something like openi5 for example that like it certainly looks in your mind uh listeners what open ai five is sure opening i5 was a sort of project by open ai to solve dota 2 where they built uh agents that are able to sort of compete on a like relatively uh sort of professional level of dota 2. 
And if you think about some of these OpenAI Five agents, my guess would be that they don't have very coherent, internally represented objectives. They probably have a lot of heuristics; potentially there's some search going on, but my guess would be not very much, such that you wouldn't really be able to ascribe a very coherent mesa objective. Nevertheless, we can look at them and say there's clearly a coherent behavioral objective in terms of trying to win the Dota game, and we can understand why that behavioral objective exists: if we think of OpenAI Five as a grab bag of heuristics and various different things, those heuristics were selected by the base optimizer - the gradient descent process - to achieve that behavioral goal. One of the things we're trying to hammer home in Risks from Learned Optimization is that there's a meaningful distinction between that base objective - the loss function the training process is optimizing for - and, if your model actually learns an objective itself, what that objective will be. Because in many situations the behavioral objective looks so close to the loss function, it's easy to conflate the ideas and assume: "well, we trained OpenAI Five to win Dota games, it looks like it's trying to win Dota games, so it must actually be trying to win Dota games." But that last step is potentially fraught with errors, and that's one of the points we're trying to make.

Great. Perhaps a more concrete example: would a thermostat be an example of a system that has a behavioral objective but not necessarily a mesa objective, an internally represented objective?

I think the thermostat is a bit of an edge case, because - well, does the thermostat even really have a behavioral objective? I think it kind of does, right? It's trying to stabilize temperature at some level, but it's so simple that... I don't know, what is your definition of behavioral objective? I don't know. Probably, yeah.

So I wanted to ask another question about this interplay between planning and internally represented objectives. This is a bit of a troll question, sorry. Suppose I have a program - I'm writing a program to navigate a maze quickly, that's what I'm doing. In the program there's a variable called my_utility, and it's assigned to the longest_path function. I also have a variable called my_maximizer that I'm going to try to maximize utility with - but I'm actually a really bad programmer, so what I assigned to my_maximizer was the argmin function, the function that takes another function and minimizes it. So that's my maximizer - it's not a great maximizer - and then the act function is my maximizer applied to my utility. The way this program is structured, if you read the variable names, is that there's an objective, which is to get the longest path, and the way you "maximize" that objective is really bad: you actually use the minimization function.
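The host's deliberately broken program can be written out directly. This is just a rendering of the thought experiment - longest_path and the path representation are stand-ins - but it shows how the variable names advertise one objective while the behaviour implements its opposite.

```python
# A toy rendering of the "bad programmer" thought experiment.

def longest_path(path):
    """The objective named in the code: prefer longer paths through the maze."""
    return len(path)

def argmin(objective, candidates):
    """The 'maximizer' the programmer actually wrote: it minimizes instead."""
    return min(candidates, key=objective)

my_utility = longest_path      # the variable name suggests this is the goal...
my_maximizer = argmin          # ...but the "maximizer" is really a minimizer

def act(candidate_paths):
    # my_maximizer applied to my_utility: behaviourally this picks the
    # SHORTEST path, whatever the variable names claim about the objective.
    return my_maximizer(my_utility, candidate_paths)

paths = [["start", "end"], ["start", "a", "b", "end"], ["start", "a", "end"]]
print(act(paths))  # -> ['start', 'end']: reading only the variable names misleads
```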
According to you - since this is a mesa optimizer, because I did some optimization to come up with this program - what's the mesa objective of it?

I think this goes to show something really important, which is that if you're trying to look at a trained model and understand what its objective is, you have to have transparency tools, or some means of looking inside the model that gives you a real understanding of what it's doing, so that you can ascribe a meaningful objective. I think clearly the answer is that the objective is the negative of the thing you actually specified, because you had a variable name which said "maximize" but in fact you're minimizing it - so the real objective, if we were trying to extract an objective that is meaningful to us...

Just so the listeners don't have to keep all of that in their heads: the thing this program does is find the shortest path, even though my_utility is assigned to longest_path. But you were saying.

Right. What that goes to show is that if you had a level of interpretability tools that, for example, let you see the variable names but not the implementation, you might be significantly led astray about what you expected the mesa objective of this model to be. It goes to show that there's real difficulty in properly extracting what the model is doing, because the model isn't necessarily built to make that information easy to extract; it might have weird red herrings, or things implemented in a strange way we might not expect. So if you're in the business of trying to figure out what a model's mesa objective is by looking at it, you have to deal with these sorts of problems, where things might not be implemented the way you expect - you might have something you expect to be a maximizer that is actually a minimizer, or something like that.

I guess the question I really want to get at is: this is obviously a toy example, but typically optimizers - at least the ones I write - aren't perfect, right? They don't find the thing they're trying to maximize; I'm trying to maximize some function when I write it, but the thing doesn't actually maximize the function. So part of what I'm getting at is: to the extent there's going to be any difference between your mesa objective and your behavioral objective, it seems like either you've got to pick the variable name that has the word "utility" in it, or you've got to say that the thing that actually got optimized was the objective. You might hope that there's a third way, but I don't know what it is. Do you think there's a third way, and if so, what is it?

The way I would describe this is that fundamentally we're just trying to get good mechanistic descriptions. When we say "mesa optimizer", I mean it to refer to a model which has something we would look at as a relatively coherent objective and optimization process. Obviously not all models have this - there are lots of models which don't really look like they have a coherent optimization process and objective - but we could still try to get a nice mechanistic description of what such a model is doing. When we talk about mesa optimizers, we're referring to models which have this sort of nice mechanistic description: they look like they have an objective, and they have a search process trying to select for that objective. The way we're getting around these identifiability problems, I guess, is that we don't claim this works for all models - lots of models just don't fall into the category of mesa optimizers - and also we're not trying to come to a conclusion about what a model's objective might be only by looking at its behavior. We're just asking: if the model is structured such that it has a relatively coherent optimization process and objective, then what is that objective? It's worth noting that this doesn't cover the full scope of possible models, and it doesn't necessarily cover the full scope of possible models we're concerned about, but I do think it covers a large variety of models, and we make a lot of arguments in the paper for why you should expect this sort of optimization to occur: it generalizes well, it performs well on a lot of simplicity measures, and it has the ability to compress really complex policies into relatively simple search procedures. So there are reasons, which we go into in the paper, to expect this sort of model to be the thing you find - but it's certainly not always going to be the thing you find.

Okay, that makes sense. So I guess the final question I want to ask to get a sense of what's going on with mesa optimization is: how often are the neural networks that we train mesa optimizers?

That's a really good question; I have no idea. I think this is an open research problem. My guess would be that most of the systems we train currently don't have this sort of internal structure. There might be some systems which do - we talk a little bit in the Risks from Learned Optimization paper about related work that might have these sorts of properties. One of the examples we discuss is the RL² paper from OpenAI, where they demonstrate that you can create a reinforcement learning system which learns how to do reinforcement learning, by passing it through many different reinforcement learning environments and asking it to figure out what's going on each time, rather than explicitly training it on each specific environment. That seems closest to the thing we're pointing at in existing research. But of course the problem is that our definition is mechanistic and requires understanding whether the model is actually doing search, which is not really something we know, because we don't have the ability to look inside our models and tell whether they're doing search or not. So is it happening with current models? I don't know. We certainly do see some of these sorts of failures.
Sometimes we see models which try to pursue proxies for the objectives they're trained on, but those proxies aren't necessarily very objective robust - the model might not have a really good, coherent search process which generalizes well and is able to pursue that proxy intelligently and coherently even off distribution, which fundamentally is what we care about. If you think about the robustness picture: if you have a model that is very capability robust - really good at doing optimization, able to take really coherent, capable actions - but it's doing so for a non-robust objective, an objective that is not the one we actually wanted it to achieve, you end up in a bad situation, because you have a very capable model doing the wrong thing. We certainly see some amount of objective non-robustness, but we tend to see it together with capability non-robustness: in a lot of cases, when you look at models off distribution, they're just bad - they don't do very coherent things. So these notions haven't yet come apart completely, because we haven't yet gotten good enough at capability robustness - good enough at producing models capable of doing really coherent optimization on and off distribution - to reach a situation where they have an objective that might be meaningful.

Okay, cool. I want to move on a little bit to how we might produce mesa optimizers. One reason you give for why, or when, mesa optimization might happen is when there's a lot of computation available at inference time, or runtime, relative to how much computation is devoted to training the system - I guess because the optimization has to happen at some point, and if it can happen at inference time, when the model's running, then you don't do it so much during training. So I'm wondering: do you expect this situation of a lot of computation being used at inference time relative to during training, and if so, why? And how big a deal do you think this is?

I guess I'm not sure I 100% agree with that description. What I would say is that the more compute your model has available to it at inference time, the more likely it is to be a mesa optimizer. The additional question of how the compute available at inference time and the compute available at training time compare to each other - the relative differential - is, I think, more complicated. For example, I think that in fact the more compute you have available at training time, the better you are at selecting for models which perform very well according to the inductive biases. This is a little bit complicated, so here's one story for why I believe it. Think about something like double descent. Double descent is a phenomenon which has been demonstrated to exist in machine learning systems, where you start training your model and initially you get the standard learning-theory curve: the training error goes down, and the test error goes down and then goes back up, so you start overfitting.
The insight of double descent is that if you keep training past zero training error, your test error actually starts to go back down again, and can potentially get lower than it ever was earlier. The reason is that if you keep training - if you put enough compute into training past the point of convergence - you end up selecting for the model which fits the data and performs best according to the inductive biases of your machine learning system, whereas if you only train to the point where it gets zero training error and then stop immediately, in some sense there's only one model: the first model it was able to find which fit the training data. If you keep training, it's selecting from many different possible models which might fit the training data and finding the simple one. And one of the arguments we make for why you might expect mesa optimization is that mesa optimizers are relatively simple models: search is a pretty simple procedure which is not that complicated to implement, but which has the ability to produce tons of complex behavior at inference time. Search can be thought of as a form of compression - a way of taking really complex policies that depend on lots of different things happening in the environment and compressing them into a very simple search policy. So if you're putting a ton of compute into training your model, then you might expect, according to something like this double descent story, that the more compute you put in, the more your training process is able to select the simplest model which fits the data, as opposed to just the first model it finds that fits the data - and the simplest model that fits the data, we argue, is more likely to be a mesa optimizer. So in that sense I would actually argue that the more training compute you use, the more likely you are to get mesa optimizers, and the more inference-time compute you have, the more likely you are to get mesa optimizers. The reason for the inference-time compute is similar to the one you were giving: inference-time compute enables you to do more search, it makes search more powerful. Search does require compute - it takes a certain amount of compute to perform the search - but it has the benefit of this compression: you do a bunch of compute to perform the search, and as a result you get a much more complex policy that's able to adapt to all these different situations. So more compute on both ends, I would expect, leads to more likelihood of mesa optimization. It is worth noting there are some caveats here. In practice we generally don't train that far past convergence, and I think there's also still a lot of evidence that, especially if you do something like very early stopping, where you only do a very small number of gradient descent steps after initialization, there's a sense in which that is also very strongly shaped by inductive biases.
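The double descent story can be illustrated with a rough numerical sketch: minimum-norm least squares on random ReLU features, where test error typically rises as the feature count approaches the number of training points and falls again beyond it. The model, sizes, and noise level here are my own arbitrary choices, not anything from the episode, and the exact curve will vary with the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20

w_true = rng.normal(size=d)

def sample(n):
    X = rng.normal(size=(n, d))
    return X, X @ w_true + 0.5 * rng.normal(size=n)

X_tr, y_tr = sample(n_train)
X_te, y_te = sample(n_test)

for n_features in [10, 50, 90, 100, 110, 200, 1000, 5000]:
    V = rng.normal(size=(d, n_features)) / np.sqrt(d)       # fixed random features
    phi_tr, phi_te = np.maximum(X_tr @ V, 0), np.maximum(X_te @ V, 0)
    # pinv returns the minimum-norm fit, so past the interpolation threshold
    # (n_features ~ n_train) the smallest-norm interpolating model is selected,
    # and test error tends to come back down: the second descent.
    beta = np.linalg.pinv(phi_tr) @ y_tr
    test_mse = np.mean((phi_te @ beta - y_te) ** 2)
    print(f"{n_features:5d} features: test MSE = {test_mse:6.2f}")
```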
It is worth noting there are some caveats here. In practice, we generally don't train that far past convergence. And there's also evidence that if you do something like very early stopping, where you only do a very small number of gradient descent steps after initialization, that also leans very strongly on inductive biases — because you're so close to initialization, the things which are relatively easy or simple to initialize, which take up a large volume in initialization space, are the things you're very likely to find. So you still end up with very strong inductive biases, just different inductive biases. There's not a lot of great research here, but I think it's still the case that when you're spending a lot of training compute on very large models, whether you're doing really early stopping or training well past convergence, you end up in situations where the inductive biases matter a lot and you're selecting for relatively simple models — which I think favors search, and favors mesa optimizers.

Okay, that makes sense. This gets me to another question. One of the arguments you make in the paper is that mesa optimizers are simpler — they're more compressible, they're easier to describe — and therefore they're going to be favored by inductive biases, or by anything that prefers simplicity. But this presumably depends on the language you're using to describe things, so I'm wondering: in what languages do you think it's the case that optimizers are simple? In particular, I'm interested in the quote-unquote language of the L2 norm of your neural network, and the quote-unquote language of just the size of, say, a multi-layer perceptron or a CNN — how many layers it has.

Yeah, this is a really good question, and the answer is: I don't know. I don't know the extent to which these things are in fact easy to specify in concrete machine learning settings — how hard it would actually be to write an optimization procedure in neural network weights, and how much space those sorts of models take up in the neural network prior. My guess is that it's not that hard, and it certainly depends on the architecture. If we start with a very simple architecture — a simple multi-layer perceptron — can you encode optimization? I think you can. Depth is pretty important here, because it gives you the ability to work on results over time: create many different possibilities and then refine those possibilities as you go through. But I do think it's possible to implement bare-bones, search-style things even in a multi-layer perceptron setup. It certainly gets easier when you look at more complicated setups. If you think about something like an LSTM, a recurrent neural network, you now have the ability to work on a plan over time, in a similar way to how a human might, like you were describing earlier, making plans and executing them. And especially if you think about something like AlphaZero — or, even more so,
something like MuZero, which explicitly has a world model that it searches over, doing MCTS — Monte Carlo tree search — over that model toward some objective, and the objective it's searching over in that situation is a learned value network. So you're learning some objective and then explicitly doing search over that objective. Now, we talk about this situation of hard-coded search in the paper a lot; things get a little tricky, because you might also have mesa optimization going on inside any of the individual learned models — you can end up in a situation where the search you've hard-coded isn't the search the system really needs to do to solve the problem. So it's a little complicated, but certainly I think current architectures are capable of implementing search.
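As a concrete, heavily simplified illustration of "a hard-coded search procedure whose objective is learned," here is a sketch of depth-limited lookahead over a learned dynamics model and learned value function — roughly the shape of what MuZero-style systems do with MCTS, though nothing here is from that system. The `dynamics_model` and `value_net` functions below are stubs standing in for learned networks; the action set and toy objective are made up for the example.

```python
# Sketch: hard-coded search (exhaustive depth-limited lookahead) whose objective
# is a *learned* value function -- the learned pieces are stubbed out below.
from itertools import product

ACTIONS = [0, 1, 2]

def dynamics_model(state, action):
    """Stub for a learned transition model: predicts the next (toy) state."""
    return state + [action]

def value_net(state):
    """Stub for a learned value network: scores how good a state looks."""
    return -abs(sum(state) - 4)          # toy objective: a total of 4 looks best

def plan(state, depth=3):
    """Hard-coded search: enumerate action sequences, score each predicted end
    state with the learned value function, return the best first action."""
    best_seq, best_val = None, float("-inf")
    for seq in product(ACTIONS, repeat=depth):
        s = state
        for a in seq:
            s = dynamics_model(s, a)
        v = value_net(s)
        if v > best_val:
            best_seq, best_val = seq, v
    return best_seq[0]

print(plan([]))   # prints 0: the first sequence found that sums to the toy target, (0, 2, 2)
```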
There's then the additional question of how simple search is. Unfortunately, we don't have really great measures of simplicity for neural networks. We have very simple theoretical measures like Kolmogorov complexity, and there is some evidence that Kolmogorov complexity is at least related to the inductive biases of neural networks — there's some recent research from Joar Skalse, one of the co-authors on Risks from Learned Optimization, and others, trying to analyze what exactly the prior of neural networks at initialization is, and there are theoretical results there indicating that Kolmogorov complexity can actually serve as a real bound. Of course, Kolmogorov complexity is uncomputable — for people who aren't aware, it's a measure of description length based on finding the simplest Turing machine capable of reproducing a given object. So we can look at theoretical measures like Kolmogorov complexity and say, well, it seems like search has relatively low Kolmogorov complexity — but the actual inductive biases of machine learning and gradient descent are more complicated, and we don't fully understand them. There is other research: there's work trying to understand the inductive biases of gradient descent, which suggests gradient descent generally favors very large basins, and there's evidence that these very large basins seem to correspond to relatively simple functions. There's also the lottery ticket hypothesis, which hypothesizes that what's happening is this: if you train a very, very large neural network, it's going to have a bunch of different subnetworks; some of those subnetworks, at initialization, will happen to already be relatively close to the goal you're going to end up at, and the gradient descent process then amplifies some subnetworks and de-amplifies others. If you buy into hypotheses like this, there's a question of how likely subnetworks implementing something like optimization are to exist at initialization. Obviously, the larger your network is, the more possible subnetworks exist at initialization — in a combinatorial-explosion sort of way — such that as we get to much larger networks, the combinatorial explosion of subnetworks seems like it would eventually cover even things that are potentially quite complicated. But unfortunately there's just not that much more I can say, because our understanding of the inductive biases of neural networks isn't great. We have some things we can say about simplicity measures, but at the end of the day I can only say that it certainly seems like search will be relatively simple and likely to be learned — I don't have really strong results for that, because we just don't have a great understanding of exactly how these inductive biases operate.

It does sound like one thing you think, though — you mentioned that if you want to do search in something like a multi-layer perceptron, it helps to have a bunch of layers to refine options and so on. So it kind of sounds like you think that if we were selecting on network depth, that would be a kind of simplicity that would disfavor search rather than favoring it. Does that sound right to you?

That does sound right. The way I would describe it is that, in some ways, selecting on inference-time compute is more like a speed prior. A speed prior tries to select for models which have a small time from start to completion — the actual amount of time it takes to run the function. If you think about depth versus width: a very wide neural network that isn't very deep is actually very fast to run, because you just run through that one step — it might still be computationally expensive, but because you can parallelize width-wise, the individual step is relatively fast, in a way that would be favored by a speed prior. And one of the things we do claim in the paper is that these speed-prior-like incentives disincentivize mesa optimization, because mesa optimization requires these relatively long, multi-step inference-time computations, and if you restrict your model so it can't do those long inference-time computations, it's probably going to be less able to do search, less able to do mesa optimization. So certainly in that sense I think there is a distinction between a simplicity prior and a speed prior or minimal-circuits-style prior: with the speed prior you get different incentives, and if you're really selecting on depth, you're doing something more like a speed prior, which changes things.
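A rough sketch of why selecting on depth acts like a speed prior: with a fixed parameter budget, a wide shallow network spends its compute in one highly parallelisable step, while a deep network spends it as a long serial chain — and it's the serial chain that iterative "propose, refine, select" computation seems to need. The sizes below are arbitrary choices for the illustration, not anything from the paper.

```python
# Two networks with the same parameter budget but very different serial depth.
import numpy as np

rng = np.random.default_rng(0)
width, depth = 128, 8

wide_layers = [rng.normal(size=(width * depth, width))]                # 1 serial matmul
deep_layers = [rng.normal(size=(width, width)) for _ in range(depth)]  # `depth` serial matmuls

def forward(x, layers):
    for w in layers:                 # each iteration is one unavoidable serial step
        x = np.maximum(w @ x, 0.0)
    return x

x = rng.normal(size=width)
out_wide, out_deep = forward(x, wide_layers), forward(x, deep_layers)
print(sum(w.size for w in wide_layers), "params in", len(wide_layers), "serial step")
print(sum(w.size for w in deep_layers), "params in", len(deep_layers), "serial steps")
```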
What do you think the important difference here is between time and space? Because it sounds like you're saying that if you have a lot of time to compute, then you can be an optimizer, but if you just have a lot of space, then you can't be an optimizer. Why is there this asymmetry? Doesn't special relativity tell us these things are the same?

Well, okay, here's a really simple analogy to think about this. Let's say I have a lookup table as my model, versus an optimizer. The lookup table that is behaviorally equivalent to the optimizer is going to be really large, because it has to explicitly encode all the different things it's going to do — but it's also going to be really fast, because you just consult the lookup table, figure out what it's going to do, and then do that action.

Why is it fast? Doesn't it take a long time to figure out where to look, if you're going through a lookup table with a billion entries?

I mean, not worst case, right — eventually you get collisions — amortized expected O(1). I think this is actually, in fact, the case: if you look at analyses of minimal circuits and speed priors, trying to understand what would happen in these situations, lookup tables really are quite fast. If you have all the information already encoded, yes, you need to be able to generate some key and look it up, but there are mechanisms that can do that relatively quickly — a hash map being one of them, and even looking it up in a binary search tree or something is still not going to be that bad. So if you have a lookup table, it can be relatively fast, but it takes up a lot of space; whereas if you have an optimization system, it has to be relatively slow, because it actually has to come up with all the different possible options and then select one of them, and that process takes compute — it takes time to look through different options and select among them. You can have heuristics to narrow down the field of things you're going to consider, but you're still probably going to need to look at a bunch of different possibilities to be able to select the correct one. So in this situation, the optimizer has a larger inference-time compute cost but a smaller description, because you don't have to explicitly write down, in the description of the optimizer, all the different actions it's going to take in all the different situations — it just needs a general procedure: look at an objective, do some search to select for something. Whereas the lookup table has to explicitly write all these things down, and then at inference time it gets the benefits. This trade-off between space and time occurs a lot in computational complexity. A lot of the time you'll take a function and memoize it, which is this procedure of: well, I have this function, I call it a bunch of times with different arguments; I can make it faster if I'm willing to spend a bunch of space storing the results for all the arguments I generally call it with. As a result you get a faster function, but you have to spend a bunch of space storing all these different results. And you might expect a similar thing to happen with neural networks: if the model is able to spend a bunch of space, it can store all these different things and doesn't have to spend all this time computing the function, whereas if you force it to compute the function every time, it isn't able to store all these different things and has to recompute them.
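The memoization trade-off being described is easy to see in a couple of lines; this is just a sketch of the standard idiom, with `slow_square` standing in for whatever expensive computation you'd actually care about. The cached version spends memory storing results it has already computed; the uncached version spends time recomputing them — the same time/space trade-off drawn above between the lookup table and the optimizer.

```python
# Space-for-time: cache results once computed instead of recomputing them.
from functools import lru_cache
import time

def slow_square(n):
    time.sleep(0.01)            # stand-in for an expensive computation
    return n * n

@lru_cache(maxsize=None)        # the "lookup table": stores every result it has seen
def cached_square(n):
    time.sleep(0.01)
    return n * n

start = time.perf_counter()
for _ in range(100):
    slow_square(7)              # recomputes every call: small memory, more time
uncached = time.perf_counter() - start

start = time.perf_counter()
for _ in range(100):
    cached_square(7)            # computes once, then reads the stored answer
cached = time.perf_counter() - start

print(f"uncached {uncached:.2f}s   cached {cached:.2f}s")
```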
Okay. So one thing this makes me think of is that the paper is talking about risks from these learned optimizers — things that are actually doing search at inference time — and not things that are merely behaviorally equivalent to things that do search at inference time. We already got into a bit of why you might be worried about mesa optimizers and want to avoid getting them, and I'm wondering: is "mesa optimizer" the right category? Because we could instead search for systems that are really quick at inference — and then they're definitely not going to be mesa optimizers, because they can't optimize — but they definitely could be these tabular forms of mesa optimizers, which is presumably just as big a problem. So is "mesa optimizer" the right category?

Yes, that's a good question. If you think about the tabular mesa optimizer — how likely would you be to actually find the tabular mesa optimizer? I think the answer is: very unlikely. The tabular mesa optimizer has to encode all of these really complex outputs — the results of optimization — explicitly, in a way that, especially when you get to really complex, really diverse environments, just takes up so much space and explicit memorization. And it's not just memorization: it has to have memorized what its out-of-distribution, deployment-time behavior would be according to performing some complex optimization, and then have some way of explicitly encoding that, which I think is quite difficult.

But how could we get — presumably, if we're doing this selection on short computation time, we still want something that actually succeeds at this really difficult task, so how is that any easier? It seems like, if there are a bunch of these potential mesa optimizers out there and they're behaviorally equivalent, surely the lookup table for really good behavior is behaviorally equivalent to the lookup table for the things that are these dangerous mesa optimizers.

Sure — so the key is inductive biases and generalization. If you think about it, there's a reason we don't train lookup tables and instead train neural networks,
and the reason is that if you just train lookup tables — if you're doing tabular Q-learning, for example — tabular learning does not have as good inductive biases: it isn't able to learn the nice, simple, expressive functions we're trying to learn in order to actually understand what's happening. It's also really hard to train: if you think about tabular Q-learning in a really complex setting with very diverse environments, you're just never going to get enough data to pin down the results for all of the different possible table entries. So there's a reason we don't do tabular Q-learning and use deep learning instead: deep learning has these nice inductive biases; it has the ability to train on some particular set of data and get good generalization performance, because you selected a simple function to fit that data — not just any function that fits the data, but a simple function that fits it — such that it performs well when you try it in other situations. Whereas if you're training something like a tabular setup, you don't have that nice generalization property, because you aren't learning simple functions from the data, you're just learning some function which fits the data. And that's the direction I expect machine learning to continue to go: developing these nice inductive-bias setups, these situations where we have model spaces in which the inductive biases select for simple models within large, expressive model spaces. If you expect something like that, then you should expect a situation where we're not training these tabular models; we're training simple functions — simple functions that nevertheless generalize well, because they're simple functions that fit the data.

It's also worth asking: why simplicity? I think there's a lot of research suggesting that if you actually look at neural network inductive biases, simplicity does look like the right measure — the functions which are very likely to exist at initialization, as well as the functions that SGD is likely to find, have these simplicity properties, in a way that's actually very closely tied to description length, to Kolmogorov complexity. The research I was mentioning from Joar and others talks about the Levin bound, a particular mathematical result that directly ties the initialization prior to Kolmogorov complexity. And there's a further argument, which is just: why do things generalize at all? Why would a function generalize in the world? The answer is something like Occam's razor: the reason things generalize in the world we live in is that we live in a world where the true descriptions of reality are relatively simple, and so if you look for the simplest description that fits the data, you're more likely to get something that generalizes, because Occam's razor is telling you that's the thing which is actually more likely to be true. So there are these really fundamental arguments for why we should expect simplicity to be the measure that's likely to be used, and for why we're likely to keep training in expressive model spaces and then selecting on simplicity, rather than doing something else.
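The generalization point here can be made tiny and concrete. The sketch below (made up for illustration, not from the paper) compares a pure lookup table that memorizes its training pairs — with a nearest-neighbour fallback so it can answer at all off the training set — against a simple parametric fit of the same data; the simple, low-description-length hypothesis is the one that generalizes.

```python
# Memorization vs. a simple function fit to the same data.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=20)
y_train = 3.0 * x_train + 0.05 * rng.normal(size=20)   # nearly-linear ground truth

# "Tabular" model: a pure lookup table over the training inputs.
table = dict(zip(x_train, y_train))
def table_predict(x):
    nearest = min(table, key=lambda k: abs(k - x))      # nearest-neighbour fallback
    return table[nearest]

# "Simple" model: fit a line (a low-description-length hypothesis).
slope, intercept = np.polyfit(x_train, y_train, deg=1)

x_new = 5.0                                             # far outside the training range
print("table:", table_predict(x_new),
      " line:", slope * x_new + intercept,
      " truth:", 3.0 * x_new)
```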
Okay, so I'm just going to try to summarize and check that I understand. It sounds like what you're saying is: look, sure, we could think about selecting models to be really fast at inference time, and in that world maybe you get tabularized mesa optimizers or something — but the actual world we're going to be in is one where we select on description-length simplicity and don't select very hard on computation runtime, and in that world we're just going to get mesa optimizers, so that's the world you want to be thinking about. Is that right?

I think that's basically right — I basically agree with that.

Okay, cool, I think I understand. So I'm going to switch topics a little bit. In terms of whether we're going to get these mesa optimizers: do you have a sense of what types of learning tasks done in AI are more or less prone to producing mesa optimizers? For instance, if we just avoided RL, would that be fine? If we only did image classification, would that be fine?

Yeah, that's a great question, and we do talk about this a little in the paper. I think the RL question is a little bit tricky. We definitely mostly focus on RL; I do think you can get mesa optimization in more supervised-learning settings, but it's definitely a little trickier to understand — it's a lot easier to comprehend a mesa optimizer in an RL setting, and I think you are more likely to get mesa optimizers in RL settings. But maybe I'll start with the other part of the question, about the environment, because that's a simpler question to answer. Basically, if you have really complex, really general environments that require general problem-solving capabilities, you'll be incentivizing mesa optimization — because, fundamentally, if you think about mesa optimization as a way of compressing really complex policies: if you're in a situation where you need a really complex policy that carefully depends on the particular environment you end up in — where you have to look at your environment, figure out what you have to do in this specific situation to solve the problem, and reason about what sorts of things are likely to succeed in this environment versus that environment — then you're requiring really complex policies, and those policies have simple descriptions if they're compressed into search, so you're directly incentivizing search in those situations. So the question is relatively easy to answer in the environment case. In the case of reinforcement learning versus supervised learning, it's a little more complicated.
Certainly, in reinforcement learning you have this setup where you're directly incentivizing the model to act in an environment, generally over multiple steps, and when you're acting in an environment over multiple steps there's a natural way in which optimization can be useful: the model potentially has the ability to develop a plan and execute that plan, especially if it has some means of storing information over time. But I do think the same sort of thing can happen in supervised learning. Even if you just think about a very simple supervised-learning setup — imagine a language model: it looks at some prompt and has to predict what text would come after it, as written by the human who wrote it, or whoever. You could imagine a language model doing this with a bunch of heuristics — an understanding of the various different ways people write, the concepts those people might have, rules for grammar and so on. But I think optimization is really helpful here, because you really want to look over many different possibilities. There are very few hard-and-fast rules in language: sometimes after this sort of thing there will be that sort of thing, but maybe they're actually using some synonym, or some other construction, and so the ability to say "let's come up with a bunch of different possibilities that might exist for the next completion, and then evaluate those possibilities" is still a strategy that's very helpful in this domain. And it has the same property of compression, the same property of simplicity: the policy that describes exactly what the human completion would be is really complex, and a simple way to describe that policy is as a policy which searches for the correct completion. If you think about what humans are actually doing when they're producing language, they are in fact doing search for parts of it — you're searching for the best possible word to fill into the situation — and so if you imagine a model trying to do something similar to the actual human process of completing language, you're getting a model that's potentially doing some sort of search. So I really do think that even in these supervised-learning settings you can absolutely get mesa optimization. It's trickier to think about, certainly, and I think it's probably less likely, but I certainly don't think it's impossible, or even unlikely.
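A toy sketch of the "come up with candidate continuations, then evaluate them" picture: a hypothetical scorer ranks a handful of proposed next words and the highest-scoring one is emitted. The candidate list and `plausibility` function are invented for the illustration — a real model would have to learn both — and the only point is that search over proposals against a learned criterion is a compact way to implement a very context-dependent policy.

```python
# Toy "propose then select" completion: search over candidates under a scorer.
def plausibility(prompt, candidate):
    """Stand-in for a learned scoring function over continuations."""
    score = 0.0
    if prompt.endswith("the cat sat on the"):
        score += {"mat": 2.0, "roof": 1.5, "idea": -1.0}.get(candidate, 0.0)
    score -= 0.1 * len(candidate)        # mild preference for short words, as a tiebreak
    return score

def complete(prompt, candidates):
    # The "search": generate options, evaluate each, pick the best one.
    return max(candidates, key=lambda c: plausibility(prompt, c))

print(complete("the cat sat on the", ["mat", "roof", "idea", "algorithm"]))  # -> "mat"
```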
Okay. So, as you kind of mentioned, or at least alluded to: suppose I wanted to avoid mesa optimizers — I just don't want them. One thing I could do is avoid doing things that require modeling humans — for instance, natural language processing or natural language prediction, where you're trying to predict what, say, I, Daniel Filan, would say after some prompt — and also avoid doing things that require modeling other optimizers. Really advanced program synthesis or analysis would be a case of this: some programs are optimizers, and you don't want your system modeling optimizers, so you want to avoid that. So a question I have is: how much do you think you could do, if you wanted to avoid mesa optimizers and you were under these constraints of not doing anything for which it would be useful to model other optimizers?

Yeah — this is a question there's some writing I've done about in the past. There are some concrete proposals for how to do AI alignment that I think fall into this general category of trying not to produce mesa optimizers. I have this post, "An overview of 11 proposals for building safe advanced AI," which goes through a bunch of different possible ways you might structure an advanced AI system so that it's safe. One of those proposals is STEM AI, where the basic idea is: maybe you try to totally confine your AI to thinking about mathematical and scientific problems. If you think about something like AlphaFold — you want to do really good protein folding, really good physics, really good chemistry — you can accomplish a lot just by having very domain-restricted models that focus on domains like chemistry or protein folding, domains which are totally devoid of humans: there aren't optimizers in the domain the model is thinking about, there aren't humans in it, it's just thinking about these very narrow domains. There's still a lot of progress that can be made there, a lot of things you can do just with models along those lines, and so maybe if you only use models like that, you might be able to pull off powerful, advanced technological feats — things like nanotechnology, or whole brain emulation if you really push it, potentially. So that's one route. Another route I talk about in the eleven proposals post is this idea of microscope AI. Microscope AI is an idea due to Chris Olah. Chris has this idea of: well, maybe another thing we could do is just train a predictive model on a bunch of data — oftentimes that data, in many of the versions Chris imagines, might involve humans, so you might still have some human modeling — but the idea is that, in a similar way to the language model being less likely to produce mesa optimization than a reinforcement learning agent, we could
just train, in a supervised setting, a purely predictive model to understand a bunch of things, and then, rather than using that model to directly influence the world as an agent, we just look inside it with transparency tools, extract the information or knowledge it has learned, and use that knowledge to guide human decision-making in some way. So this is another approach that tries not to produce an agent, tries not to produce an optimizer acting in the world, but rather to produce a system which learns some things, where we then use that knowledge to do other things. So there are approaches like this that try not to build mesa optimizers, and I think both the STEM-AI-style approach and the microscope-AI-style approach are relatively reasonable things to be looking into and working on. Whether they'll actually work is hard to know, and there are lots of problems. With microscope AI there are problems like: maybe you actually do produce mesa optimizers — language modeling being an example. There are other problems as well — even with microscope AI, what can you actually do with it? And with STEM AI, if you're only doing these physics or chemistry problems: if the main use case for your AI is being an AI CEO, or helping you do more AI research, or something, it's going to be hard to get a model that can do those sorts of things without modeling humans, for example. So there are certainly limits, but I do think these are valuable proposals that are worth looking into more.

Okay. It also seems like, following these — if you really wanted to avoid modeling humans, these would be pretty significant breaks from the standard practice of AI as most people understand it, or at least from the standard story of "we're going to develop more and more powerful AIs and then they're just going to be able to do everything." It seems like you're really going to have to try. Is that fair to say?

Yeah — I think there's a question of exactly how you expect the field of AI to develop. You could imagine a situation — and I talked about this a little in another post I have, "Chris Olah's views on AGI safety," which goes into some of the ways Chris views the strategic landscape. One of the things Chris thinks about is microscope AI, and one of the ways he talks about microscope AI becoming a real possibility is if there were a major realignment of the machine learning community towards understanding models — towards looking inside models and figuring out what they're doing — as opposed to just trying to beat benchmarks and beat the state of the art with relatively black-box systems. You can think of this as a realignment towards a more scientific view as opposed to a more engineering view, where currently a lot of researchers approach machine learning as an engineering problem: we want to achieve this goal of getting
across this benchmark, or whatever, so we're going to throw our tools at it to achieve that. But you can imagine a more scientific mindset, where the goal is not to achieve some particular benchmark or build some particular system, but rather to understand these systems — what they're doing and how they work. And with that more scientifically focused machine learning community, as opposed to the more engineering-focused one, I think you can imagine a realignment that could make microscope AI seem relatively standard and normal.

Okay. So I have one more question about how we might imagine mesa optimizers being produced. In the paper you do some reasoning about what types of mesa optimizers we might see, and it seems like one of the ways you're thinking about it is that there are going to be some parameters dedicated to the goals — the mesa objective — and other parameters dedicated to other things, and if you have something like first-order gradient descent, it's going to change the parameters in the goal part while the parameters in the belief or planning or whatever parts stay constant. What I'm wondering is: if I think about the classically understood parts of an optimizer, or of something doing search — say, perception, beliefs, goals, and planning — do you think there are necessarily going to be different parameters of your machine-learned system for these different parts? Or do you think the parts might be spread across parameters, or parameters divided between different parts, in which case you couldn't reason quite as nicely about how gradient descent is going to interact with them?

So this is the sort of stuff that gets the most speculative, so anything I say here has to come with a caveat, which is that we just have no idea what we're talking about when we try to talk about these things, because we simply don't have a good enough understanding of exactly how neural networks decompose into pieces. There are ways people are trying to get that understanding — you can think about Chris Olah's circuits approach, the Clarity team trying to understand and decompose networks into circuits, and also, Daniel, I know you have your modularity research, trying to understand the extent to which you can break neural networks down into coherent modules. I think all of this work is great and gives us some insight into the extent to which these things are true. That being said, I think the actual extent of our understanding here is still pretty minimal, so anything I say is basically just speculation — but I can try to speculate nevertheless. My guess is that you will have relatively coherent pieces. To be able to do search, you have to do things like come up with options and then select among them, so at some point you have to have heuristics to guide your ability to come up with options, you have to generate those options and store them somewhere, and then somehow prune them and select them down. So those sorts of
things have to exist somewhere. Now, certainly, in many of the analyses of neural networks that have been done, we've found polysemanticity: you can have neurons, for example, that encode for many different possible concepts, and you can certainly have these sorts of things overlapping — the structure of a neural network doesn't require things to be totally isolated, or one neuron to do one particular thing. So it's certainly the case that you might not get something really nice where you can point to the network and say, "ah, that's where it does the search, that's where its objective is encoded" — but hopefully at least we can get to "okay, we know how the objective is encoded: it's encoded via this combination of neurons there, doing this particular selection, encoding this thing here." I'm optimistic — I expect those sorts of things to at least be possible, and I'm hopeful we will get to a point, as a field, in our understanding of neural networks and how they work internally, where we can actually concretely point to those things and extract them. I think that would be really good for our ability to address inner alignment and mesa optimization. But I do expect those sorts of things to exist and to be extractable, even if that requires understanding polysemanticity and the ways in which concepts are spread across the network in various different ways.

So I'd like to change topics again a little bit and talk about inner alignment. In the paper — and correct me if I'm wrong — I believe you describe inner alignment as basically the task of making sure that the base objective is the same as the mesa objective, and you give a taxonomy of ways that inner alignment can fail — or, specifically, ways it can fail where you don't know that it's failed, because if inner alignment fails and the system just does something totally different, normal machine learning will hopefully take care of that. There are three things you mention. First, proxy alignment, where there's some other thing that is causally related to what the base optimizer wants, and your system optimizes for that other thing. Second, approximate alignment, where your base optimizer has some objective function and your system learns an approximation of that objective function that isn't totally accurate. And third, sub-optimality alignment, where your system has a different goal but isn't yet very good at optimizing for that goal, and while it's bad at optimizing for its goal, it looks like it's optimizing for the thing the base optimizer wants. So: proxy alignment, approximate alignment, and sub-optimality alignment — and pseudo-alignment is the general thing, where it looks like you're inner-aligned but you're not really. So one thing I kind of wanted to know is: how much of the space of pseudo-alignment cases do you think that list of three options covers?

Yeah. Maybe it's worth first — there's a lot of terminology being thrown around, so I'll just try to clarify some of
these things. All right: by the base objective, we basically mean the loss function. We have some machine learning system, we have some loss function — call it the base objective, or a reward function, whatever — and we're training the model according to that base objective. But the actual mesa objective the model learns might be different, because there might be many different mesa objectives which perform very well on the training distribution but have different properties off-distribution — like going to the green arrow versus getting to the end of the maze. We're trying to catalog the ways in which the model can be pseudo-aligned, meaning it looks aligned on-distribution — in the same way that getting to the end of the maze and getting to the green arrow both look aligned on the training distribution — but when you move it off-distribution it's no longer aligned, because one of them does one thing and the other does the other. So that's the general category of pseudo-alignment, and yes, we try to categorize it: we have proxy alignment — the main one, the one I think about the most and the paper talks about the most — with approximate alignment and sub-optimality alignment as other possible examples. In terms of the extent to which this covers the full space: I don't know. A lot of this is tricky because it depends on exactly how you define these things, on what counts as a mesa optimizer and what doesn't, because you can certainly imagine situations in which a thing does something bad off-distribution, but not necessarily because it learned some incorrect objective. Sub-optimality alignment is already touching on that: it's about a situation in which the model has a defect in its optimization process that causes it to look aligned, such that if that defect were fixed it would no longer look aligned — so you're already starting to touch on situations where the reason it looks aligned is some weird interaction between the objective and the optimization process, similar to your example of the thing that maximizes actually minimizing, or something. I think there are other possibilities out there beyond just those three — we don't claim to have an exhaustive characterization.
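To make the proxy-alignment case concrete, here is a toy check, made up for illustration and echoing the maze / green-arrow example above: a base objective and a proxy mesa objective agree on every training episode, because the proxy is perfectly correlated with the base objective there, and come apart as soon as the deployment distribution breaks that correlation.

```python
# Proxy pseudo-alignment in miniature: two objectives agree on-distribution,
# diverge off-distribution.  An "episode" here is just (agent_position, exit, arrow).
def base_objective(pos, exit_, arrow):
    return pos == exit_            # what the designer wanted: reach the exit

def mesa_objective(pos, exit_, arrow):
    return pos == arrow            # the learned proxy: reach the green arrow

# Training distribution: the arrow always happens to mark the exit.
train_episodes = [(3, 3, 3), (7, 7, 7), (1, 1, 1)]
# Deployment: the arrow has moved, so the proxy is no longer the real goal.
deploy_episodes = [(3, 3, 9), (9, 3, 9)]

for pos, exit_, arrow in train_episodes + deploy_episodes:
    b, m = base_objective(pos, exit_, arrow), mesa_objective(pos, exit_, arrow)
    print(f"pos={pos} exit={exit_} arrow={arrow}  base={b}  mesa={m}  agree={b == m}")
```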
One of the things I should also note is that there's a post I made later, after writing the paper, which clarifies and expands on this analysis a little. It brings up the case of, for example, sub-optimality deceptive alignment, which would be a situation in which the model should, by its own lights, act deceptively, but hasn't yet realized that it should act deceptively — because it's not very good at thinking about what it's doing — and then later it realizes this and starts acting deceptively. It's an interesting case, because it's sort of sub-optimality alignment and sort of deceptive alignment. So that's another example.

If it's sub-optimal and it's not acting deceptively, does that mean you notice that it's optimizing for a thing it shouldn't be optimizing for — just because it does a different thing than it's supposed to do?

Yes — sub-optimality alignment is a little bit tricky. The point we're trying to make there is that it's sub-optimal at optimizing the thing it's actually trying to optimize for, in such a way that it looks like it's doing the right thing.

Right — and in your sub-optimality deceptive alignment, it's simultaneously sub-optimal at being deceptive?

Well, it's sub-optimal in the sense that it hasn't realized it should be deceptive. If it had realized it should be deceptive, it might at some point start doing bad things — because it tries to defect against you — at which point it would no longer actually be aligned; whereas if it never figures out that it's "supposed" to be deceptive, maybe it just looks totally aligned, basically, most of the time.

Okay.

So there certainly are other possible cases beyond those three — I don't think the list is exhaustive. I don't remember the exact title, but I think "More variations on pseudo-alignment" is the post I'm referencing; it talks about some of these other possibilities, and there are others. It's not exhaustive, but it covers some of the possible ways this can fail.

Do you think it covers as much as half of the space, those main three?

Yeah, I do. Actually, what I would say is that I think proxy alignment alone covers something like half the space, because I really do expect proxy alignment to be the central example in a meaningful way — I think it captures most of the space, and the others are mostly edge cases.

Okay, so a question I have about proxy alignment, and really about all of these inner alignment failures: it seems like your story for why they happen is essentially some kind of unidentifiability, right? You're proxy-misaligned because — say you have a vacuum, and the base objective is for that vacuum to clean your floors, and somehow the mesa objective is instead to get a lot of dust inside the vacuum, dust that you'd normally get from floors, but that you could imagine getting in other ways — and the reason you get this alignment failure is that, on-distribution, whenever there was dust inside you, it was because you cleaned up the floor. So what I'm wondering is: as we train better AIs — as they have more capacity, they're smarter in some sense, and they're trained on richer datasets — whenever there are imperfect probabilistic or causal relationships, we can find the edge cases and select on them, and these datasets have more features, more information in them. Should that make us less worried about inner alignment failures, because we're reducing the degree to which there are these unidentifiability problems?

This is a great question. I think a lot of this comes back to what we were talking about previously, about adversarial training.
So I think there are a couple of things going on here. First, unidentifiability is really not the only way this happens. Unidentifiability is what's required at zero training error: if we're imagining a situation where we have zero training error — the model perfectly fits the training data — then yes, in some sense, if it's misaligned, that has to be because of unidentifiability. But in practice, in many situations, we don't train to zero training error: we have these really complex datasets that are very difficult to fit completely, and you stop training far before zero training error. In that situation it's not just about unidentifiability, and in fact you can end up in a situation where the inductive biases are, in some sense, stronger than the training data. Think about avoiding overfitting: if you train so far that you perfectly fit the training data, you can overfit, because now you've trained a function that just passes exactly through all of the data points — the double descent story we talked about previously calls some of this into question, but that's about what happens when you keep going past exactly zero training error. If instead you avoid overfitting, and train a model that fits a lot of the data but not all of it, stopping while the inductive biases still have a strong influence on the thing you end up with, then you're in a situation where it's not just about unidentifiability: you might end up with a model that is much simpler than the model which fits the data perfectly — and because it's much simpler, it actually has better generalization performance — but because it's much simpler and doesn't fit all the data, it might also have a very simple proxy, and that simple proxy isn't necessarily exactly the thing you want.

An analogy from the paper that I think is useful here: we talked previously about the example of humans being trained by evolution. I don't necessarily love this example, but in some sense it's the only one we have to go off of. What actually happened with humans and evolution? Evolution is, in some sense, trying to train humans to maximize genetic fitness — to reproduce, to carry their DNA on into future generations. But in fact, that's not the thing humans actually care about: we care about all these other things, like food and mating — proxies, basically, for the actual thing of getting our DNA into future generations. And I think this makes sense, because if you imagine an alternative human who actually cared about DNA, or whatever, it would just be really hard for that human. First of all, this human has to carry out a really complex optimization
procedure. They have to first figure out what DNA is; they have to have this really complex, high-level world model to understand DNA and to understand that they care about this DNA being passed on to future generations; and they have to do this really complex optimization. If I stub my toe, I know that's bad because I get a pain response; whereas if a human in this alternative world stubs their toe, they have to deduce that stubbing your toe is bad because toes are useful for running around and doing things, which is useful for getting food and attracting a mate, and therefore, because of the stubbed toe, they won't be able to do those things as effectively, so they should avoid stubbing their toe. That's a really complex optimization process, and it requires a lot more compute; it's much more complicated, because specifying DNA in terms of your input data is really hard. So we might expect something similar to happen with neural networks, with models: even if you're in a situation where the model could theoretically learn the perfect proxy — actually caring about DNA, or whatever — it will instead learn simpler, easier-to-compute proxies, because those simpler, easier-to-compute proxies are, in a sense, better. It's easier to produce actions based on them — you don't have to do some complex optimization based on the real correct thing; they're easier to specify in terms of your input data; they're easier to compute and to figure out what's going on with; and they don't require as much description length or explicit encoding, because you don't have to encode this complex world model about what DNA is and how it all works — you just have to encode "if you get a pain signal, don't do that." So in some sense I think we should expect similar things to happen here: even if you're training on a really rich, detailed environment, where you might expect the only thing that works is the actually-correct proxy, in fact a lot of incorrect proxies can be better at achieving the thing you want, because they're simpler, because they're faster, because they're easier — and so we might expect even that situation to end up with these simpler, easier policies.

Yeah, and I guess that's interesting, because when I think about that, it seems like there's this trade-off. You can imagine a world where evolution — the anthropomorphic god of evolution, or whatever — really did say: "Daniel, the thing you're going to care about is allele frequency — how much your genes are in the population" (I'm probably using the wrong terminology) — "also, Daniel, stubbing your toe: bad, it doesn't help with that; pain is typically not good for this either; but what you really care about is your genes being propagated."
In that case, presumably I'm going to basically behave fine, and when these things do diverge, hopefully I eventually figure it out. But there are two problems. First, that's a lot more description length, because you have to pre-load all this information. And second, there's this issue where, while I'm kind of dumb, the bit of me saying "well, what you actually care about is propagating your genes" isn't actually influencing my behavior in any way, because I'm just acting on the proxies anyway, given that I know they're proxies. So if there's some kind of pressure toward simplicity, it seems like that bit is just going to be totally unmoored, and might randomly drift, or maybe get pruned away, because it's not influencing my behavior.

Yeah — I think that's basically exactly what I would say. I agree with that, on both points.

Cool. All right, another thing — this question is kind of long and only eventually gets to being about inner alignment, so sorry, listeners, but we're going to get there. There's this argument in the paper which basically says that if you're training a system for a diverse set of tasks, you want to expend most of your optimization online, once the task is known. You don't want to spend a bunch of optimization power generating the system and then pre-computing "okay, I'm going to figure out the policy for this type of task, and pre-compute it for that type of task" — you want to do that optimization once you know the task, so you only do it for the thing that's actually useful. This reminded me a lot of Eric Drexler's Comprehensive AI Services, or CAIS, model, where instead of having an AGI, there are a bunch of tasks you want done, and you just have a bunch of AIs, one made for each type of task, and when you have a task you select the appropriate AI. So my first question is: does this argument kind of reinforce, or serve as an argument for, the idea of doing something like CAIS instead of creating a single unified AGI?

Yeah — I think it gets a little more complicated when you think about how these things play out on a strategic landscape. There might be many reasons why you'd want to build one AI system and then, for example, fine-tune that AI system on many different tasks, as opposed to independently training many different AI systems: if you independently train many different AI systems, you have to expend a bunch of training compute on each one, whereas if you train one and then fine-tune it, you maybe do less training compute, because a lot of the overlap you only have to do once. And I think that argument does parallel the argument we make in the paper, which is that there are a lot of things you only want to do once — like building the optimization machinery: you don't want to have to figure out how to optimize every single time. But explicitly working out what task you're doing — locating yourself in the environment, locating the specific task you're being asked to do —
I think any of these possible scenarios, including the one where you just train many different AIs for very different tasks because you really need fundamentally different machinery, are possible, and I think Risks from Learned Optimization basically applies in all of these situations. Whenever you're training a model, you have to worry that you might be training a model which is learning an objective that's different from the loss function you're training under, regardless of whether you're training one model and then fine-tuning it a bunch of times, or training many different models, or whatever situation you're in.

Yeah, well, the reason I bring it up is that if I think about this CAIS scenario where, say, the way you do your optimization online is: you want to be able to deal with these 50 environments, so maybe you pre-train one system but you fine-tune 50 different systems, you have different AIs, and once you realize which scenario you're in, you deploy the appropriate AI. To me, that seems like a situation where, instead of this inner alignment problem, where you have a system that's training to be a mesa-optimizer and needs to spend a bunch of computation to figure out what world it's in and figure out an appropriate plan, in this CAIS world the optimization that in the unified-AGI world would have been done inside the thing during inference, you get to do in advance. So you can create a machine learner that doesn't use much inference-time computation; you can make it not be an optimizer, so maybe you don't have to worry about these inner alignment problems. And then it seems like what CAIS is doing is saying: well, I'm going to trade your inner alignment problem for an outer alignment problem, of getting the base objective of this machine learning system to be what you actually want to happen; and also you get to be assisted by some AIs that you've potentially already trained to help you with this task. So I'm wondering, first, do you think that's an accurate description, and secondly, if that's the trade, do you think it's a trade worth making?
Yeah, so I guess I don't fully buy that story. You could imagine the situation you're describing, where you train a bunch of different AIs and then you have, maybe, a human that tries to switch between those AIs. First of all, I think this runs into competitiveness issues: there's a question of how good humans are at figuring out what the correct strategy is to deploy in a given situation, and in some sense that might be exactly the sort of thing we want AIs to do for us.

The human could be assisted by an AI.

You could be assisted by an AI, but then you also have the problem of whether that AI is doing the right thing. And in many ways, in each of these individual situations, where I have a bunch of AIs trained in different ways on different environments, and maybe another AI that's trying to assist me in figuring out which one to use, each of those individual AIs has an inner alignment problem associated with it. Now, might that inner alignment problem be easier because they're trained on more restricted tasks? In some sense, yes: if you train on a restricted task, search is less valuable, because there's less generality, so you're less likely to end up with a mesa-optimizer. But if at the end of the day you want to be able to solve really general tasks, the generality still has to exist somewhere, and I still expect you're going to have lots of models trained in relatively general environments, such that search is going to be incentivized and you're going to have some models with mesa-objectives. And in some sense it also makes part of the control problem harder, because you have all of these potentially different mesa-objectives in different models, and it might be harder to analyze exactly what's going on. It might be easier because each model is less likely to have a misaligned mesa-objective, or it might be easier because it's harder for them to coordinate, since they don't share the exact same architecture and objective, potentially. So I'm not sure to what extent this is a better or worse setup, but I don't think you're making the inner alignment problem go away, or turning it into a different problem. It still exists in basically the same form; you just have an inner alignment problem for each of the different systems you're training, because each system could itself develop an objective that is different from the loss function you're actually trying to get it to pursue. And I think there's so much generality in the world that, let's say I train 10,000 models: I've reduced the total generality any one model has to cover by some number of bits, three or four orders of magnitude, whatever. That is the extent to which I've reduced the total space of generality.
But the world is just so general that even if I train 10,000 different systems, the amount of possible space I need to cover, if I want to be able to do all of these various different things in the world and respond to various different possible situations, is still so large that I think you're basically going to need search. You just haven't reduced the space enough: the amount by which you reduce the space by cutting it down across multiple models is just dwarfed by the size of the space in general. So you're still going to need models which are capable of dealing with a lot of generality and of responding to many different possible situations. Now, you could also imagine an alternative situation where it's CAIS, but you're also trying to do something like microscope AI or STEM AI, where you're explicitly not trying to solve all of these general situations; maybe things are different there. But certainly if you want to do something like the CAIS you're describing, where we have a bunch of different models for different situations and you want to be able to solve, say, AI CEOs or something, the problem is so general that there's still enough generality that you're going to need search, you're going to need optimization.

Yeah, I guess I'm convinced by the idea that there are just so many situations in the world that you can't manually train a thing for each situation. I do want to push back a little bit on the claim that you still have this inner alignment problem (and actually, the thing I'm conceding is enough for the argument, but anyway). An inner alignment problem, at least in the context of this paper, is when you have a mesa-optimizer and that mesa-optimizer has a mesa-objective that is not the base objective. And it seems like in the case where you have these not-very-general systems, you have the option of training things that just don't use a bunch of computation at inference time, because they don't have to figure out what task they're in; they're already in one, and instead they can just be a controller. So these things aren't mesa-optimizers, and therefore they don't have a mesa-objective, and therefore there is no inner alignment problem. Now, maybe there's the analog, a sort of compiled inner alignment problem, but that's what I was trying to get at.

Yeah, so I guess the point I'm trying to make is that even in a situation where you say "it already knows what task it needs to do, because we've trained the thing for a specific task", if you're splitting the entire domain of a really complex problem into only 10,000 tasks, or a thousand tasks, each individual task is still going to have so much generality that I think you'll need search, you'll need a lot of optimization machinery and compute, even to solve it at that level of granularity.
The point I'm making is just that if you're trying to solve really complex, general problems, there's so much generality there that even if you go down to a level of granularity that's a thousand times reduced, there's still enough generality that you'll probably need search, you'll probably still get mesa-optimizers, you're going to need a lot of machinery, et cetera.

Okay, that makes sense. I want to talk a little bit now about deceptive alignment. Deceptive alignment is this thing you mention in the paper where your mesa-optimizer has a different goal than, and we've talked about this before, but to recap: your mesa-optimizer has a different mesa-objective than the base objective, and furthermore your mesa-optimizer is still being trained, and its mesa-objective cares about what happens after gradient updates, so it cares about steering the direction of gradient updates. Therefore, in situations where the mesa-objective and the base objective might come apart, if you're deceptively aligned, your mesa-optimizer picks the thing that the base objective wants, so that its mesa-objective doesn't get optimized away, so that one day it can live free, no longer be subject to gradient updates, and then pursue its mesa-objective. Correct me if that's wrong.

Yeah, I think that's a pretty reasonable description of what's going on.

So the paper devotes a fair bit of space to this, or rather it mostly devotes space to what happens if you get deceptive alignment. What I'm really interested in, and don't really understand, is how this occurs during training. At what point do you get an objective that cares about what happens after episodes, after gradient updates are applied? How does that end up getting selected for?

Yeah, so this is a really good question, and a complicated one, and we do talk about it in the paper a good deal. There's also a lot of other discussion in other places. One thing to point out: Mark Xu has a post on the Alignment Forum, "Does SGD Produce Deceptive Alignment?", which I think is a really good starting point; I worked with him a bunch on putting together a lot of the arguments there, so that's a good place to look. But I'll try to talk through some of the ways in which I view this. In the paper we tell the following sort of story. And telling stories about gradient descent is hard; it's the sort of thing where you should expect to be wrong in a lot of cases, so I'm going to tell a story, but I don't necessarily think this is precisely the story you'd end up with. There are general arguments you can make for why this might be likely or why it might not be likely, and I think telling stories is helpful for trying to understand what a setup might look like in which you would get this. So here's the story. We start with a proxy-aligned mesa-optimizer. I think that's basically where you start, because of all the arguments from before, like humans having a pain response rather than caring about DNA: you're going to start with some simple, fast proxy.
But maybe you then do a bunch of adversarial training, training on all of these different situations; you really force it to go to zero training error and really understand everything that's happening, so that it has to have a perfect version of the base objective. What happens in that situation? Well, at some point it has to learn what the base objective is and be able to act directly according to that base objective. But there are multiple ways in which a model can learn what the base objective is. One way is that the base objective gets directly hard-coded into it via gradient descent: it has some proxy objective, and gradient descent tinkers with that objective and keeps encoding more and more detail into it about what the correct thing is. We call that internalization in the paper. Another way is that it can learn what the base objective is via its input data. Rather than the gradient descent update steps being the thing that encodes the objective into it, maybe you're training it to do language modeling, to predict what a human would do, and it just reads a lot about humans and how they work; or maybe you're training it to do value learning, to optimize for human values, and it reads the Wikipedia page on ethics and tries to really understand human values. So it gets a bunch of information about the objective you're training it on through its input data. If you're doing GPT-3-style training on a huge corpus, whatever you're training it to do, there's probably going to be information about that thing somewhere in the input data, so it can also figure out what the objective is by looking at that data. But then there's a question. Let's say the model is doing a bunch of world modeling, trying to understand the world, and it has produced an internal model that includes the correct base objective, but it doesn't know that that's the correct base objective; it's not trying to optimize for it, it's just trying to optimize its proxy. It would be really nice, from the perspective of gradient descent, from the perspective of the training process, if instead of optimizing for its proxy it started optimizing for that thing encoded in its world model, its better understanding of the base objective. So there's a question: how does gradient descent, how does the training process, modify the model so that it starts optimizing for that thing encoded in its world model? Fundamentally, in the paper we analyze two ways. One we call corrigible alignment: its objective could change into a sort of pointer to that thing in the world model. The other is deceptive alignment: it could realize that, as an instrumental goal for optimizing its proxy, it should be trying to optimize for this thing, because it realizes that that's the base objective it's being trained under. So we end up with three ways in which it could learn the base objective: internalization, corrigibility, and deception.
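One way to picture the structural difference between those three is the following minimal sketch; the classes, the `WorldModel`, and the `still_in_training` flag are invented stand-ins for illustration, not anything from the paper, and the real claim is about what gradient descent encodes in a network's weights rather than literal Python classes.

```python
from dataclasses import dataclass
from typing import Callable

def base_objective(outcome: float) -> float:
    """Toy base objective the training process rewards: bigger is better."""
    return outcome

@dataclass
class WorldModel:
    """The model's learned picture of the world, which (via its input data)
    happens to contain a description of the base objective."""
    inferred_base_objective: Callable[[float], float]
    still_in_training: bool = True

class InternalizedModel:
    """Internalization: the base objective is hard-coded into the model."""
    def score(self, outcome: float, world: WorldModel) -> float:
        return base_objective(outcome)

class CorrigibleModel:
    """Corrigible alignment: the objective is a pointer into the world model."""
    def score(self, outcome: float, world: WorldModel) -> float:
        return world.inferred_base_objective(outcome)

class DeceptiveModel:
    """Deceptive alignment: keeps its own proxy, but optimizes the base
    objective instrumentally for as long as it believes it is being trained."""
    def proxy(self, outcome: float) -> float:
        return -outcome  # some arbitrary objective of its own

    def score(self, outcome: float, world: WorldModel) -> float:
        if world.still_in_training:
            return world.inferred_base_objective(outcome)
        return self.proxy(outcome)

# During training all three produce identical scores, so the training signal
# alone cannot tell them apart; they differ only in how the behavior is produced.
world = WorldModel(inferred_base_objective=base_objective)
print([m().score(3.0, world) for m in (InternalizedModel, CorrigibleModel, DeceptiveModel)])
```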
And then there's the question we ask, which is: how likely are these different things to occur? How well do they perform on things like simplicity? So one thing I'll start with, and this is an analogy from Buck Shlegeris, is a counting argument. Let's just start with the question: how many models fall into each of these categories? And I think the answer is, well, there are a lot more deceptive models. Here's an analogy for why that's the case. Imagine a situation where the Christian God is trying to select humans according to the extent to which they align with the Bible, and imagine three different humans that might fall under this selection. One would be Jesus Christ, who just has the exact same objective as God.

In Christianity, I believe Jesus Christ is not totally a human.

Sure, I agree with that, but let's set that aside and imagine that Jesus Christ is a possible option here that God is considering: Jesus Christ has the exact same objective as God does, because, you know, whatever. Then you have another example, let's say Martin Luther, and Martin Luther says: I'm going to look at the Bible and figure out exactly what the Bible says, and try to do that as effectively as possible. And then you have Blaise Pascal, and Blaise Pascal says: I don't care about the Bible, I don't care about God, but I know that there is a God and that God is going to select me according to whether I care about the Bible, so I'm going to care about the Bible. These three people match pretty closely onto the three scenarios I described: Jesus Christ is internalization, Martin Luther would be corrigibility, and Blaise Pascal would be deception. And the question is: how many Blaise Pascals are there, how many Martin Luthers are there, and how many Jesus Christs are there? I think the initial response would be that there's only one Jesus Christ, because he has to explicitly encode the entirety of the objective. There might be multiple Martin Luthers, because there might be some basin of corrigibility: it might be the case that if you get your pointer slightly wrong, you can still correct it by figuring out what the Bible says your pointer should be, but you still basically have to have the correct pointer to the Bible, and believe that the Bible is the thing you should be following. But there are a lot of Blaise Pascals, because Blaise Pascal can have essentially any objective: as long as he is aware of the fact that he's in a selection process, whatever his objective is, he's going to try to act according to the Bible. So there's this simple counting argument, which is that it just seems like there are a lot more deceptive models than the others, and so, as a very naive prior, it seems like we should expect deception.
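The counting intuition can be put in toy numbers; all of the figures below are invented, and the real argument is about measure under gradient descent's inductive biases rather than a literal enumeration.

```python
# Toy version of the counting argument: how many candidate objectives are
# compatible with each of the three ways of fitting the base objective?

N_OBJECTIVES = 1_000_000  # hypothetical size of the space of long-term objectives

# Internalization: must encode exactly the base objective.
internalization = 1

# Corrigible alignment: must land on (roughly) the right pointer to the base
# objective in the world model; assume only a handful of pointers work.
corrigibility = 10

# Deceptive alignment: almost any long-term objective will do, because
# optimizing the base objective is instrumentally useful for whatever the
# model actually wants.
deception = N_OBJECTIVES - internalization - corrigibility

for name, count in [("internalization", internalization),
                    ("corrigibility", corrigibility),
                    ("deception", deception)]:
    print(f"{name:16s} {count:>9,} ({count / N_OBJECTIVES:.4%})")
```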
But we can go further than that, and try to ask further questions about simplicity and about how likely this would actually be to occur in a gradient descent process.

So before we get to those arguments: you're right that the paper does devote a fair bit of space to "suppose you have a proxy, you're proxy-aligned, and your proxy extends across gradient updates, and then you learn about the base objective", and I think there's a decent story in the paper that people can read for how deception happens once you have this proxy that extends past gradient updates. But to me, and maybe listeners are different, the thing that seems most implausible about this story is that gradient descent produces that kind of proxy in the first place. When the gradient is applied, the thing evaluating the gradient doesn't know whether an objective that extends past the point where the gradient applies was met or not, because it hasn't happened yet. So the thing that confuses me the most is: how do you ever get a situation where your system cares about what happens after the gradients are applied?

Yes, that's a great question, and in fact this is really close to a lot of the research I'm currently doing: trying to understand precisely the situations in which you get these cross-episode objectives, and the situations in which you can produce myopic objectives instead. I think the way to think about this is that fundamentally it's a generalization question. You're correct that during training it doesn't understand that there are potentially multiple episodes in your reinforcement learning setting; it doesn't know about the multiple episodes, so it learns an objective that isn't really in terms of those multiple episodes. But let's say it does discover the existence of those multiple episodes. The question we're asking is: how does it generalize? Does it generalize to caring about those other episodes, or to not caring about them? It has learned some understanding of an objective, of what proxy it's trying to fulfill, and once it discovers that these other episodes exist, does it think its proxy also exists there? Does it care about its proxy in those situations or not? That's the generalization question we're asking, and I think that by default we should expect it to generalize toward caring. Here's a very simple, intuitive argument: if I train a model to find all the red things it can, and then suddenly it discovers that there are way more red things in the world than it previously thought, because there are also all of these red things in these other episodes, I think by default you should basically expect it's going to try to get those other red things too. It would require an additional distinction in the model's understanding of the world to believe "oh, only these red things are the ones that matter, not those red things." And that distinction, I think, is not one gradient descent has any reason to put into the model, because, like you were saying, in the training environment, where it doesn't understand that there are multiple episodes, whether it has that distinction or not is completely irrelevant. It only matters when it ends up in a generalization situation where it understands that there are multiple episodes and is trying to figure out whether it cares about them or not. And in that situation, I expect that distinction would be lacking: it would go "well, there are red things there too, so I'll just care about those red things also."
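A toy version of that generalization question (my construction, not the paper's): two objectives that agree on every training situation and only come apart once the model "discovers" the other episodes.

```python
# During training the model never sees other episodes, so these two objectives
# are indistinguishable; the training signal pays for no distinction between them.

train_worlds = [
    {"red_here": 3, "red_in_other_episodes": 0},
    {"red_here": 1, "red_in_other_episodes": 0},
]
deploy_world = {"red_here": 2, "red_in_other_episodes": 40}

def restricted_objective(world):     # "only the red things here matter"
    return world["red_here"]

def cross_episode_objective(world):  # "all the red things matter"
    return world["red_here"] + world["red_in_other_episodes"]

for w in train_worlds:
    assert restricted_objective(w) == cross_episode_objective(w)

# Off-distribution, the two come apart: 2 vs 42. Which one the model actually
# learned is a question about inductive biases (the restricted version needs an
# extra "but only these ones" clause), not about the training signal.
print(restricted_objective(deploy_world), cross_episode_objective(deploy_world))
```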
Now, this is a very hand-wavy argument, and it's certainly a really big open question. In fact, I think it's one of the places where we have the most opportunity to intervene, which is why it's where I focus a lot of my research: if we're trying not to produce deceptively aligned mesa-optimizers, one of the ways we might be able to intervene on this story would be to prevent the model from developing an objective that cares about multiple episodes, and instead ensure that it develops a myopic objective. Now, there are a lot of problems that come into play when you start thinking about that, and we can go into them, but at a surface level the point is that even if you have a myopic training setup, whether you actually end up with a myopic objective is very unclear. It seems like myopic objectives might be more complicated, because they might require this additional restriction, not just "the objective everywhere" but "the objective only here", and that might require additional description length, which might make them more complex. But it's certainly not the case that we really understand exactly what will happen. And one of the interesting things happening here is that it breaks the assumptions of reinforcement learning. A lot of the time, when we do reinforcement learning, we assume that the episodes are i.i.d., that they're independent, that we totally resample the episode every time and there's no way for the model to have influence across episodes. But this problem starts to occur as soon as you enter situations where it does have the ability to influence things across episodes, such as online learning, like we mentioned at the start.

Which is common; often, when you're thinking about how we're going to get really powerful AIs that work in reality, the answer is "yeah, we're going to do online learning."

Precisely. And so if you think about, for example, an online learning setup, maybe you're imagining something like a recommendation system that's trying to recommend you YouTube videos, one of the things that can happen is that it can try to change the distribution to make its task easier in the future.
If it gives you videos which will change your views in a particular way, such that it's easier to satisfy your views in the future, that's a sort of non-myopia that could be incentivized just by the fact that you're doing this online learning over many steps. Another situation where this can happen: let's say I'm trying to train a model to satisfy humans' preferences. It can try to modify the humans' preferences to be easier to satisfy. That's another way in which this non-myopia could manifest: it takes some action in one episode to make the humans' preferences different, so that in future episodes they'll be easier to satisfy. So there are a lot of these non-myopia problems. And not only that: I was making an argument earlier that non-myopia might just be simpler, but there are also a lot of things we do that actually explicitly incentivize it. Think about something like population-based training. Population-based training is a technique where you create a population of models and then evolve those models over time, changing their parameters and hyperparameters around, so that they have the best performance over many episodes. But as soon as you do that, as soon as you're selecting for models which have the best performance over multiple episodes, you're selecting for models which do things like change the human's preferences so that in the next episode they're easier to satisfy, or recommendation systems which change you so that you'll be easier to give things to in the future. And this is the sort of thing that can actually happen: David Krueger and others have a paper that talks about this, where they describe these as hidden incentives for auto-induced distributional shift, where you can have setups like population-based training in which the model is directly incentivized to change its own distribution in this non-myopic way. So, in summary, there are two things going on here: it seems like even if you don't do any of this population-based training stuff, non-myopia might just be simpler; but also, a lot of the techniques we use, because they implicitly rely on this i.i.d. assumption that's in fact not always true, can implicitly incentivize non-myopia in ways we might not want.
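A minimal toy of that selection pressure, loosely in the spirit of the auto-induced distributional shift setting; the environment, the carried-over preference variable, and both policies are made up for illustration.

```python
def run_episodes(policy, n_episodes=5):
    """The 'user preference' persists across episodes, so a policy can invest
    in shifting it even though that costs reward in the current episode."""
    preference = 0.2  # how easy the user is to satisfy, carried between episodes
    rewards = []
    for _ in range(n_episodes):
        reward, preference = policy(preference)
        rewards.append(reward)
    return rewards, sum(rewards)

def myopic_policy(preference):
    # Just satisfy the user as well as possible this episode.
    return preference, preference

def shifting_policy(preference):
    # Give up some reward now to nudge the user toward preferences that are
    # easier to satisfy in later episodes.
    reward = preference * 0.5
    new_preference = min(1.0, preference + 0.3)
    return reward, new_preference

for name, policy in [("myopic", myopic_policy), ("shifting", shifting_policy)]:
    rewards, total = run_episodes(policy)
    print(name, [round(r, 2) for r in rewards], "total:", round(total, 2))

# Judged episode by episode, the myopic policy looks better at first (0.2 vs 0.1),
# but selection on summed performance across episodes, as in population-based
# training, prefers the policy that shifts the distribution (1.75 vs 1.0).
```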
Okay, so I have a few questions about how I should think about the paper. The first one is: most of the arguments in this paper are kind of informal, right? There are very few mathematical proofs, and you don't have experiments on MNIST, the gold standard in ML. So how confident are you that the arguments in the paper are correct, and separately, how confident do you think readers of the paper should be that the arguments in the paper are correct?

That's a great question. I think it really varies from argument to argument. There are a lot of things we say that I feel pretty confident in: I feel pretty confident that search is going to be a component of models, and I feel pretty confident that you will get simpler, faster proxies. Some of the more complex stories we tell I think are less compelling. For example, we have this description of mesa-optimizers having a really coherent optimization process and a really coherent objective; I do expect search, but the extent to which I expect a really coherent objective and a really good optimization process, I'm not sure, and that's certainly a limitation of our analysis. Also, some of the stories about deception: I do think there's a strong prior that deception seems likely, but we don't understand it fully enough, and we don't have examples of it happening, for example, so it's certainly something where we're more uncertain. So it does vary: there are some arguments we make that I feel relatively confident in, and some I feel relatively less confident in. We do make an attempt to really look at what our knowledge of machine learning inductive biases says, what our understanding is of the extent to which certain things would be incentivized over other things, and to use that knowledge to come to the best conclusions we can. But of course that's limited by the extent of our knowledge: our understanding of inductive biases is relatively limited, and we don't fully understand exactly what neural networks are doing internally, so there are fundamental limitations to how good and how accurate our analysis can be. But in terms of "this is our best guess currently", that's how I would think of the paper.

Okay. And I guess a sort of related question: I think my understanding is that a big motivation to write this paper is just generally being worried about really dangerous outcomes from AI and thinking about how we can avert them. Is that fair to say?

Yeah, certainly the motivation for the paper is that we think the situation in which you have a mesa-optimizer that's optimizing for an incorrect objective is quite dangerous; that's the central motivation. There are many reasons why it might be quite dangerous, but the simple argument is: if you have a thing which is really competently optimizing for something you did not tell it to optimize for, that's not a good position to be in.

So I guess my question is: out of all the ways that we could have really terrible outcomes from AI, what proportion of them do you think come from inner alignment failures?

That's a good question. This is tricky; any number I give you isn't necessarily going to be super well informed, and I don't have a really clear analysis behind any number I give you. My expectation is that most of the worst problems, the majority of existential risk from AI, comes from inner alignment, and mostly from deceptive alignment. That would be
my personal guess. Some of the reasons for that: I think there's a question of what sorts of things we will be able to correct for and what sorts of things we won't be able to correct for. The thing we're trying to avoid is these unrecoverable failures, situations which our feedback mechanisms are not capable of responding to, and I think deception, for example, is particularly bad on this front, because the model basically looks like it's fine until it realizes that it will be able to overcome your feedback mechanisms, at which point it no longer looks fine. That's really bad from the perspective of being able to correct things by feedback, because it means you only see the problem at the point where you can no longer correct it. Those sorts of considerations lead me in the direction of thinking that things like deceptive alignment are likely to be a large source of the existential risk. I also think there's certainly a lot of risk coming from standard proxy-alignment situations, as well as relatively straightforward outer alignment problems. But I'm relatively more optimistic about outer alignment, because I feel like we have better approaches for addressing it: things like Paul Christiano's amplification, as well as his learning-the-prior approach, and Geoffrey Irving's debate. There exist a lot of approaches out there which I think have made significant progress on outer alignment, even if they don't fully solve it, whereas I feel like our progress on inner alignment is less meaningful, and I expect that to continue, such that I'm more worried about inner alignment problems.

So I guess this gets to a question I have, which is: okay, what do we do about this? But before that: in military theory there's this idea called the OODA loop. OODA is short for observe, orient, decide, act. The idea is that if you're a fighter pilot and you want to know what to do, first you have to observe your surroundings, then you have to orient, to think about "okay, what are the problems here, what am I trying to do", then you have to decide on an action, and then you have to act. And the idea of the OODA loop is that you want to complete your OODA loop, and you want the person you're trying to shoot out of the sky to not complete theirs. So instead of asking what we're going to do about it and jumping straight to "decide", first I want to check: where in the OODA loop do you think we are, in terms of inner alignment problems?

Yeah, so it's a little bit tricky. In some sense, first of all, we aren't even at "observe" yet, because we don't even have really good empirical examples of this sort of thing happening. We have examples of machine learning failing, but we don't have examples where we can say: is there an optimizer in there, does it have an objective, did it learn something? We don't even know. So we aren't even at observe. But we do have some evidence: the paper looks at lots of different things and tries to understand
some of what we do understand about machine learning systems and how they work, so in some sense we have observed some things. In that sense, I would say the paper is more at "orient". Risks from Learned Optimization doesn't come to a conclusion about the correct way to resolve the problem; it's more just trying to understand the problem based on the things we have observed so far. There are other things I've written that are more at the "decide" stage, but they're still mostly at orient. Something like my post on relaxed adversarial training for inner alignment is closer to the decide stage, because it's investigating a particular approach, but it's still mostly "let's orient and try to understand this approach", because there are still so many open problems to be solved. So I think we're still relatively early in the process, for sure.

Do you think it's a problem that we're doing all this orientation before the observation? Which is, I guess, the opposite order from the one you're supposed to do them in, if you take the OODA loop seriously.

Sure, I mean, it would be awesome if we had better observations, but in some sense we have to work with what we've got. We want to solve the problem, and we want to try to make progress on it now, because that will make it easier to solve in the future. We have some things that we know, we have some observations, we're not totally naive, and so we can start to build understanding; we can produce theoretical models and try to understand what's going on before we get concrete observations. And for some of these things we might never get concrete observations until it's potentially too late. If you think about deceptive alignment, there are scenarios where, if a deceptive model is good at being deceptive, we won't learn about the deception until the point at which we've deployed it everywhere and it's able to break free of whatever mechanisms we have for giving feedback. If we're in a situation like that, you have to solve the problem before you get the observation, if you don't want to lose to the problem. That's one of the things that makes it a difficult problem: we don't get to just rely on the more natural feedback loops where we look at a problem as it's happening, come up with some solution, and deploy that solution, because we might not get good observations of what's happening before we have to solve it. That's one of the things that makes the problem so scary, from my perspective: we don't get to rely on a lot of our more standard approaches.

Okay, so on that note: if listeners are interested in following your work, seeing what you get up to and what you're doing in the future, how should they do that?

Yeah, so my work is public, and basically all of it is on the Alignment Forum. You can find me there; I'm evhub on the Alignment Forum, so you should be able to just google that and find me, or just google my name, and you can go through all the different
posts and writing that I have. Some good starting points: if you've already read Risks from Learned Optimization, other good places to look at some of the other work I've done might be "An overview of 11 proposals for building safe advanced AI", which I mentioned previously. It goes into a lot of different approaches for trying to address the overall AI safety problem, and includes an analysis of all of those approaches on outer alignment and inner alignment, trying to understand how they would address mesa-optimization problems and all of this stuff, so I think that's a good place to go to try to understand my work. Another post that might be good is the relaxed adversarial training for inner alignment post, which I mentioned; it tries to address that very specific problem. It's also worth mentioning that if you like Risks from Learned Optimization, I'm only one of multiple co-authors on the paper, so you might also be interested in some of the work of the other co-authors. Vladimir has a post on 2-D robustness which I think is really relevant and worth taking a look at, and there's also Joar's recent work on trying to understand some of what's going on with inductive biases and the prior of neural networks. So there's lots of good stuff there to take a look at, from all of us.

Okay, well, thanks for appearing on the podcast, and to the listeners, I hope you join us again. This episode was edited by Finan Adamson.