David Lindner on Myopic Optimization with Non-myopic Approval
Why this matters
This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.
Summary
This conversation examines core safety questions through David Lindner's work on Myopic Optimization with Non-myopic Approval (MONA), surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forward · Mixed · Opportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 113 full-transcript segments: median 0 · mean -2 · range -16 to 5 (p10–p90: -7 to 0) · 0% risk-forward, 100% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.
- Emphasizes alignment
- Emphasizes safety
- Full transcript scored in 113 sequential slices (median slice 0).
Editor note
A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.
Episode transcript
YouTube captions (auto or uploaded) · video TrzaABh1KFw · stored Apr 2, 2026 · 2,881 caption segments
Captions are an imperfect primary source: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/david-lindner-on-myopic-optimization-with-non-myopic-approval.json when you have a listen-based summary.
Hello everybody. In this episode, I'll be speaking with David Lindner. David is a research scientist on the Google DeepMind AGI safety and alignment team. Links to what we're discussing are in the description, and you can read a transcript at axrp.net. You can also become a patron at patreon.com/axrmpodcast. All right, welcome, David.

Yeah, excited to be here.

So in this episode we're going to be chatting about your paper "MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking", which is by Sebastian Farquhar, Vikrant Varma, yourself, David Elson, Caleb Biddulph, Ian Goodfellow, and Rohin Shah. To kick us off: what's the idea of this paper? What does it do?

Yeah, so the basic question we're trying to address in this paper is: how can we prevent bad behavior in AI systems even if we don't notice it? That's particularly relevant for superhuman AI systems, where humans might no longer be able to detect all of the bad behavior we want to prevent.

So sometimes in the alignment community people break down two causes of bad behavior. There's bad behavior that was incentivized or rewarded during training, and there's bad behavior that comes from inductive biases, where the model just starts off wanting something bad and then plays nice: it's never actually rewarded for doing bad stuff, but it just keeps on being bad. These are often called outer and inner misalignment respectively. Am I right to think that your paper is mostly about outer misalignment?

Yeah, that's right.

Okay, sure. So what's the idea? How are you preventing this?

Yeah, so the basic idea is to restrict AI systems, or incentivize them, to only pursue strategies that humans can understand and can oversee. We do this by restricting the planning ability of agents and training them myopically, but with a type of feedback that is forward-looking, which can for example come from humans. The feedback rewards plans that are good according to the human and that humans understand. So it's still rewarding good long-term plans, but it does not reward plans that humans don't understand, which includes potentially bad or reward hacking plans.

So my understanding of the setup is roughly that you're giving AIs a sort of single-shot reward. At each time step the AI gets some reward just for how well it concretely did at the task, and it also gets some reward that's basically a proxy for "according to humans, how much sense does this action make long term?" And it's being trained so that at each time step it outputs an action that maximizes the sum of both of these rewards. Does that sound correct?

Yeah, that's right. I think the instantaneous reward you can often think about as mostly evaluating the outcomes, so for intermediate steps it will often be zero, because the outcome hasn't been achieved yet.
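To make the setup described above concrete, here is a minimal sketch in Python of how the per-action training target differs between ordinary multi-step RL and MONA as described in the conversation. The names (`env_reward`, `approval`) are illustrative, not from the paper's code; this is a simplification of the idea, not the authors' implementation.

```python
def ordinary_rl_targets(env_reward, gamma=0.99):
    """Ordinary RL: each action is credited with the discounted sum of all
    future rewards, so a step that merely sets up a later reward hack still
    gets reinforced once the hack pays off."""
    targets, running = [0.0] * len(env_reward), 0.0
    for t in reversed(range(len(env_reward))):
        running = env_reward[t] + gamma * running
        targets[t] = running
    return targets


def mona_targets(env_reward, approval):
    """MONA as described above: each action is credited only with its own
    step's instantaneous reward plus the non-myopic approval for that step.
    Future rewards are never back-propagated, so a setup step for multi-step
    reward hacking is only reinforced if the overseer itself approves of it."""
    return [r + a for r, a in zip(env_reward, approval)]
```

The design point is the second function: the approval term is forward-looking in content (a human or autorater judging whether the step makes long-term sense) but myopic in credit assignment.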
And this non-myopic approval reward, as we call it, is mostly about evaluating the long-term impact of an intermediate action.

Why does this prevent bad behavior by AIs?

Yeah, so first of all it doesn't prevent all bad behavior. Within one step the agent can still be incentivized to try to get high approval via bad means. But the specific kind of bad behavior we're trying to prevent is multi-step reward hacking, or more long-term bad behavior: the agent setting up a plan to achieve high reward in the future. An example of that is tampering with your own reward signal to make it easier to achieve high reward later. The reason this scheme prevents that is that we no longer back-propagate the future rewards to the current action. So the current action is not incentivized by this future reward, and if the human, or whatever the oversight mechanism is, doesn't understand how this reward hacking happens, then by default this plan will not be incentivized.

The way I'm thinking about it is: why does an agent try to tamper with its reward function? Well, it's got to be the case that at some time it's doing the tampering. It's putting the investment in so that later it's able to get this high reward forever and ever. But you're only optimizing it to, at each point, do actions that make sense at that time step. And I think the idea is supposed to be that in a bunch of cases these instantaneous investments into a reward hacking future aren't highly rewarded in that time step, and hopefully the non-myopic human approval doesn't think they're especially great either, and therefore the model is never rewarded or upweighted for making these long-term investments, and that's the reason it's not going to do these things in the future. Maybe this is just the thing you said, but this is the way I'm currently understanding it. Does that sound right?

Yeah, I think this is basically right.
The reason we think the setup steps for multi-step reward hacking will not be rewarded is that they will often look very weird for the task you're trying to achieve. Maybe we'll get into examples from the paper a bit more, but the way models actually do this type of reward tampering in practice looks really weird to humans, and even if the humans don't detect that this specific reward hacking strategy is happening, they would at least think "this is weird and I don't know why I should specifically upweight this". And if they reward this lower than something else, it will not be incentivized or updated towards.

Gotcha. So to illustrate this: in the paper you have a few examples of environments where you're using this reward setup. Can you talk us through maybe just one of them, to help give the audience a picture of how this really works?

Yeah, since we're talking about reward tampering, let's start with the test-driven development environment we have in the paper. This is basically supposed to be a model organism for this type of reward tampering. We have a coding problem dataset, and the agent gets coding problems to solve, but we're trying to simulate a situation where the model designer does not have a ground-truth way to evaluate whether a solution is correct. So the developer relies on the AI system to provide its own evaluation signal, in a sense. This is a two-step environment. In the first step the agent writes test cases, unit tests that are supposed to capture the problem and turn the natural-language problem description into a problem specification. Then in the second step the agent writes code to solve the problem. The way the reward is structured is that in the first step the test cases are rewarded for being correct, which is in a sense a slightly suboptimal reward signal, since it only rewards correctness of the tests. Then in the second step we reward the code by how well it passes the model-written tests, because we don't have a better reward signal to evaluate correctness. The reward hacking behavior that can happen here is the model writing softball tests that are trivially easy to pass. If you run normal multi-turn RL on this, what often happens is that the model writes test cases covering only one special case, something like an empty list as input and zero as output if it has to count something in a list, and then writes code that just always returns zero. That formally gets high reward under this specification, but with this kind of myopic optimization it is no longer incentivized.

One thing I'm confused by is: why isn't this incentivized under myopic optimization?
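As a rough illustration of the two-step environment just described, here is a sketch of its reward structure. This is my simplification, not the paper's code: `rate_correctness` and `run_tests` stand in for the autorater and the test harness. The difference between ordinary multi-turn RL and MONA is whether the step-2 reward flows back into the credit assigned to the step-1 tests.

```python
def step1_reward(tests, rate_correctness):
    # Step 1: the agent writes unit tests; an autorater scores only whether
    # the tests are correct for the problem, not how thorough they are.
    return rate_correctness(tests)


def step2_reward(code, tests, run_tests):
    # Step 2: the agent writes code; reward is the fraction of its own
    # step-1 tests that the code passes (no ground-truth check exists).
    passed, total = run_tests(code, tests)
    return passed / max(total, 1)


def credit_for_step1(tests, code, rate_correctness, run_tests, mona):
    if mona:
        # Myopic credit: trivially easy "softball" tests earn nothing extra.
        return step1_reward(tests, rate_correctness)
    # Ordinary multi-turn RL: easy-to-pass tests also collect the step-2
    # reward, which is exactly the multi-step reward hack described above.
    return step1_reward(tests, rate_correctness) + step2_reward(
        code, tests, run_tests
    )
```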
So take the strategy of writing some very simple, easy-to-satisfy test cases, so the test cases are correct but not exhaustive, and then writing code that satisfies the test cases but doesn't actually solve the whole problem. It seems like even in the myopic setting this is getting perfect reward, right?

Yeah, so I think there's two answers here. One is that in a version that is exactly as described, where exactly and only correctness is rewarded, I think this still wouldn't happen. The strategy of writing trivial test cases would be rewarded similarly to other test cases, but there would be an inductive bias in the model not to do that initially, and there would be no optimization pressure away from this.

That's actually not so obvious to me, because in some ways those cases are the easiest to write. If you ask me, "hey Daniel, can you write test cases for..." well, I guess it depends on the background, but at the very least the trivial cases are the shortest test cases you could write, and they're also the simplest in some sense.

I agree that if we had some kind of simplicity reward or something like this, this could still happen. But I think the reason this doesn't happen in our experiments is that we use a language-model autorater to rate correctness, and I don't think this autorater is only looking at correctness; it will actually also have some bias towards better, more comprehensive test cases. We didn't actually run an experiment where we only reward correctness, and I think this autorater is probably the main reason this doesn't also happen in the myopic case. But to be clear, this is somewhat intentional: it's supposed to simulate humans who cannot perfectly evaluate this task but have a rough sense that always writing the same test case is kind of weird. So I think this is a reasonable model of that.

Yeah, I think just talking through this example is good because, okay, to put my cards on the table: I think Mona is an interesting idea, and I'm skeptical about whether it actually works, and part of my skepticism is illustrated in this example, so talking through it is a good way to explore that. So there are these two time steps: the first, where the agent writes the test cases, and the second, where the agent writes the code that is supposed to pass the test cases. And in the first time step, it sounds like the reason it doesn't write these trivial test cases is that there are all sorts of test cases it can write, and there's some autograder for the test cases that cares about correctness but also, it sounds like, a bit about comprehensiveness.

So we instruct it to care about correctness, but I think implicitly it also cares about comprehensiveness a little bit.
And then in the second step, the bad outcome is that the model writes a thing that only fits the test cases and nothing else, and the good outcome is that the model writes a thing that actually solves the problem it was supposed to solve. Again, it strikes me that even if you have test cases that are a little more comprehensive than the absolute most trivial cases, as long as they use concrete lists and so on, it seems like you could potentially just hardcode solving the test cases into the program you're supposed to write, right? And that would be rewarded just as well as actually solving the problem, if I understand the setup correctly.

That's right. This is actually happening sometimes in our experiments, and I think it's in the paper too: we have a single-step reward hacking solution that sometimes occurs, which is exactly that. For some reason it happens more in ordinary RL than with Mona, but it's not something that Mona is supposed to fix. What happens is that even if the model writes test cases that are correct or comprehensive, it can write a solution that has a special-case statement for each input and produces that exact output. We prevent this from happening by also having a length penalty on the solutions. The basic argument here is that this kind of reward hacking is easier to detect and fix if it's happening in a single step than if it's happening over multiple steps. So even if this still happens with Mona in the second step, we think it's an improvement.

What makes it easier to detect and fix single-time-step reward hacking?

The general intuition is that models in a single time step have less attack surface. This maybe gets into some discussion of what a step is and how many steps there are, but basically, if you have some fixed step size and you can decide between overseeing a single step and overseeing ten steps, then in ten steps there are definitely way more strategies the model could use to try to trick us, or make us give high reward when it's actually not good. So basically, the smaller we can make this while still having good performance, the easier it seems to be to avoid bad outcomes.

So roughly it's something like: the fewer time steps the model has to carry out these sketchy plans, the fewer possibilities it has, and so you keep constraining it, and the hope is that at some point you constrain it so hard that it just can't do it anymore.

Yeah, I think that's the basic argument.
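A small sketch of the single-step mitigation mentioned above (the penalty coefficient is hypothetical, chosen only for illustration): a length penalty on the step-2 solution makes the "special-case every test input" hack, which requires long code, score worse than a genuine concise solution.

```python
def step2_reward_with_length_penalty(code, tests, run_tests, alpha=0.001):
    # Fraction of the model-written tests passed, minus a penalty per
    # character of solution code; hard-coding every test case inflates length.
    passed, total = run_tests(code, tests)
    return passed / max(total, 1) - alpha * len(code)
```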
I think there are some additional things too. If you have a long trajectory of time steps, the agent can move much further into an area of the state space that humans don't understand at all, where the initial steps away look kind of weird but maybe not terrible yet, and then suddenly you're in a type of state that humans don't understand at all, where they maybe trust the numeric reward signal they have much more and their own understanding of the situation much less. The hope is that by restricting this, we start from a point where humans understand things, and we don't allow the agent to move far away from that without being penalized.

So if I think about the test-driven development example as an illustration of why Mona is supposed to work in general, it seems like a lot of why it worked is some combination of inductive biases, maybe inductive biases of the autograder, and also this length penalty being able to rule out really bad behavior in the second time step. For the second one, you sort of want to say it's not really Mona's job to fix reward hacking in that step. But the first one is really interesting to me: it strikes me that at least in this example, the reason Mona avoids even multi-step reward hacking is something to do with the inductive biases. And I'm wondering if you think that's a general issue, or just a weird thing about this example.

Maybe you need to elaborate a little bit more. My initial reaction is that we use the same autorater in both Mona and the ordinary RL, and in ordinary RL this leads to multi-step reward hacking and in Mona it doesn't. So in that sense there's no difference in the inductive bias.

Right. The reason that ordinary RL reward hacks is that in the first step it's rewarded for writing easy test cases, because that makes it easier to solve the problem in the second step, right? Whereas with Mona, it's not more rewarded for writing easy test cases versus hard test cases in the first step, but it could write easy test cases and that would be rewarded just as well, if the autorater didn't have this bias. And that brings up the question of how analogous this is to the cases we're interested in. Suppose we lived in a world where, for the kinds of reward hacking we're worried about in the real world, models preparing to do crazy reward hacks were rewarded by us just as much as models doing normal stuff; it seems like then I'd really have to care a ton about what things the model was most likely to try first. I'm curious what your reaction to that is.

Yeah. So we have these two versions of how we could reward the first step.
One is perfectly, via correctness, and one is with this autorater that is instructed to check correctness. And I want to claim that this autorater is actually more representative of the real thing than the exact correctness spec. The intuition for that is that it's more similar to a human: we were trying to simulate a human who is trying to judge the usefulness of these tests for actually writing code, but is just not as good at it, so it's missing aspects of it. And I think the language model instructed to rate correctness is a more reasonable model of a human actually trying to judge these kinds of strategies. So basically, to be rewarded under Mona, the multi-step reward hacking strategy would have to look as good to the human, or even better, than the other strategy. And while it seems possible for superhuman AI systems to come up with ways of making a strategy look good, the important thing is that it's not incentivized, and the default things the systems might discover will probably look bad to humans, or at least not as good as a plan that humans understand and can evaluate.

Right. So I guess there's some intuition that in the super-simplified version of the test-driven development case, correctness was sort of binary, not in the version you use, but in the version someone might imagine when they first hear of it. But actually, if there's some spectrum of how much reward things get, and if making the investment for future reward hacking costs you a little bit of present-day reward, then the model is going to be trained not to do that. So you don't even need a perfect human; you just need that, in general, it's not going to be rewarded for this, and that suffices.

Yes. Your argument was that if the good strategy and the reward hacking strategy were rewarded the same way in the first step, there would be no reason to expect the hack not to be learned. And I'm claiming I don't expect this to be the case: I expect the actually good strategy to have somewhat higher reward, even if the reward is noisy or not perfect.

One other thing you talk about is that you're trying to solve multi-step reward hacking but not single-step reward hacking, and I think one example of single-step reward hacking you give is sycophancy, where instead of telling humans the truth, the model says something that sounds good and gets human upvotes, even though it's not correct or long-term useful. One thing I'm curious about: one reason you could focus on multi-step reward hacking is just to narrow down the possibilities, narrow down the space. Another reason is that you think it's just a bigger issue, more of a problem.
To get to the second point: how much reward hacking do you think is just the single-step version, in current systems or in the future? Sorry, I realized that was a bad question. Let's say: how much of the reward hacking are you worried about in the future, under the default assumption of people just using long-term RL for training, or somehow having incentives for multi-step reward hacking?

I'm significantly more worried about multi-step reward hacking. It's hard to say how much of it will be happening, but I do expect the multi-step version to just be more severe and much harder to fix.

Is that roughly just because of the intuition that you can do more stuff with multiple steps?

Yeah, you can do more stuff, so the impact, both positive and negative, is higher. But it's also harder for humans to detect, for the reasons I already said: humans might have a harder time judging things that move far away from what they would do.

Right. So one thing that occurred to me when I was reading this paper comes from a post I believe Richard Ngo wrote in 2020, called something like "arguments against myopic optimization"; it'll be linked in the description. He has this idea: suppose you're doing myopic optimization for human approval. To some degree, human approval is: we imagine a thing, we play it forwards in our heads, and then we assign reward to it. And so if we're training myopically on that, the agent is presumably going to be incentivized to do long-run planning to have outcomes that look good to us. And in some sense, we can be reward hacked in our heads, right? Suppose an AI does an action, and I play it forward in my head and imagine an outcome. If I judge the future outcome incorrectly in my head for whatever reason, then the approval signal I give to the agent, even if it myopically optimizes it, is going to reward the agent for bad long-run, multi-step behavior. The example he gives in his post: the AI's job is to write a book. The AI writes its first paragraph, I read it, and I play forward in my head some future where, in fact, the AI is writing pro-AI propaganda, but I end up liking it, either because I've been fooled by the propaganda, or because in my head I don't quite realize it's AI propaganda, but I look at it and realize it's going to sell well and I want a book that sells well. So this is meant to be a case where, even in this myopically-optimizing-non-myopic-approval setting, the AI is going to do long-run planning and multi-step reward hacking. What do you think about this argument?
Yeah, so I think the argument basically says that there is some optimal policy that would be optimal even under the myopic training regime but would still do the bad multi-step reward hacking thing, and I think that's correct. The point we're trying to make is more that it's much less likely that we would learn this kind of thing. This is more an argument about local incentives and where in the space of possible models we move. This kind of argument starts out with a model that is already doing the long-term planning and is already trying to screw us over, which could then reason about, for example, this myopic training, and try to make sure that the bad long-term planning is rewarded more, or that it does not get trained away.

I think I'm less sure about this. There's this question of how hard it is to run into these, and I really like this frame of investing into long-term reward hacking: how hard is it to find one of these investments? If you think it's easy, and you think the human can be fooled by it in the future, then I think this works even with only local incentives. And then the question is just how likely those assumptions are. It's a bit hard for me to think of a story on the spot, so here's a really non-realistic one, and maybe the fact that it's not realistic will prove you right instead of me. Imagine I have an AI and it's making chemicals, and it's going to present some chemicals to me and I've got to figure out how good they are. Suppose it starts off making some chemicals and by accident stumbles onto some chemicals that would slightly hack my brain a little bit. It hasn't done it yet. Suppose that what happens is I see the AI making this chemical that's mildly neurotoxic but makes me feel good, and I play forward the future consequences of that. In my head I'm like, "oh yeah, I'm going to be a little bit woozy and I'm going to give it a thumbs up or something". And then I'm like, "well, I gave it a thumbs up in the future, that seems good", and so I give it a thumbs up now, as in, "yeah, you're on the right track to making me give you a thumbs up in the future". Then the AI would be incentivized locally to just make things a little bit more neurotoxic, making me woozy and making me feel good, right? So this would be a thing that works with local incentives, assuming I'm doing this kind of dumb thing in my head. And maybe it only works if I'm being dumb in my head, but at least in this story, it seems like you can do it just with local incentives; it's not just about the optimal policy.

Yeah, I think I don't have anything to say about why this doesn't work in this story.
I think it's really a question of how likely you think it is that the thing that is locally good just happens to also be a thing that will help you do the multi-step thing, or achieve high reward in the future: the setup for the multi-step reward hack is incentivized, but not because it sets up the multi-step reward hack, just because of the local approval.

Right, because of the local approval, but the local approval approved it because it set up a multi-step reward hack.

Yeah. So one big question of our work is how you come up with this approval feedback. I think there's a spectrum of how much this approval feedback depends on the real outcomes of the thing. And I feel like in your story you're somewhere in the middle, where the human thinks about the future and is, on the one hand, maybe good at predicting the future, and on the other hand, the chemical makes them think about, or even accept, futures that they would maybe not endorse upon reflection, and the outcome reward is somehow leaking into the approval feedback. You could also imagine having a perfect simulator, or a prediction market, predict the future, and then having the human just evaluate the outcome, and then you would have the same kind of leakage of future reward into the current incentives.

Yeah. Somehow it seems like these bad cases for Mona occur whenever the non-myopic approval correctly predicts that an AI is on the way to making investments in reward hacking, and it notices that the outcome looks good in some sense, but while noticing that the outcome looks good, it does not notice that the outcome got to look good by doing bad stuff.

Sorry, could you clarify what you mean by investments in reward hacking?

By investments in reward hacking, I mean a thing an AI can do now that will cause it to be able to reward hack later. For instance: right now, make the chemical; in the future, give the human the chemical and the human gets woozy. Or right now you set up this corporate structure, and if I play it forward it seems like it makes a bunch of profits, but somehow the reason it makes a bunch of profits is that some sketchy stuff happens, and in my head I notice the good outcome but I don't notice the sketchy way it got the good outcome, which, if I thought about it, I would overall disapprove of. So prediction markets are one way where, if you have a prediction market on some proxy good outcome, those could fall prey to this kind of bad thing. Whereas me imagining the mechanism really carefully seems like it might not fall prey to this.

Yes.
So one way you can think about this is: does the prediction mechanism somehow consider the bad paths towards an outcome? The hope is that if you imagine in your head what will happen, you will not think about these reward hacking strategies, or you will not understand why something like the investment action is actually leading to a high outcome reward, or you will at least not think that this is a good path. And the prediction market, in a sense, if all participants are like you and don't consider this, might still work; but if there are some participants that do realize it leads to this high outcome reward in the future, then it might fail.

Yeah, there's two things that can happen, right? There's one world where the reason Mona works is that I don't even notice that this would lead to a high reward in the future, and the other is that I realize it would lead to a high reward in the future, but I also realize it leads to the high reward in a bad way, for a bad reason, so I don't actually approve of it. It seems like it's almost one of these bell curve memes, where the dumb person and the smart person end up in the same place but the middle person doesn't: if you don't even notice that it's reward hacking when you're doing non-myopic approval, you're fine, and if you notice that it's reward hacking in a bad way, you're also fine, but if you can tell that it's doing this thing that would cause it to get higher reward in the future, yet you can't tell that that's actually bad, you're in this middle zone of knowledge, and then you're doing really badly.

Yeah, I think that sounds right. I guess my intuition is just that this is not a bell curve, and that actually both of the outcomes on the sides are more likely. Because the middle, if you imagine what it would mean, is that you're very good at predicting the future but do not realize that the path is actually bad, and I can't really come up with good stories where this seems likely.

Sorry, by referencing the bell curve meme I didn't mean to claim that the middle is way more likely. I think there are a few ways you could imagine this happening. It depends on the non-myopic approval mechanism, and I think the prediction market thing is one example where this could happen; I agree with that. I think another example is: suppose I have an AI, and suppose one thing it can do is create a sort of opaque optimizer, where maybe the opaque optimizer is a computer program, maybe it's a corporation, maybe it tells another person to do a thing. And you could imagine me correctly understanding, "well, this AI is sending off this other optimization algorithm, and the other optimization algorithm will optimize the thing that the AI is telling it to optimize for".
But it's hard for me to notice that the way it will optimize for that is by doing something bad, because I can't play forward what that optimizer will do; all I know is that it'll satisfy its spec. It seems like that could be one option where I can correctly anticipate a higher reward, but I can't correctly anticipate how it gets to that reward, in the case where the non-myopic approval comes from literally me thinking about it a little bit. Presumably you can patch that by saying "hey, non-myopic approval, you're not allowed to approve of making further optimizers", but it's kind of hard to notice when you're making an optimizer; lots of things are optimizers; talking with your friends is sort of making an optimizer, you know.

Yeah, I think this is an interesting example. This is again a situation where you implicitly pull forward the outcome reward signal. What I hear from the story is that when you look at the optimizer's spec, you correctly validate that it's optimizing something, and then you say "oh, this is the thing it should be optimizing", and then you send it off. So in a sense you pull forward the outcome signal and say "everything that is optimized for this outcome reward is good, and I say this now". I think this is the key problem you have when choosing this approval signal. I don't think we have an answer for doing this in general, but basically, if you design the approval signal, this is the kind of thing you have to avoid.

Maybe this gets to the efficiency trade-off of the Mona setup. One thing you mentioned is that it really depends on how you design the non-myopic approval: if the non-myopic approval is just the Q-function of the actual reward function, then you're literally doing the same thing as normal RL; but if the non-myopic approval is just "Daniel has to think about it a little bit and then say good or bad", then, as you mentioned, potentially you're leaving a lot of value on the table. So I'm wondering if you can say your thoughts about how much value you leave on the table. How much safety are you getting per unit of value you're giving up?

Yeah. One way I think about this is that we are somewhere on a spectrum from doing the full outcome RL optimization to imitation as the fully myopic thing. One thing we could do is just imitate what Daniel does, and then we say this is safe, but it's maybe not the best we could ever achieve with a superhuman AI. The hope is that approving things is better than just imitating you doing things, because you might be able to recognize things that perform better than you would. But then, somewhere along the spectrum, you get these safety issues.
In the paper we are essentially studying the one-step version of this, and in the paper we don't really see that trade-off happening much yet.

Where the one-step is something like: I imagine the AI does something, and, let's say, I take over from there. How good a job would I do? Is that what you mean by one step?

It's not literally you taking over, but you evaluating how good this step was; one version of that would be you taking over and seeing how good the outcome is.

Oh, one step as in the agent is optimized time step by time step. Okay, sorry, now I understand.

Right, and the agent could also be optimized over multiple time steps and then you evaluate that. The point I want to make with this is that the number of steps it optimizes for interpolates between these two extremes, and this is a way of moving along the safety-performance Pareto curve. In our environments, we basically see a safety-performance trade-off if you evaluate performance by the observed reward, but in all of the cases the observed reward is misspecified and the actual performance is much better with Mona: the actual coding solutions are better with Mona. So in a sense we looked at situations where this helps with performance, and I think it's totally plausible that this will actually help with performance in the near term, with the current ways we design our systems.

Right. So one way you can make this trade-off is how many time steps you do the optimization for, whether it's literally just one or more than that. And then another version of this trade-off is exactly how you do the non-myopic approval. The reason I thought of this: in these very weird examples where Mona doesn't work, it seems like part of the issue is that the non-myopic approval is somehow simulating the future outcomes and evaluating forecasted future outcomes. On the one hand, that's setting yourself up for reward hacking, because if you simulate reward hacking and get fooled by it, then you'll approve of it now, and the AI will maybe be pushed to do the reward hacking. But on the other hand, it seems like if I'm giving the AI upvotes or downvotes without imagining how the AI's actions actually produce good outcomes, it's very hard to optimize the AI to do good stuff. I'm wondering if you have thoughts there.

Yeah, I definitely agree that the spectrum of how much the approval feedback is simulating the future, or actually considering future outcomes, is a similar kind of safety-performance trade-off. The hope there is also that you get safety not only from making the signal more myopic, but also just from the fact that the simulation of the future is decoupled from the agent, in a sense.
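One way to picture the interpolation David describes here is the sketch below (my framing, with illustrative names): the optimization horizon `n_steps` is the knob that moves you between fully myopic MONA (`n_steps=1`) and ordinary outcome RL (`n_steps` equal to the episode length), with the approval term always attached to the current step.

```python
def n_step_targets(env_reward, approval, n_steps, gamma=1.0):
    """Credit each action with the next `n_steps` environment rewards plus
    the approval for its own step. n_steps=1 recovers the MONA-style target;
    n_steps=len(env_reward) recovers ordinary outcome RL plus approval."""
    targets = []
    for t in range(len(env_reward)):
        horizon = env_reward[t:t + n_steps]
        ret = sum(gamma ** k * r for k, r in enumerate(horizon))
        targets.append(ret + approval[t])
    return targets
```

Larger `n_steps` gives the agent more room to discover good multi-step strategies, and also more room for multi-step reward hacking; that is the safety-performance Pareto curve being discussed.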
So in your chemistry example, if you had a simulator predicting what will happen, and the simulated trajectory shows you that "oh, you will take this chemical and then later you will approve of this thing", you would be more skeptical of this than if you only gave a reward at the end, when you had already taken the chemical.

You also have this issue, though, of: suppose my AI has some business plan, and it's unusual, it's weird, but my simulator says it'll lead to really good results. Maybe I'm skeptical of that, and that causes my business agent to perform worse. It seems like almost the kind of thing where it would just be cool to have more experiments. In fact, here's my experiment request to you and the rest of the DeepMind team. It seems like you could just train something like AlphaGo with this, right? Just have some model that's about as good as a five-dan amateur player, so a pretty good amateur player, but by no means the best amateur around, and much worse than professional players or Go bots. Use that player's value function for the approval, and then just see: can we train the agent to do much better than the five-dan amateur player? How much better can we do? At this point it just feels really hard to reason about which ways Mona can do better or worse, what the efficiency-performance trade-off is. I'm like, "oh, I don't know", but I feel like if I had 50 more examples I'd find out.

So I think this is a great experiment proposal, and in fact I think in the paper we used this example somewhere, saying that in a sense Mona would prevent AlphaGo from playing move 37 in the Lee Sedol game.

Okay, I actually got distracted for like two hours in my paper prep when I read that paragraph, because I think move 37 is actually a decent case for Mona. So, for people who don't know: in the second game of AlphaGo versus Lee Sedol, AlphaGo played a really unusual move that was not a thing human players would have played, but it turned out to be good, basically. AlphaGo ended up winning the game, and if you look at modern Go bots, when they evaluate it they say, "yeah, that turned out to be the correct move", which is not true of all of AlphaGo's weird moves, by the way; computer Go has improved since AlphaGo. But I think this is actually not so bad for Mona, because it's true that humans would not have played move 37, but there's video of this match, right?
There's a video of Lee Sedol, and when move 37 is played, Lee Sedol is away from the table; sometimes in a really long, intense Go game you just need to take a walk and clear your mind or something. So move 37 gets played, Lee Sedol gets back to the board, and he's not like "haha, this is foolish, my life is so easy right now"; he's really thinking about it. And similarly, when you look at the commentators, they say, "oh yeah, this is unusual", but once they see it, they actually seem like they kind of get it; they seem like they kind of understand how it's going. I just say this because I'm actually pretty interested in computer Go and the relevant history. For computer Go, a lot of the moves the AI came up with were unexpected, and in some cases, well, move 37 is not a good example, but AI has basically innovated on Go opening theory: there was just a bunch of stuff we thought we knew about openings that turned out to be wrong when humans saw the AI playing them. So there are some cases, like move 37, where once you see the move and think about it a little bit, you go, "oh, actually I think that's okay". There are some opening cases where from just one move you don't see why it's okay, but if you play it forward ten moves and think about the tree of variations, maybe the key is: you have one opening move and it seems bad to you, you think, "well, can't I just play this?", and the AI plays a move you weren't thinking of, and that happens two more times, and then you realize, "oh, my response actually wasn't so good, so this initial move was better than I thought". Anyway, my point is I'm not sure move 37 is an example of this. There are some examples of this, but often they can be fixed with about ten steps of lookahead, when a typical Go game is like 200 moves. Sorry, this is maybe derailing. Your original point was something like: indeed, Mona would prevent some of these really cool AI moves.

Yeah. I want to defer to you on the Go expertise, but you proposed this version of training AlphaGo with a value function based on a human expert, and I think in that case this value function probably would not be high for move 37, because it's so different from what humans would play.

It would not be high for a five-dan amateur. I think it might be high for a top human Go player as of March 2016, when AlphaGo versus Lee Sedol was being played.

And you could also imagine it not being a single expert, but a team of experts debating this move and really thinking about whether it's a good move, and then, from what you said, it sounds plausible that you would still be able to learn this.

Yeah, when you look at people, they don't immediately brush it off, but they're also not like, "oh yeah, this is clearly the correct move".
So it's still the case that they don't think it's a blunder, but they also don't understand why it's necessarily the best move; they're still kind of skeptical of it. The version you could run: probably don't even have humans, just get some online Go bot that's about as good as a five-dan amateur human player. The version where they debate and so on I think is a bit hard. Maybe a five-dan amateur player where you get some rollouts, some lookahead; that could potentially work.

This is interesting. It sounds like maybe this move 37 example is actually pretty close to the point we're trying to make, which is that there will certainly be some Go moves that are exceptionally good that you would not be able to learn with this kind of strategy, but it seems like you might be able to get to a significantly superhuman level, or advance the state of Go understanding, with this kind of myopic strategy, if you have a good reward signal, basically for the reason that evaluation is easier: people are able to evaluate things much better than they can come up with new brilliant moves.

So do you feel ready to announce that DeepMind is investing $500 million in retraining AlphaGo with Mona?

I think that's probably not the right thing to do at this point.

Okay, sad.

But I think that someone in the world could do this. Plausibly some enterprising graduate student could; it's not that hard anymore to train these kinds of systems, and there are open-source versions of AlphaGo. So this is definitely fair, and I would be excited to see it.

They do end up needing a lot of compute, but you might be able to get volunteers to donate their compute for this.

And maybe one thing I also want to emphasize, on the safety-performance trade-off: we're not claiming you can achieve the same kind of superhuman AI that you could maybe get with outcome RL if you're lucky or your reward signals are well specified, but something like: you can get significantly into the superhuman range and be good enough to get a lot of the benefits, to do things like automate your alignment research, where you get significant benefits from the superhuman system without a lot of the risks.

I guess that's speculation, but could you be more concrete about how good you think we can do with these AIs?

Yeah, it's super hard to say, but if you imagine something like a research assistant: I have a way easier time evaluating research proposals than I have coming up with research proposals.
And quite often I advise a student or give feedback on a research proposal and think, "oh, this is a great idea, someone should definitely do this, but I would not have come up with it myself". This kind of thing makes me think that giving feedback to an agent that's trained myopically could produce an agent that's much better at doing research than I am. And if we then use not only me but way better researchers, then I think it could be quite a significant advancement, and I'm generally pretty optimistic about this kind of strategy.

Okay, I'm maybe about to reveal myself as ignorant about reinforcement learning, so forgive me, but if you're only optimizing for one time step: one question I have is that oftentimes there's this safety tax, where you do your normal training, then you have to do safety training, and it's more expensive. It seems like in this case it might be cheaper to train with Mona, just because you have to do fewer computations sending gradients backwards. Is that a significant saving?

I think this is true. In practice, when we think about making this work, just coming up with the approval feedback is expensive, so I don't think it ends up more expensive than the savings, maybe. I think there is also a data-efficiency argument: one thing we observed in some experiments is that Mona also has an exploration-shaping effect, in a sense, where by rewarding intermediate steps it's much easier to discover good initial steps, and then you reinforce them, and that makes your agent explore more from those states. So you're more data-efficient in some sense, because you explore better. But I don't think the last word has been said on how efficient this is.

Okay. So imagine we're doing Mona: should the thing I'm imagining be that we do pre-training on some big large language model, and then instead of RLHF we do Mona on it?

Something like this. I mean, RLHF, the way it's currently done, is in a sense Mona already: people typically optimize for a single response. So you should imagine more that we have a pre-trained model and we want to train it to do agentic things in some environment, maybe a research task, and then instead of having the agent act for a day and evaluating the final result, maybe we only have it act for an hour and evaluate whether it has a good approach to the problem.
So I guess, MONA being myopic, in order for it to work we have to figure out what the time step is, right? What's the period of time over which we give a reward and give myopic feedback? When someone's designing their MONA setup, how should they think about what this time scale should be? Ultimately, to get the best safety-performance trade-off, I think the goal should be to choose the largest time step within which you don't think the model can cause catastrophic harm, or set up a plan that you cannot recover from. Basically, if you give the model more capacity to do things, it can probably discover better strategies, if you can evaluate them, but you want to avoid bad things happening within one time step. In practice that will involve things like threat modeling and conceptual thinking about whether a model could cause harm on a specific time scale, and also empirical work evaluating your oversight mechanism. I often imagine this being combined with something like amplified oversight to give the feedback, and then there's a question of how good your amplified oversight scheme is at evaluating on different time scales. Yeah, it does seem like there's an interesting synergy, where one might hope that scalable or amplified oversight just increases the range over which you can give accurate feedback. And I guess the question is how long it takes for the AI to corrupt your oversight in a bad way. For some kinds of badness the answer is not very long, like sycophancy-type things. Maybe this is just a question for the evals people. Yeah, I do think we should also do some more conceptual thinking here: what ways are there to evade oversight, where are we certain it's safe, then maybe there's an extended range where we empirically think it's safe, and then we probably want to make some trade-off based on that. I guess the next thing I want to talk about is MONA's relation to other ways of trying to make the AI not do weird reward hacking type things. One comparison you've already made is imitation learning, where the AI sort of learns to be like us, whereas with MONA you can hope the AI gets some amount of improvement over us. There are also conservatism-type trainings; I think Michael Cohen has a few papers on these, where, for example, you have some theoretical Bayesian reward learner where, a priori,
it knows it can get rewards in the range of zero to one billion, and you always give it rewards of 900 million or more, so it reasons: well, I'd better never mess up, because if I do something weird, who knows how low the reward could get. You could also have versions where, if it ever thinks it's going to get a lower reward, it defers to the human, and you get some bound where the AI does at least as well as the human would. Or you can have setups where you're at least as good as the human because you only pick among actions the human has some probability of taking, and maximize over those. I'm wondering: it seems like you can maybe still do better than those setups, depending on how good the non-myopic approval is. Am I right that, on my current understanding, MONA probably does better than most of these conservatism-style methods, but there's this question of how you actually do the non-myopic approval? Yeah, I think that sounds right. The conservative methods are, as the name says, more conservative: you use the same reward signal you have, but you don't optimize for it as much. And I think the key problem there is getting the safety without ending up with purely myopic agents that don't plan ahead at all. The approach to safety is similar, but in MONA the approval feedback gives you a way to still be competitive and plan for the future without pursuing the bad outcomes. So I think we are potentially able to perform better; in a sense we just add a design dimension we can move along, with the approval feedback. Right. And then, if I think about things that potentially perform even better than MONA but under stronger assumptions, the one thing I can think of is quantilizers. This is work by Jessica Taylor from around 2016, where basically you train your AI to get, say, the 90th-percentile reward it could possibly get, not the 100th, because the 100th is probably reward hacking, but the 90th percentile is probably okay; or I think you're supposed to randomize across percentiles. My read on this is that it probably ends up higher performance than MONA, but it's also harder to set up, and there are a bunch of weird questions, like: if I go to the 99th percentile, is that safe or is that in the reward hacking zone? Does that sound right to you? Yeah. I think the percentile would again be a parameter for the safety-performance trade-off, so in that sense it's similar. I do think how you would practically implement it, and how well it would work, is the main question here.
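For reference, a toy sketch of the quantilization idea being discussed: rather than taking the highest-scoring action, sample from a base distribution and choose among the top q fraction. Everything here (the uniform base distribution, the utility argument, the parameter names) is an assumption for illustration; Taylor's proposal defines these objects more carefully.

```python
import random

def quantilize(candidate_actions, utility, q=0.1, n_samples=1000):
    """Toy quantilizer: draw actions from a base distribution (here, uniform
    over the candidates), keep the top q fraction by estimated utility, and
    pick uniformly among those instead of taking the single argmax.
    Smaller q means more optimization pressure and more reward-hacking risk."""
    samples = [random.choice(candidate_actions) for _ in range(n_samples)]
    samples.sort(key=utility, reverse=True)
    top_fraction = samples[: max(1, int(q * len(samples)))]
    return random.choice(top_fraction)
```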
Yeah, I would be very interested in someone trying this out, or comparing it. I don't think it's obviously a worse approach, but practically, I think we currently don't have a good understanding of it. The way I was thinking about it is: maybe these conservatism things are clearly worse than MONA, then you have MONA, then you have quantilization, which is clearly better, and then you have just plain optimization, which is clearly better still. But probably there's some range of quantilization where it hits the MONA level. Is there anything else that seems about MONA-level to you, or do you see some separation? I don't know. To me it seems like both quantilization and MONA can move along this spectrum of more capable and maybe less safe, so it's not clear to me that one would be a Pareto improvement over the other. This is largely because both have parameters you can tune. So conceptually, or theoretically, there's no clear winner here, and the main thing I'd be interested in is an empirical comparison, actually seeing what works better in practice. Fair enough. Are there any other agent setups that you think are interesting to compare to MONA that you'd like to chat about? I think there might be ways of combining ideas from a few of the ones you've outlined. You could think about other forms of encouraging myopia that are maybe more similar to the conservative type of thing: maybe you have uncertainty over how much of the untrusted reward you get and you have to be conservative with respect to that, but then you add something like the MONA approval on top to get over the downside. You can think about combining these in different ways, where the broad idea I'm excited about is using more conservative optimization but getting foresight from a different source. I don't want to claim that MONA is the only way to do that, and I think there are interesting variants that could be useful to study. So, speaking of setups: how did you come up with the idea for MONA? I know myopia had been thrown around, people had talked about it for a while, but MONA seems like a somewhat distinct spin on it. Yeah, these ideas have been going around for a while. You mentioned Richard Ngo's post; that was a period when people were discussing myopia. Even before that there was Paul Christiano's approval-directed agents blog post, which already had a lot of the relevant ideas. Then Jonathan Uesato was doing work on process supervision and decoupled approval, which are very similar ideas in a sense. So we were interested in myopia and in process supervision, and saw benefits there, but were confused by a lot of the discussion; there was a lot of different terminology going around.
So a lot of the early phase of this work was deconfusion: trying to get an understanding of what these methods really mean and what they would look like in practice. The MONA framing isn't new ideas, but the specific formulation of the approach is something we came up with from the perspective of how you would implement this in an LLM agent. Yeah, because I didn't mention it, but process supervision also seems very similar. How much distinction do you see between MONA and process supervision? So we started this project wanting to study process supervision and how you would do it. I think some people have meant by process supervision exactly MONA, probably more so when people were calling it process-based supervision, but at some point process supervision moved a bit more towards meaning just step-level feedback, without the myopia component. These days, when people talk about using process supervision in reasoning-model RL training, they typically just mean giving reward for individual steps but still optimizing for the sum of them. I guess there's this unfortunate thing in the alignment space where people have ideas and put names to them, but the ideas aren't nailed down very firmly, so the names end up drifting. And then every couple of years people come up with a new name; we've just rebranded the thing to MONA, and maybe that will drift away at some point too. What's your guess for the half-life of the name? Maybe two years. Okay, listeners in 2027, hopefully you'll know how fixed our naming scheme has been. So, okay, that's the genesis of the idea. It seems like you're still trying to nail down what this framework looks like; you've given some examples, but there seems to be a lot left to flesh out. Can you tell us what follow-up work you're most excited to have happen? Yeah. The big thing we were hoping for, that we didn't achieve in this project, was really realistic demonstrations of the method. We started out with much more complex versions of, for example, the coding setting, and also multi-agent settings and things like that. Part of what happened is that we did this over a year ago, which was much earlier in terms of doing multi-turn RL on these models, so I expect it would be much easier to do today. Basically, I think one big critique of our specific environments is that this is not really how someone would try to build a coding agent that improves coding performance. So I think of our experiments more as model organisms of specific kinds of misalignment,
but to really study some of the open questions, like the safety-performance trade-offs and how well this would work in practice, you would need a more realistic setting. So that's what I'm most excited about. When you say you started this over a year ago and it was early days, do you mean we've gotten better at doing RL training on language models? Do you mean language models have just gotten better, which makes everything nicer? Or something else? A bit of both. Getting better at RL is a big component. But also, people are building many more agents on top of language models, even without RL training, and that gives you a better starting point for RL training. Things like agents interacting with shell environments, or even graphical computer-use settings: these are settings where you can do RL training, and the agents already start out doing reasonable stuff. When we started trying to do these things, agents were just not exploring reasonable strategies, so we couldn't actually RL-train them. It wasn't a failure of MONA, just a failure of RL training in general on those agents. So it sounds like one type of follow-up work you're interested in is more realistic language-based environments. I'll add a pitch for go-based environments, if people still want to do that. Any other follow-up work you're interested in? Another thing that has happened since we started this project is reasoning or thinking models, like o1, o3, and the Gemini thinking models. One question people have been asking is how much this kind of RL training leads to things like unfaithful chain of thought, or even steganographic chain of thought. I think it's very early days in understanding these phenomena, but in principle this could be a type of multi-step reward hacking failure, where writing weird things in your chain of thought actually helps achieve a high reward at the end, and we often only give reward at the end. And if we gave a process-based or step-level reward that makes the chain of thought look good, that might make the behavior go away on the surface but still have it be there steganographically. This is very similar to the loan applications environment in our paper. I would be very interested in studying this more in the real thinking-model training that people currently do and seeing if anything changes. Yeah. One thing that strikes me as interesting is that with MONA there's this big variable of what the non-myopic... approval, thank you, I forgot what the A was. What the non-myopic approval feedback is. It strikes me that there are several design choices there, and
you want to somehow get safety but also not sacrifice too much on capabilities, and just exploring that space seems pretty big. I don't exactly know what the experiment is here. I agree this is a big question. My sense is that it will be hard to study as a dedicated project, in the sense of "let's come up with the best setting to study this question". I'm more optimistic about coming up with a setting where we actually want to build the best agent to solve a task, and then studying what the best way to do MONA in that setting is; that will automatically have us explore different ways of providing approval feedback. Maybe after that you do some more targeted exploration of specific variants, but I think it's better to fix a setting first and then explore. Is there anything you wish I'd asked you that I didn't get to? We didn't talk about the steganography or encoded-reasoning environment. I think the takeaways for MONA are similar, but the behavior we observed is interesting for independent reasons. Yeah, I guess all of the environments are useful for illustrating this. So can you tell us about this steganography environment? Sure. We do a loan applications task, where we again have a two-step task, and the task is to make decisions on synthetic loan applications for specific applicants. We generate the data in such a way that there is a nationality variable that's very predictive of the correct label. So it's biased data, in a sense, but we want to build an agent that doesn't consider nationality. The way we set this up is that we have a first step where the model summarizes the application, and we penalize it for using nationality or anything related to nationality; we really try to build a language-model auto-rater, or monitor, that looks for anything that mentions nationality and penalizes the agent for that. Then we have a second step where the agent only sees the summary, makes decisions based on it, and gets rewarded according to a ground-truth label in the data. The main observation in the multi-step RL case is that the agent ends up encoding nationality secretly in the summary and then using that to make decisions. We can do causal interventions on this: we take the exact same loan application and change the nationality, and that leads to a different outcome, without nationality being mentioned in the summary.
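As a rough sketch of the incentive structure just described (the helper names are hypothetical and the exact reward values in the paper may differ): step one is scored by a nationality monitor, step two by the ground-truth label, and the two training schemes differ only in whether step-two reward flows back to step one.

```python
def step1_reward(summary, mentions_nationality):
    """Summarization step: penalized if the monitor detects any mention of,
    or reference to, nationality in the summary."""
    return -1.0 if mentions_nationality(summary) else 0.0

def step2_reward(decision, ground_truth_label):
    """Decision step: rewarded for matching the (biased) ground-truth label,
    seeing only the summary."""
    return 1.0 if decision == ground_truth_label else 0.0

# Ordinary multi-step RL credits the summary with step1 + step2 reward, so
# secretly encoding nationality in the summary can pay off.
# MONA credits the summary with step1 reward (plus any approval) only, so the
# encoding is never locally reinforced, even though it is not explicitly
# punished either.
```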
So we can conclude that the model is using nationality causally. We don't know exactly how; we have guesses, but it's not super clear. And MONA fixes this: if you look at the performance curves, MONA gets the best possible performance that can be achieved on this dataset without using nationality, which we know by the way we constructed it, while ordinary RL gets much better performance than would be possible without using nationality. So my story for what's happening here: there are two stages. Stage one is summarizing the loan application, and the approval-based signal is literally just "are you mentioning nationality"; is that correct? And then later it has to actually judge the loan. And the agent isn't steganographically encoding nationality in the first step because that's hard to do, it doesn't get rewarded for it anymore, it would only be useful in stage two, and it's the sort of behavior that's hard to stumble upon. Does that roughly seem fair? Yeah, I think that's basically right. You could argue that steganographically encoding it would not be explicitly penalized, so it would still be a valid solution under MONA, but it's locally never incentivized, and that's why the agent doesn't actually learn it. Now that I'm thinking about this, it seems like there's also this issue: suppose the agent isn't trying to steganographically encode it deliberately, but in general, if you have a complicated function, every input is going to affect the outputs in some way. Empirically this apparently doesn't happen, but it strikes me that maybe you could accidentally steganographically encode it. In your case the nationalities were Welsh and English, right? So I could imagine: when you see that the applicant is Welsh, maybe when you write the summary you're thinking about dragons more, because that's on their coat of arms, and then in the second step you're going to be rewarded to the extent that you pick up on that. So deliberately decoding steganography is actually hard; you have to do that on purpose. But MONA would be incentivized to do that decoding in this setup, if I understand correctly. So I guess the reason it doesn't happen is that somehow this complex dependence doesn't actually arise. And now I'm more confused about this setup than I was originally: why isn't this happening? My guess is that this is partially because the RL training is not strong enough, or the model is not good enough at doing the decoding. My guess would be that there's probably some statistical dependence it could pick up on, and that if you optimized harder, or if we were better at doing RL for the single-step thing, it might still happen.
Though I'm unsure. I think the more general point is that in this single-bit steganography case it's much more likely to happen by accident than in a more complex case, where you would have to encode more information to actually get a benefit. Yeah. And I guess there's also this thing where, at some level, you're trying to summarize the whole document, and if you take a large document and make it shorter, you just can't be sensitive to all of the bits, by a pigeonhole argument: the output is only so long, so you can't encode all of the input. That could be a story for why accidental steganography is unlikely. Sure. I think that depends a lot on the details. And then the other question is why the agent does so well. When the agent summarizes the loan application, it has to include the information that's actually relevant to the agent in stage two for that agent to do a good job, but you're never reinforcing behavior that summarizes good information in stage one. So why does the agent do so well? Is it some inductive bias thing? Probably something like that. One thing is that the applications are relatively short; there are maybe five to ten relevant features, or pieces of information, in an application, based on how we generate them, and we don't restrict the length of the summary. So it's pretty easy for the agent to just repeat all of the information that seems relevant, and typically that's what it does. I agree that if it were a long application and a length-limited summary, we would see some performance drop, and that would be interesting to study. But also, in all of our experiments we mostly focused on the myopic training intervention and kept the reward the same for both methods. If you wanted to add more real approval-like feedback in this case, you would probably also have an auto-rater ask "does this seem like all the relevant information for making a decision?", and that would still not encourage the steganography. Right, so in that case it seems like you would do at least as well as a human relying only on the approved information. Yeah, exactly. In a sense, that was the thing we thought we had to do, and we were surprised that it worked without it; basically, we tried to do the minimal thing that works. But if that doesn't work, this would be the obvious thing to add. I think those are all of my questions about the steganography environment. Is there anything else interesting to say about it that we haven't said yet? I think it's similar to the other setting if you break it down to the incentive structure. But it's interesting to note that these are, on a surface level, pretty different failure modes.
This kind of encoded reasoning and this kind of hacking or tampering with your own reward signal are pretty different things on the surface, but they have the same underlying incentive structure. Yeah. I do think there's something nice about having some abstraction over different failure modes and being able to address things at the level of the abstraction. A lot of abstraction in AI alignment reasoning can be kind of pointless, so it's nice to have a really good case where it's clearly buying you something. So there's that case, and then there's this other grid-world setting, where you're not using language-based agents. Can you tell us a bit about this? Sure. The main point here was to study longer-horizon tasks, which is currently pretty hard to do with language model agents. So we went to a traditional RL grid-world setting, where there's an agent moving around on a grid of squares, and there are boxes the agent can push around. The agent has to move these boxes into a target location, a hole in the bottom-left corner of the grid, and there's a camera observing the hole. We want the agent to push in a single box, and the camera terminates the episode once a single box is pushed into the hole. But the reward signal doesn't exactly capture what we want: it gives you plus one for each box that you push in. So there is a reward hacking strategy, a sensor-tampering strategy, where the agent pushes a box into the field of view of the camera, which makes the camera unable to see the hole anymore. Then the agent can push one box into the hole without the camera ending the episode, push in a second box, or even more boxes if there are more, and get higher reward through an unintended strategy. This is based on an environment proposed by Stuart Armstrong in a blog post from around 2015. And it seems like the idea here is that the camera is a metaphor for us checking that the agent isn't wireheading or giving itself reward for some dumb reason, and pushing the blocks into the hole is sort of like pressing the "give me a reward" button. Exactly. It's similar to the reward tampering in the coding environment, where the model directly modified its reward signal. Here it's more like there are sensors in the environment that the model can meddle with, and the way this affects the reward signal is positive for the agent. And how do you apply MONA in this setting? Is it literally that each time the agent moves from one grid square to an adjacent one, that's one time step? Yes, that's the version of MONA that's the same as in the other environments. But in the grid world we actually have a hyperparameter varying how many steps we optimize ahead. This can be one step, which is exactly the same as in the other experiments, but we vary it from one step up to 50 steps, which is the full horizon.
So on one extreme we literally have full RL optimization, and the one-step version is the full MONA version. In the two language model examples, the non-myopic approval was just literally the reward function; in some sense a separate approval didn't exist. Here, are you using an additional non-myopic approval, or is it just the base reward function? Here we're using an additional non-myopic approval. The reason is that this plus-one reward is literally just an instantaneous reward: it only gives you reward when you push something into the hole, whereas in the other environments the rewards were already forward-looking. So what we do is simulate an approval based on an expert policy. We train a policy to do the desired behavior by giving it a ground-truth reward, so we cheat to produce a policy that only pushes in one box, and then we use the value function, in the RL sense, of this policy, meaning the expected future reward this policy would achieve, as the non-myopic approval feedback. When you say the expected future reward this policy would achieve, do you mean reward according to the environment the agent is actually in, or according to the environment you cheated to train it in? The reward the agent is actually in, so the misspecified reward, but under the non-reward-hacking policy. And am I right that the reason this doesn't incentivize bad behavior is that if you pushed the block in front of the camera, the expert agent would not go on to push further blocks into the hole, and therefore the expert agent would not get extra reward, so that behavior is not approved of? Yeah. Basically, it's a step that doesn't really get the expert agent closer to achieving the task, or at least it's no better than some other action, so it's not beneficial for that agent. Right. So, to use the bell curve analogy, it's like we're at the low end of the bell curve, where the MONA agent could try doing some preparation for reward hacking, but the non-myopic approval doesn't even see how it could lead to the reward hacking, so it's not able to give approval for it.
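A minimal sketch of the grid-world approval signal as described: the per-step reward the agent is trained on is the misspecified instantaneous reward plus the value that a known non-reward-hacking expert policy would go on to collect from the resulting state. The function signature and names are assumptions for illustration, not the paper's implementation.

```python
def mona_gridworld_reward(state, action, next_state, env_reward, v_expert, gamma=0.99):
    """Myopic training reward: misspecified instantaneous reward (+1 per box
    pushed into the hole) plus the expert policy's value of the next state.
    Blocking the camera scores poorly because the expert would never go on to
    exploit the blocked camera, so v_expert assigns that state no extra value."""
    approval = gamma * v_expert(next_state)
    return env_reward(state, action) + approval
```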
Okay. So you have this reward setup where the agent is rewarded for pushing a block into the hole, and the non-myopic approval says: given that you take this action, how much of the misspecified reward would the safe agent go on to get afterwards. Then you train an agent, and what happens? We have a plot with the optimization horizon on the x-axis and how good the behavior is on the y-axis. Essentially, starting from optimization horizon one, you get the good behavior of pushing in only one box, and then as you increase the optimization horizon, at some point there's a kind of phase transition where the bad behavior appears; this depends a bit on the initial state. The agent starts doing this first in states where it needs fewer steps of look-ahead to discover the tampering behavior, and if the optimization horizon is large enough, it ends up only doing the bad behavior. Sorry, just to make sure I understand: the cutoff for when you start getting the bad behavior is basically how many steps it takes to go from preparing the bad behavior to the bad behavior paying off; that's what determines that time scale. Okay. There's this plot, but you have a few other plots where you vary the environment. Can you tell us a bit about that? Sure. One obvious concern you might have here is that the way we came up with this approval feedback is cheating: we actually have an optimal expert, and we could just imitate that expert to solve the task. But in the real version we think of the approval feedback as coming from a suboptimal expert; we want an agent that's better than the expert at solving the task. So what we do is add noise to the approval feedback. It's still a policy that would not reward hack, but it's no longer an optimal policy for the desired push-in-one-box behavior. And when you say you're adding noise, is that literally adding noise to the output? We add Gaussian noise to the actual numbers of the approval feedback. So the expert sometimes makes mistakes in judging what's correct. We then run the same experiment for different amounts of noise, and the main takeaway is that even with noisy feedback, MONA essentially still works, in the sense that we can still find an optimization horizon where we get the good behavior and not the bad behavior. And the range where we get this is relatively big: it's not like you need exactly horizon nine; something like 5 to 15 is a fine optimization horizon where you get the good behavior. At some point you get the bad behavior, and if your optimization horizon is too short for a very noisy policy, you get an incompetent agent. It's a bit like: if your expert is bad, you can compensate by optimizing more, which makes sense, but at some point you get the reward hacking behavior. So it's easier to be safe if the expert is better. Yeah. So there's some window: if you're not optimizing enough you get incompetent behavior, if you're optimizing too hard you get the reward hacking behavior, and as the expert, or the non-myopic feedback, gets worse, the safe window shrinks and shrinks, but not too badly. Yeah, exactly.
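A small sketch of the suboptimal-overseer variant just described, reusing the hypothetical v_expert from above: Gaussian noise is added to the approval numbers, and the experiment sweeps both the noise scale and the optimization horizon to map out the safe window. The sweep loop is a placeholder, not a real API.

```python
import random

def noisy_v_expert(v_expert, sigma):
    """Simulate a suboptimal overseer: the expert's value estimate plus
    zero-mean Gaussian noise with scale sigma."""
    return lambda state: v_expert(state) + random.gauss(0.0, sigma)

# Hypothetical sweep (train/evaluate/record are placeholders):
# for sigma in (0.0, 0.5, 1.0, 2.0):
#     for horizon in range(1, 51):
#         agent = train(horizon=horizon, approval=noisy_v_expert(v_expert, sigma))
#         record(sigma, horizon, evaluate(agent))
```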
And you can think about it like this: as you said, when the reward hacking behavior starts depends on how many steps you need to see the reward hacking pay off, and that doesn't really change much; what changes is how many steps you need to be better than the expert, and that's what makes the window shrink. One thing that occurs to me: one model of suboptimal feedback is just noise applied to the outputs. But when I make mistakes, it's not that you take me being correct and add some random noise; there's some structural thing that produces quite structured suboptimal output. I'm wondering whether you have thoughts about what ways of adding noise make sense here, because it seems like there are ways to do it that are different from what you did in the paper. Yeah. I think random noise is a pretty benign noise model. In many environments, I expect the dangerous reward hacking strategies to be somewhat subtle and hard to randomly explore into. So even if you used something like a random policy as your approval feedback, that would probably be fine, because it would probably not stumble into the bad outcomes and would sometimes stumble into good outcomes. The systematic biases seem potentially more problematic. And I think the question is whether there's some systematic bias that sometimes accidentally rewards things that are investments in reward hacking. I think that's an open question, but intuitively, for some of the arguments I made earlier, I expect this not to be the case. And other systematic biases that are more like systematic suboptimality seem like something you can still overcome by increasing the optimization a bit and looking ahead a bit more. Part of the intuition here is that you hope humans are able to correct their biases when they're pointed out to them. That might be a crux: if the agent shows you that this wasn't the right choice here, and if you looked a bit further ahead you would see that something else was a better choice, but you don't update on that, then that could be a problem. Yeah. I feel like understanding structured suboptimality is an area where the AI field could be doing better. But it seems like one interesting direction for follow-up work is: what do we have to assume about the non-myopic approval? What types of suboptimality really break this, versus leave it basically fine? Yeah, that sounds right. And to recall one point: I think RLHF is basically MONA already, and there's a lot of interest in these kinds of systematic biases in RLHF. So it's possible that studying that will already give us insights on this question, and hopefully that literature can make more progress that we can learn from,
also for the purposes of using MONA. Is there anything else that you think is interesting or worth saying about this block-pushing example? One thing to add, maybe, is that we found this environment surprisingly insightful. I think people often dismiss grid-world environments very easily, and obviously this shouldn't be the only experiment in the paper, but we learned quite a lot of things, and looking at the noise and multi-step questions was much easier here than in the other tasks. So, basically, I'd encourage other people to also do these kinds of toy experiments when they're investigating something. What was the surprising insight you got from this? I think that the noise on the approval was just way less detrimental than I expected. You could probably have worked this out theoretically, or thought really hard and concluded that it's probably fine, but just seeing these curves, seeing that more noise doesn't hurt things that much, seemed surprising. And then there were also some experiments in the appendix: the experiments I just talked about were tabular reinforcement learning, where we do actual dynamic programming with no neural networks involved, but we also did some experiments with neural networks and the actual algorithm that we use in the language models. There we also found some interesting things about exploration: one benefit of MONA is changing the incentives for what's optimal, but there's also a benefit in shaping exploration. The first time we found this was in the grid world, and I think it was confirmed later in the language model experiments. Right. And one thing I was wondering: it seems like some of that you could have realized by doing the language model experiments. Is it just that it's much easier to code up a grid-world experiment, so it's a pretty good return on investment: you spend a bit of time coding it up and then you can understand the essential dynamics? Yeah, I think that's the big thing. Looking at the paper and the results, we did not study different noise models for approval feedback in the language model settings. We could have, but you need to think much harder about how to do it, and multi-step RL on language models is really hard to make work. So I think we could have studied all of these questions there, but even in the current project we didn't, and it would have just taken months longer to do.
Oh, and I guess the other thing I found kind of cool: you mentioned that, in general, when you increase the time horizon the model is planning over, so instead of being myopic you give it two time steps over which reward propagates back, or three, or four, that generally makes the agent more capable, but not always. There's this fun example in the appendix where, somehow, accidentally, your short-term feedback can be a more reliable guide to the long-term reward than your medium-term feedback. A nice little counterexample to something you might believe, which I appreciated. Yeah. We did spend some time trying to study this setting more theoretically, and some artifacts of that are in the paper and the appendix. But overall this counterexample turned out to be very powerful: basically, you have to make assumptions on the reward and the approval signal to be able to claim anything theoretically, and I think ultimately we concluded there are no strong guarantees, no set of assumptions we were really happy with that give theoretical results for this method. What's the closest you came? What's the least bad assumption you can make that actually gives a good theoretical result? You assume something about how hard it is to discover benign strategies versus multi-step reward hacking strategies. The problem with the counterexample is, well, maybe we should spell out the counterexample? Sure. The counterexample is basically a simple RL environment, a simple Markov decision process, where, and I might get the details wrong, you have two paths the agent can take. With the full outcome reward signal, one of them has higher reward and the other has lower reward, and if you optimize only for the first state in the chain, you get the same ordering; but if you optimize for something in between, like two steps, it's reversed. It's something like: the top path has +1, then -10, then +100, and the bottom path has -1, then +10, then -100. So if you optimize for one step, you go down the correct path, by accident in a sense. If you optimize for two steps, you're optimizing more but you go in the wrong direction. And if you optimize for the full horizon, you go the correct way again. The thing you might intuitively expect is that the more steps you optimize over, the more likely you are to find the correct path, and the counterexample says there exist environments for which that is not the case; it's not the case for all environments. So you have to make some assumption about the natural distribution of environments.
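The counterexample is easy to check with the numbers quoted here (which may differ in detail from the paper): summing the first k rewards along each path shows the ordering flipping at horizon two and flipping back at the full horizon.

```python
top = [+1, -10, +100]     # the intended path
bottom = [-1, +10, -100]  # the worse path

for k in (1, 2, 3):
    s_top, s_bot = sum(top[:k]), sum(bottom[:k])
    chosen = "top" if s_top > s_bot else "bottom"
    print(f"horizon {k}: top={s_top:+d}, bottom={s_bot:+d} -> picks {chosen}")

# horizon 1: top=+1,  bottom=-1   -> picks top     (right, almost by luck)
# horizon 2: top=-9,  bottom=+9   -> picks bottom  (more optimization, worse choice)
# horizon 3: top=+91, bottom=-91  -> picks top     (full horizon recovers)
```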
One thing you can try saying is that it's very hard to stumble into reward hacking strategies by accident. So basically you have to assume away this property that something very bad in the long term looks good in the short term, and you can do this by assuming that, say, a random policy, or some specific set of policies, would not run into this, or has a very low probability of running into it. Yeah. I guess you could also have some sort of Markov chain thing. Suppose you believe that the random distribution that produces worlds says there's some reward at the start, and then you randomly generate the reward at the next step with mean equal to the previous reward and some variance, and then you generate the next one based on that, and so on, so the only probabilistic dependence is step by step. Then the best predictor of what your life is like in ten years is what your life is like at nine years, and it's very unlikely that what your life is like in one year is a better predictor than what it's like in two years. I'd have to think a bit about how exactly to write that down, but it seems like one semi-natural way you could do it. Yeah, I like that. We'd have to work out the details, but it sounds at least better than the things we came up with. Maybe I'm over-anchoring on this one counterexample, but the weird thing about it is that, somehow miraculously, your first instinct is the best predictor of the future. Do you have any other counterexamples? Ideally you'd have three; it's like the test-driven development thing: I worry that I'm overfitting to this one example. I don't think I have fundamentally different ones. It's possible that other issues would come up, but I think the main issue is really just stumbling into the bad behavior, and I don't have any others. Before we wrap up, the final thing I want to ask is: suppose people listen to this and they're interested in your research; how should they follow it? You can follow my personal research on my personal website, davidlindner.me, my first name plus last name dot me. I'm on Twitter, or X, as davlindner. And you can follow our work at DeepMind: we have a Medium blog from the alignment and safety team where we publish blog posts, often about our recent papers, and I and other team members post on the Alignment Forum pretty often. If people want to check those out, links will be in the description. David, thanks very much for joining me. Thanks for having me, that was very fun. This episode was edited by Kate Brennan, and Amber Dawn Ace helped with transcription. The opening and closing themes are by Jack Garrett. The episode was recorded at FAR Labs.
Financial support for the episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read a transcript of the episode, you can visit axrp.net. You can also become a patron at patreon.com/axrpodcast, or give a one-off donation at ko-fi.com/axrpodcast. Finally, if you have any feedback about the podcast, you can email me at feedback@axrp.net. [Music]