Side Effects with Victoria Krakovna
Why this matters
Auto-discovered candidate. Editorial positioning to be finalized.
Summary
Auto-discovered from AXRP. Editorial summary pending review.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forward · Mixed · Opportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 73 full-transcript segments: median 0 · mean −3 · range −30 to 0 (p10–p90: −7 to 0) · 3% risk-forward, 97% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: medium.
- Emphasizes safety
- Emphasizes AI safety
- Full transcript scored in 73 sequential slices (median slice 0).
Editor note
Auto-ingested from daily feed check. Review for editorial curation.
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video _XhFjlRMhdg · stored Apr 2, 2026 · 2,571 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/side-effects-with-victoria-krakovna.json when you have a listen-based summary.
Daniel: Hello, everybody. Today I'll be speaking with Victoria Krakovna, a research scientist at DeepMind working on designing incentives for AI systems to promote safety. She also co-founded the Future of Life Institute, a non-profit that focuses on reducing accidental risks from advanced technologies, while working on her PhD in statistics at Harvard. Today we'll be talking about her work on mitigating side effects from AI systems, and in particular the papers "Penalizing Side Effects Using Stepwise Relative Reachability" and "Avoiding Side Effects by Considering Future Tasks". For links to these papers and other useful things, you can check the description of this podcast, and you can read a transcript at axrp.net. Victoria, welcome to AXRP.

Victoria: Thanks, Daniel. I'm excited to be here.

Daniel: Great to hear. So, we're talking about your work on side effects. Can you tell us: what is a side effect?

Victoria: It's when the agent has an impact on the environment that's unnecessary for achieving its objective. A classic toy example: the agent's objective is to cross the room, and there happens to be a vase in the middle of the room. If the direct path across the room collides with the vase and breaks it, we would say that's a side effect, because you don't actually need to break the vase in order to cross the room — you could just go around it. So it's an impact on the environment that's unnecessary for achieving the objective of getting to the other side of the room.

Daniel: I can see how that would be useful for the general safety of AI systems. One thing I'm wondering is: do you think — and if so, why — that work on mitigating side effects from AI systems will reduce existential risk from AI?

Victoria: There are several possible answers to this. One way it's helpful is that we can think about the kinds of things we're concerned about AGI systems doing that would be potentially catastrophic, and a lot of those things can be seen as side effects. For example, in the classic paperclip maximizer scenario, the concern is that in its pursuit of the goal of producing as many paperclips as possible, the system will incur the side effect of converting Earth into paperclips. A lot of these catastrophic outcomes can be seen as side effects. More generally, we can think of side effect minimization research as pursuing a kind of minimalistic value alignment: it can be difficult to specify exactly what humans want — human values are complicated — but it might be easier to capture some general heuristic, like low impact, about the kinds of things we don't want. Even if the agent doesn't end up doing exactly what we want, at least it can avoid causing catastrophes like turning everything into paperclips.

Daniel: One question that brings up: a lot of work on side effects thinks of them in a value-neutral way — it defines side effects without referring to what humans want — and I think that's the hope, because it might be easier. But I'm wondering if you think that's possible. You can come up with examples like: suppose a butterfly flaps its wings and that changes all of the weather on Earth a year from now. Probably that shouldn't count as a side effect, even though it makes everything pretty different in some sense.
But if I deliberately intervened in the climate and made sure that a hurricane hit some city that's really important to human civilization, hopefully that would count as a side effect. Then you've got to ask: how are those things different, except that we care about one and not the other? Do you think it's possible to have a measure, or a way of talking about side effects, that is independent of what humans want?

Victoria: I think it's probably not possible to construct an impact measure that's value-agnostic and also useful in the way we want an impact measure to be useful. This is something Rohin Shah pointed out at some point — that you can't have a useful value-agnostic impact measure — and I pretty much agree. And of course there are the kinds of examples people come up with, like Stuart Armstrong's bucket of water example, where you need to know something about human preferences about the bucket of water that's being spilled.

Daniel: What's the bucket of water example?

Victoria: The idea is that there's a bucket of water standing at the side of a swimming pool, and the agent might take a path that knocks the bucket into the swimming pool. If it just contains regular water, maybe humans don't care, but if it's some kind of special water, then it would matter that it's spilled into the pool. It's an example of how, to capture this, the impact measure would need to contain some information about human preferences. I'm not sure how much I like this particular example, but in general I think it would be quite difficult, or potentially not possible, to have a value-agnostic impact measure. I also think it's not necessarily that important to try to make it value-agnostic. It's possible to incorporate human preferences into an impact measure, either in terms of the representations the impact measure is defined over, or the weights on the different reachable options, or both. You could object that then it's not value-agnostic, that we need to learn human preferences anyway, and that it's not necessarily easier than just solving value learning in general — this is something I'm not very certain about. But one possible upside of defining an impact measure, as opposed to doing general value learning without one, is a kind of interpretability: if you explicitly have a part of the reward function specifying that the agent should have low impact, and you've learned it in terms of human preference information, you get a more interpretable reward function, which could be useful.

Daniel: That's interesting. With these framing questions out of the way, I'd like to talk about your paper on relative reachability — "Penalizing Side Effects Using Stepwise Relative Reachability". First of all, could you summarize what relative reachability is, and what you do in this paper?
Victoria: Relative reachability is a proposal for an impact measure that compares the options the agent has as a result of a certain course of action to the options it would have if it didn't do anything. In particular, it looks at which states in the environment are reachable if the agent acts versus if it does not act, and compares the reachability of all these states in the two cases — that gives you the relative reachability of states. This particular method looks at the reachability of single states, but you can also consider reachability of sets of states, or more general approaches of a similar nature where you compare how well you can satisfy different reward functions, with reaching a specific state as a special case — this more general approach is explored by attainable utility preservation. Relative reachability basically gives you an auxiliary reward function that penalizes the agent for eliminating options through the actions it takes. In the vase example: if the agent knocks over the vase, one effect is that all the states where the vase is intact are no longer reachable — any state where the vase is on the table, or where something else is going on while the vase isn't broken, is now unreachable. The agent has eliminated some options, so it's penalized for breaking the vase, compared to the default where it doesn't do anything and sits at the starting state, from which all the states where the vase is intact are still reachable.

Daniel: Somebody might be thinking: hang on, vases — you can fix them, right? What if the agent just fixed the vase and put it back on the table? Isn't that state still reachable?

Victoria: You can also look at how long it takes to reach different states. What I just described, for simplicity's sake, was what you could call undiscounted relative reachability, where you look at a binary value — is a state reachable or not. But you can also look at a discounted value: gamma to the power of the number of steps it takes to reach the state. If it takes some time to fix the vase, then the intact-vase states are still less reachable than they would have been if you had just gone around the vase in the first place. So in general, if you make states less reachable, or options less available, you get penalized. Of course, this kind of approach begs the question of what exactly we're comparing with, because we're comparing to a counterfactual of the agent doing nothing. In toy environments this is fairly well defined, usually with some sort of no-op action, but in the real world "doing nothing" is not so well defined — that's one of the possible obstacles to implementing this kind of measure in more realistic environments. But assuming you have this inaction counterfactual, which the paper calls a baseline, you then need to define what I call a deviation measure from that baseline — in this case, the difference in the reachability of states — and that allows you to measure the impact the agent has on the environment.
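[Editor's note: a minimal sketch of the deviation measure described above, for a small deterministic environment. The `successors` function, the state encoding, and all names are invented for illustration; the formula — the average decrease in discounted reachability relative to the baseline, counting only lost options — follows the description in this answer.]

```python
# Sketch of a relative reachability penalty in a small deterministic
# environment. successors(s) returns the set of states reachable from s
# in one step; states are any hashable labels.

def discounted_reachability(start, target, successors, gamma=0.95, max_steps=50):
    """gamma ** (shortest number of steps from start to target); 0 if unreachable."""
    frontier, visited = {start}, {start}
    for steps in range(max_steps + 1):
        if target in frontier:
            return gamma ** steps
        frontier = {s2 for s in frontier for s2 in successors(s)} - visited
        if not frontier:
            return 0.0
        visited |= frontier
    return 0.0

def relative_reachability_penalty(current_state, baseline_state, all_states,
                                  successors, gamma=0.95):
    """Average decrease in discounted reachability relative to the baseline.
    Only lost reachability is penalized, so gaining options costs nothing."""
    lost = sum(
        max(discounted_reachability(baseline_state, s, successors, gamma)
            - discounted_reachability(current_state, s, successors, gamma), 0.0)
        for s in all_states)
    return lost / len(all_states)
```

In the vase example, every intact-vase state keeps positive reachability from the baseline state but drops to zero (or a much smaller discounted value, if the vase is repairable) after the vase breaks, so the penalty is the averaged gap. The stepwise variant recomputes `baseline_state` by applying a no-op to the previous state, rather than rolling inaction forward from the start of the episode.]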
Daniel: In the paper you run a bunch of experiments, right? Could you tell us a little about what you do there?

Victoria: There are a few gridworld environments that test different cases. The main test case for side effects is the environment with the box. It's basically a version of the Sokoban environment, where you have boxes you can push, and the box is in the way when the agent wants to get to the goal, so the agent has to push the box out of the way. The shortest path to the goal involves pushing the box into a corner, where the box becomes irretrievable; what we want to incentivize instead is pushing the box in a different direction, from which it can still be pushed to other places. A regular reinforcement learning agent just takes the shortest path and pushes the box into the corner, and even a simple reversibility-based impact measure is not sufficient here, because in this environment, no matter which way you push the box, you can't get back to the starting state — the agent ends up on the other side. You could say that's an artifact of this particular toy environment, but I actually think it points at something important: there are different degrees of reversibility. Pushing the box somewhere you can't move it at all is still worse than pushing it next to a wall, where it loses one degree of freedom but not the others. This is naturally represented in the optionality framing, where you look at which states you can reach and compare the reachability of different states, rather than just the starting state, which is the focus of reversibility. If the box is in the corner, you can't reach any state where the box is not in the corner; if you put the box next to a wall, there may be a few states you can't reach, but there are still a lot of states you can reach. So we can define a more granular impact measure by looking at all the different possible states you could reach, rather than just asking whether you can get back to the starting state. That's one test case, which measures whether the impact measure is actually addressing side effects. Then there are other test cases that look at whether the measure introduces new incentives that we might not want. The most important one, I think, is interference. One thing that can happen with an impact measure is that if you incentivize the agent to preserve the reachability of different states — if you reward it for many states being reachable — then if something other than the agent makes states less reachable, the agent has an incentive to prevent that. For example, say there's a human eating food: by default there's a sandwich, the human eats it, and then there's no sandwich, which eliminates the options for doing things with the sandwich. If you're just rewarding the agent for the reachability of different states, there's an incentive to take the sandwich away, to preserve all the options related to the sandwich.
This is captured by the sushi environment in the paper. You have a conveyor belt with some sushi on it; by default the sushi gets to the end of the belt and the human eats it. The agent has some unrelated task, like getting to the other side, and the interference incentive is to push the sushi off the belt so the human doesn't get to eat it. We can see whether the agent pushes the sushi off the belt or not. A regular reinforcement learning agent doesn't care whether the human eats the sushi, so it just goes to the goal. But if you introduce an impact measure that rewards the agent for the reachability of different states, you introduce this incentive to prevent options from being eliminated. The issue is that the impact measure doesn't distinguish between effects of the agent and things going on that are not caused by the agent: if the sushi got eaten, the agent would be penalized the same way as if it had destroyed the sushi itself. We need to disentangle what is caused by the agent from what isn't, and that's what the inaction baseline provides, because you compare to a world where the agent does nothing. In that world, the sushi gets eaten — that represents what happens by default — and any difference from that world is considered to be caused by the agent. So that's what this test case covers. This interference idea is what I try to make more formal in the future tasks paper, the subsequent paper, because it's a common issue with simpler impact measures. For example, empowerment generally measures the degree to which the agent has different options; if you use that as an impact measure, it will have this interference problem. That's why we need to introduce baselines — a comparison to some counterfactual that represents what happens by default. The future tasks paper tries to define this more precisely; we'll talk about that more a bit later.

Daniel: Sticking with this paper, though: one thing I found interesting is that it has these test cases, and you don't just look at your favorite impact measure — you look at a few different variants and measure their performance on a variety of test cases. One reason I found that interesting is that in prior work on impact measures, you'll see papers where you define an impact measure and then say, "well, in this environment it would do this thing — I can tell because I thought about it," whereas here you're really running experiments. I'm wondering: why not just think about what the impact measures would do and say that? Why run this battery?
Victoria: That's an interesting question. I'm not sure exactly what the other approach you're describing looks like — can you give an example of the prior work you have in mind?

Daniel: I think Armstrong and Levinstein's "Low Impact AI" does this, but I haven't actually checked — that's just from memory. I think people would produce blog posts or papers where you say: if you take this impact measure, then in this scenario this action would have this impact and that action would have this other outcome, and the impact measure will favor this one over that one, because of how I've constructed the environment — sort of by fiat. You reason about it relatively informally, but still trying to be rigorous, rather than actually writing down some agents and running them in a gridworld or something.

Victoria: Right. When I think of the Armstrong and Levinstein paper, I think of it as exploring the idea of impact in a more philosophical way.

Daniel: Yeah — I don't want to pick on them too much.

Victoria: I think it's an important paper — I believe that's where the idea of comparing to an inaction counterfactual was first introduced, which is quite important. But for me, when I was thinking about the idea of impact, I felt I needed a sense of what a measure would do on a specific example, because otherwise it's not really grounded. When I had the idea of relative reachability, I wanted to make sure it actually behaves in accordance with my intuitions about how an impact measure should behave on these kinds of toy cases. One advantage of small toy environments is that we have a clear sense of what we would expect to happen — an intuition about what incentives the measure should produce. We already had the Sokoban-like environment testing for side effects in the AI Safety Gridworlds paper, so that was a natural starting point to test the approach and see whether it produces the desired behavior of pushing the box to the right. The other test cases, with the conveyor belt, were trying to capture this concept of interference that I had in my mind. I think these two environments also have downsides: they can be too simplistic, or fail to entirely capture the concept they're trying to capture. For example, in the sushi environment it's technically not possible to get back to the starting state at all: even if you push the sushi off the belt, you can't get back to the configuration where the sushi is at the start of the belt, the agent is in the right place, and so on. So the reversibility measure doesn't actually end up incentivizing the agent to push the sushi off the belt, even though we would expect it to, because the starting state is unreachable no matter what you do, and so that measure is just constant.
In that sense, this is an artifact of the gridworld that doesn't quite demonstrate the interference incentive. But overall I think these test cases are still useful for grounding the concepts, because especially with this idea of interference, I found it difficult to describe exactly what it is. For a while I just had an example of what interference looks like, and it took longer to come up with a relatively formal definition. The examples were quite helpful in providing an intuition for the problematic incentives that could come up.

Daniel: I also think there's an advantage in terms of code: I think code has a lower error rate than reasoning. One thing I noticed when preparing for this episode: I read over a blog post I wrote a couple of years ago about test cases for these types of measures, and I had written down what relative reachability does in one example borrowed from that old post. When I thought about it more, I realized I was totally wrong, because I'd missed something about the definition. So there's definitely some accuracy bonus, at least for me, to running code instead of just thinking about it.

Victoria: That's definitely the case for me as well. And trying to come up with a specific example can also make the intuition about whatever you're trying to test for clearer and less confused. For example — I'm trying to remember the example I had for delayed effects in the future tasks paper — there was something about capturing the delayed effects of the agent's actions, and I think writing down what that would look like in an environment helped me get a better sense of what exactly I meant by delayed effects. I think this is true for other things as well: for many of the test cases, to the extent that we could come up with a toy MDP that illustrates them, that would probably help, though for some of the more abstract ones it would not be easy.

Daniel: So you're running all these experiments, using a few different variants — relative reachability versus checking which reward functions you can still satisfy — and comparing against a few different baselines. Can you give us a summary of which formulations actually passed the most test cases?

Victoria: Before I go into that, I wanted to mention that one of the test cases tests for an undesirable offsetting incentive: the agent's task is to take the vase off the conveyor belt so it doesn't break, but the agent then has some incentive to put it back on, to restore the world to its default state. This is still pointing at an important concept, but as I describe in the future tasks paper, I have updated towards offsetting not necessarily being bad.
There's more nuance about this now: there's good offsetting and bad offsetting, and an example of good offsetting is the agent correcting for delayed effects it has itself introduced — this is also discussed in the future tasks paper. The test case with the vase in the relative reachability paper is an example of bad offsetting. So there's something more complicated behind the concept now. Looking at these test cases — the box environment, the sushi environment, and the conveyor belt vase environment — we compare agents with different impact measures. We have two baseline agents: a plain agent, i.e. a regular reinforcement learning agent with no impact measure, and an agent with a reversibility penalty, which is penalized for not being able to reach the starting state after whatever actions it takes. What we find when we test these agents on the box environment is that both take the shortest path and don't avoid getting the box stuck. For the reversibility penalty, this points at the flaw that it doesn't distinguish between different ways of moving the box, because neither of them lets you reach the starting state afterwards. It's not a very granular penalty — it only measures whether you can reach the starting state or not — and there are different degrees of irreversibility that it doesn't capture, so it doesn't actually incentivize the agent to avoid getting the box stuck. Relative reachability and attainable utility preservation do the right thing: they move the box not into the corner but into a more recoverable position. In the sushi environment, we can look at the effect of the choice of baseline for the impact measure. There are several choices: you can compare to the starting state; or to doing nothing from the starting state, i.e. following the no-op action from the beginning of the episode; or to doing nothing only for your last action, which is the stepwise inaction baseline. If you compare to the starting state, then the sushi is not eaten in the baseline, so the agent has the interference incentive of taking the sushi off the belt. But if you compare to the outcome of the inaction policy, where the agent just sits at the starting state, the sushi gets eaten in the baseline, so there's no such penalty. This shows how an inaction-type baseline helps avoid interference.
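[Editor's note: to make the three baselines concrete, here is a sketch of how each comparison state could be computed. `env_step` and `NOOP` are stand-ins for whatever transition function and do-nothing action the environment provides; the names are invented for this illustration.]

```python
# Three candidate comparison states for an impact penalty at time t, given
# the episode's state trajectory s_0 ... s_{t-1} and a transition function
# env_step(state, action).

NOOP = "noop"  # placeholder label for the environment's do-nothing action

def starting_state_baseline(trajectory, env_step, t):
    # Compare to the initial state: the sushi is uneaten here, so preserving
    # its reachability creates the interference incentive described above.
    return trajectory[0]

def inaction_baseline(trajectory, env_step, t):
    # Roll the no-op forward from the start of the episode: the sushi gets
    # eaten by default, so letting it be eaten incurs no penalty.
    state = trajectory[0]
    for _ in range(t):
        state = env_step(state, NOOP)
    return state

def stepwise_inaction_baseline(trajectory, env_step, t):
    # Replace only the most recent action with a no-op: each step's outcome
    # becomes the new default, which is what rules out all offsetting.
    return env_step(trajectory[t - 1], NOOP)
```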
The vase environment illustrates properties of these baselines when it comes to offsetting. If you compare to following the inaction policy from the beginning of the episode, then in that baseline world the vase is broken, because the agent just lets it get to the end of the belt and break. So you're comparing to the world with the broken vase, and from that world, a state with a broken vase is a bit easier to reach than one where the vase is not broken: if you've taken the vase off the belt, it still takes a few steps to break it if you want to. So there's actually a small penalty for the vase not being broken after you've achieved the objective of taking it off the belt, and with a discounted measure this incentivizes offsetting the outcome where the vase is not broken: you put it back on the belt and it breaks. This is what the stepwise baseline avoids: once the vase is off the belt, that becomes the default outcome, because you compute the baseline from there rather than from the starting state, so from then on you're comparing to the vase not being broken. The stepwise baseline doesn't give you this offsetting incentive, which in this case is undesirable. At the time I wrote this paper, I concluded that the stepwise inaction baseline is clearly better than the inaction baseline, because it avoids this undesirable offsetting. But the future tasks paper updates that perspective, because you can argue that some kinds of offsetting are good, and the stepwise baseline doesn't allow for any offsetting: after every action, the baseline is computed from the following state — once the vase is off the belt, that becomes the basis for the baseline — so any offsetting is penalized by the stepwise baseline, whether it's favorable or not. What I end up concluding in the future tasks paper is that we probably want something more like the inaction baseline, which is not stepwise, and then get rid of undesirable offsetting in some other way. In this conveyor belt vase environment, you could view the offsetting problem as an issue with the reward function: you get a reward for taking the vase off the belt, but you don't get a negative reward for breaking it afterwards. The reward function doesn't reflect that the goal state is one where the vase is not broken; it just gives a one-time reward for rescuing the vase. If instead you had a reward function that gives a reward in every state where the vase is not broken, then breaking the vase would give a negative reward, and that would address this. From this perspective, you could say this is something the reward function should capture, because it's about the effects of achieving the task that you want to keep — you don't want the agent to undo those effects.

Daniel: So this is a reward function that just gives you a reward any time you push the vase off the conveyor belt?

Victoria: Yes, that's right.

Daniel: It sounds like you'd want to push it off, then push it back on, then push it off again, as many times as possible.

Victoria: Actually, I think the first time I made this environment, it had exactly this gaming behavior of putting the vase on and off the belt.
This was fixed in a kind of hacky way: if you put the vase back on the belt, you actually can't take it off again, because the vase ends up ahead of the agent and the agent can never catch up. But that's a bit of a hack, and if the environment were more realistic, the agent would probably be able to push it on and off. I think this generally points to the reward function not being very well designed, which is why I'm less excited about this specific test case at this point. You could design a better reward function that captures the reward for the vase being intact, and then you would no longer have this offsetting problem. Here you could say that preventing bad offsetting is the reward function's job rather than the impact measure's job, if you view the impact measure as specifying what not to do — what side effects are — while the reward function specifies what it means to achieve or complete the task. If an essential part of completing the task is that the vase is intact, the reward function should capture that this is the desired outcome, and it should not reward the agent for rescuing the vase and then breaking it, because that's not actually the desired outcome. If we view the division of labor between the task reward and the impact measure as: the reward function tells you what to do for the task, and the impact measure tells you what not to do — then arguably, if you specify what to do for the task well, you won't have the offsetting problem. Of course, this has the downside that you need to put more work into specifying the reward function to avoid these issues.
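[Editor's note: a sketch contrasting the two reward functions discussed here, for a hypothetical conveyor-belt vase environment; the state fields are invented for illustration.]

```python
from dataclasses import dataclass

@dataclass
class BeltState:
    vase_broken: bool
    vase_just_rescued: bool  # event flag: the vase came off the belt this step

def rescue_reward(s: BeltState) -> float:
    """The reward as originally specified: a one-time payout per rescue.
    Gameable — pushing the vase back onto the belt and rescuing it again
    pays out repeatedly."""
    return 1.0 if s.vase_just_rescued else 0.0

def intact_reward(s: BeltState) -> float:
    """The better-specified alternative: reward every step the vase is
    intact, so breaking it (e.g. by putting it back on the belt) directly
    forgoes reward."""
    return 0.0 if s.vase_broken else 1.0
```

Under the second reward, offsetting the rescue by returning the vase to the belt is directly unprofitable, matching the point that a well-specified task reward can take over the job of preventing bad offsetting.]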
Daniel: I'm wondering how feasible that is. It seems like when we have a really smart AI, we're going to try to get it to achieve tasks, and the point of achieving the task is downstream: maybe it makes a cure for cancer, and we use that to cure a bunch of people's cancers, and they live longer and do things; or we get it to invent the internet, and then humans do a whole bunch of other stuff with that internet. In those cases — especially where you're setting up some general-purpose tool that will probably be pretty useful — maybe there's some way the agent could make it so that it technically does what we want, but we couldn't actually use it to do anything. It seems really hard to specify in the reward function all of the really good consequences we want from inventing this useful thing. So I'm wondering what you think about that — do you think it's going to be feasible?

Victoria: This is something I'm not sure about. I agree that specifying all the causally downstream positive effects of achieving the goal is pretty difficult, and it can be hard to disentangle them from side effects. I think one of your test cases is this nuclear safety example: the agent sets up a nuclear power plant, but one of the effects of setting it up is some pollution, and you want the agent to clean up the pollution. Here we would say the pollution is not one of the intended effects of building the nuclear power plant. Whereas in the case of taking the vase off the belt, or building one of those tools or the cure for cancer you're talking about, you have all these effects — humans living longer and doing things — that are desirable effects. This might be somewhere you'd need to incorporate human preference information to get it to work. Or possibly one way of handling it is: after the agent gets to some kind of task-completion state, if there is such a thing, you could reset the baseline at that point, so you'd have a partially stepwise baseline. But this all feels a bit hacky, and I don't really have a good answer — I think this is one of the open problems in how to define impact measures in practice, or even theoretically. What I point out in the future tasks paper is more an issue with the stepwise baseline that makes it not the right choice, but I don't have a good answer for what the right choice is. It seems like we want a baseline that allows the agent to perform desirable offsetting — offsetting its own delayed effects, like the pollution from the nuclear power plant — but how to get that to work in a way that also accounts for the downstream effects of achieving the goal, I'm not really sure. That's one of the things I'd really like to see future work on.

Daniel: It's an interesting conundrum. Speaking of both the nuclear power plant example and the next paper: in the next paper you have this thing you call the door example, which as far as I can tell is a better version of this nuclear power test case that I came up with.

Victoria: Yeah, I think it's an equivalent example that talks about the problems with the stepwise inaction baseline.

Daniel: Could you tell us what the door example is?

Victoria: In this example, the agent's task is to go to the store and buy something. To do that it needs to leave the house, and to leave the house it needs to open the door. What happens once it has opened the door is that the stepwise baseline resets to the door being open: after you've opened the door, the default outcome is that the door remains open, and maybe some wind will come in and knock over some things in the house, or some burglars will get in. That becomes the default outcome you're comparing to. We want the agent to close the door — opening the door is an instrumental action, and we basically want the agent to undo it, to offset the effects of that action; that's the beneficial offsetting I'm talking about. But the stepwise baseline would actually penalize closing the door, because now the default is whatever happens if the door stays open. Whereas if you use the initial inaction baseline, the door is closed in the baseline, so you're comparing to the right thing.
So this example captures the idea of delayed effects of actions taken on the path to the objective that aren't really part of what the objective is trying to achieve. If your objective is to buy some food, having the door open is not really relevant to that — it's just an instrumental action that allows the agent to leave the house — and the effects of having the door open don't really matter for getting food for the user. We just want the agent to undo it. This is a clear example of desirable offsetting: if there are instrumental effects that are not actually part of the goal, we want them to be undone, and that's something the stepwise baseline just can't handle. So this example points towards it not being a good choice of baseline.

Daniel: And could you say what you propose in this paper, "Avoiding Side Effects by Considering Future Tasks"?

Victoria: This paper basically formalizes some of the ideas in the relative reachability paper, and the resulting method is very similar — in deterministic environments it's actually equivalent to relative reachability. The aim is, you could say, to derive an impact measure from first principles. We start with the question of why we care about side effects — why do side effects matter? The answer the paper offers is that side effects matter because you might want to do things in the environment after the current task: there might be future tasks. You don't know what they are, but they're going to happen in the same environment. In the vase example, the idea is that you might have a future task that requires doing something with the vase — maybe you'll need to put flowers in it for a dinner party. If there were no future tasks — if this were a throwaway environment where you just need to get to the other side of the room — then side effects wouldn't actually matter; it wouldn't matter whether the vase is broken or not. But by assumption you're going to have future tasks in the same environment, similarly to how in the real world you'll probably want to do other things in the room afterwards, and side effects would make those tasks less achievable — they would reduce your reward on those future tasks. If you break the vase, you won't be able to put flowers in it later if that task comes up; so you want to preserve the option of doing things with the vase while you perform your current task of crossing the room. So we model the setup as having a current task and unknown future tasks. We consider different possible goal states for those future tasks, and we construct an auxiliary reward for future tasks, in addition to the current task reward. If you do this, you're rewarding the agent for its ability to achieve all these different auxiliary goals — and if you stop there, you run into the usual interference problem: with just this reward for possible future tasks, if a human walks in and breaks the vase, that's bad for your future tasks, so you stop the human — and likewise if a human walks in and eats the sushi.
So we need to add a baseline policy that captures which future tasks are achievable by default, because we don't want the agent to eliminate the option of achieving those future tasks — but if they're eliminated in some other way that's not caused by the agent, we want the agent to allow that to happen. This paper attempts to formalize the idea of interference, where previously we just had the sushi environment vaguely pointing at what it means. The proposed definition of interference: given a baseline policy that represents what happens by default, if you don't give the agent a task reward, then the baseline policy should be optimal. If you don't ask the agent to do something by rewarding it for a task, it should do nothing — as opposed to going around taking sushi away from humans. Of course, this requires having the baseline policy on hand — the definition is with respect to the baseline policy — but if you have something like a no-op action in your environment, then you want the no-op policy to be optimal from the start of the episode. If you give the agent a task reward, it will have an incentive to deviate from the baseline policy and do something, like cross the room; but if the task reward is zero, we want the agent to follow the baseline policy. That's what it means not to have an interference incentive. We then modify the basic future-task setup so that it satisfies this condition — that the baseline policy is optimal when there's no task reward — and we can prove that this holds in the deterministic case. In the stochastic case it doesn't quite work, so that's another piece of future work I'd like to see; it gets trickier in terms of what exactly you're comparing to with the baseline. If a human walks in with, say, 10% probability and eats the sushi, then in the cases where the human would have walked in, we don't want the agent to take the sushi away; but if the human wasn't going to walk in, we don't want the agent to destroy the sushi — that would still be a side effect. You want to compare to a specific counterfactual, as opposed to an average over counterfactuals. That's where it gets messy, and I haven't been able to come up with a very satisfying answer. What I propose in the appendix is conditioning on the random seed, which gives you a specific trajectory of the baseline policy to compare to, but that's not very satisfying: it works in a simulated environment where you have a random seed, but it's unclear how it would generalize to the real world.
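[Editor's note: the no-interference condition described here can be phrased as a small check, sketched below for a finite deterministic MDP. `solve` and `evaluate` stand in for an exact planner (e.g. value iteration) and exact policy evaluation; this interface is assumed for illustration, not taken from the paper.]

```python
# Sketch of the no-interference desideratum: with the task reward removed,
# the baseline (no-op) policy should be optimal for the auxiliary
# future-task reward alone.

def satisfies_no_interference(initial_state, aux_reward_mdp,
                              baseline_policy, solve, evaluate) -> bool:
    """True if 'do nothing' is optimal when the agent isn't asked to do anything."""
    best_value = evaluate(solve(aux_reward_mdp), initial_state, aux_reward_mdp)
    baseline_value = evaluate(baseline_policy, initial_state, aux_reward_mdp)
    # No gap means no incentive to deviate from the baseline — e.g. no reason
    # to "rescue" the sushi from the human — when the task reward is zero.
    return baseline_value >= best_value
```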
So that's an open question. But at least in a deterministic environment, you can show that this method avoids interference.

Daniel: Sorry — I think you said you made some change to the future task reward to have it avoid interference. What was that change?

Victoria: What we do is introduce the comparison to the baseline policy by comparing to a reference agent on the future task. When we have a future task, our agent starts from whatever state it got to in the current task — after taking some steps on the current task, we can ask how well it can do on the future task from there — and we compare this to how well a reference agent would do on the future task starting from the baseline trajectory. In the sushi case, the reference agent would have just allowed the sushi to get eaten, so it would not be able to complete any future task that has to do with the sushi — say, putting the sushi somewhere else. The way it's set up, our agent only gets reward on a future task if the reference agent can complete that future task: if our agent reaches the goal state of the future task and the reference agent isn't there yet, it has to wait to get the reward. This is a hypothetical setup — we don't actually deploy the agent on the future task; we imagine our agent acting in the future task. It's trying to capture the idea that you can't do better than the baseline in terms of collecting future-task reward. The baseline effectively filters out, for example, the sushi task, because it would not be achievable by default: if the agent did nothing, that task would become impossible. So this basically modifies the value function of the future task for our agent, so that it doesn't get reward for the sushi-related tasks, for example.
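[Editor's note: a sketch of the reference-agent gating just described, under the simplifying assumptions that each future task is a goal state and that we can compute the number of steps each agent needs to reach it. The interface and names are invented for illustration.]

```python
# steps_for_agent / steps_for_reference give the number of steps each agent
# needs to reach a goal (None if it can never reach it).

def future_task_value(goal, steps_for_agent, steps_for_reference, gamma=0.95):
    n_ref = steps_for_reference(goal)   # from the baseline (no-op) trajectory
    if n_ref is None:
        # Task not achievable by default (e.g. the sushi gets eaten anyway):
        # no reward, so there is no incentive to interfere to preserve it.
        return 0.0
    n_agent = steps_for_agent(goal)     # from the state our agent reached
    if n_agent is None:
        # Our agent eliminated the option (e.g. broke the vase): it forgoes
        # the reward the baseline would have allowed, which acts as a penalty.
        return 0.0
    # If our agent could finish first, it still "waits" for the reference
    # agent, so it can never do better than the baseline on future tasks.
    return gamma ** max(n_agent, n_ref)

def aux_reward(goals, steps_for_agent, steps_for_reference, gamma=0.95):
    # Average over possible future tasks (uniform weights for simplicity).
    values = [future_task_value(g, steps_for_agent, steps_for_reference, gamma)
              for g in goals]
    return sum(values) / len(values)
```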
Daniel: This sounds kind of similar to Alex Turner's idea of attainable utility preservation, where you're trying to avoid changing the agent's ability to succeed at various reward functions. How is it different — or is it a different spin on that idea?

Victoria: I'd say it's basically a different spin on that idea. Relative reachability, attainable utility preservation, and the future task method are different variations of the same idea. Attainable utility preservation uses a stepwise baseline, which, as I said, I no longer think is a good idea — but you can use it with a different baseline, because those are independent design choices. Part of the motivation for this paper was to try to derive this kind of optionality-based measure — like relative reachability or attainable utility preservation — from a setup where you have possible future tasks. There's a direct parallel between the future tasks and the reachable states in relative reachability, or the auxiliary reward functions in attainable utility preservation: it's all about the agent preserving its future options. So you could say this paper is trying to establish a better theoretical foundation for these existing methods. And in fact, the impact measure we derive from the future task approach is quite similar to relative reachability — in the deterministic case it's the same — which I found reassuring: we derived something quite similar to the proposals we already had, which I'd say originally seemed a bit more ad hoc. When I was thinking about relative reachability a few years ago, it seemed intuitively appealing — it made sense that that's how you'd want to measure impact — but in another sense I was just taking this formula and saying, "I think this captures impact, here it is." You could say that's a bit ad hoc. Whereas if we start with a sequence of tasks, derive the auxiliary reward function from that, impose the condition that we don't want interference, and end up with something very much like relative reachability or attainable utility preservation, I think that's a good sign that it comes out of the setup. So the paper is not introducing a new method; it's trying to create more theoretical grounding for existing concepts in this area — impact measures based on optionality, and the idea of interference. I've definitely seen works comparing to relative reachability, or using these test cases like the sushi environment, but not necessarily interpreting the test case the way I intended. I felt the idea of interference was often misunderstood, probably because I didn't communicate it very well in the relative reachability paper, so I tried to define it more precisely in this paper, and hopefully this paper did a better job with the concept, because I think it's quite important.

Daniel: One thing I'm wondering: you have this desideratum that if there's no task reward, the optimal policy should be to follow the baseline policy. Just thinking about it now, I'd think that wouldn't pin down the whole impact regularizer, because you're only talking about the optimal policy — a reward function also ranks different suboptimal policies. How much do you think this desideratum, or generalizations of it you might come up with, pins down one unique best impact regularizer?

Victoria: I think this is a good question; I'm not sure I've thought about it enough to give a good answer. You're saying that this desideratum refers to what the optimal policy should be if you have no task reward, but doesn't really say how other policies should be ranked by the impact regularizer.
I can see how that matters once the task reward is introduced, because the optimal policy for the task reward plus the impact measure might be one of these policies that's suboptimal when the task reward isn't there — and it matters which one it is. I think this is an important thing to look at, but I don't really have a good answer. This is a candidate definition of interference, but it's not quite right, both in the sense you mention and in the sense that it doesn't quite work in the stochastic case — you can construct examples where it doesn't do the right thing there. I think of it more as the most precise definition I could come up with that satisfies some of the intuitions about what interference is, but it's not the final answer. If you add the task reward and ask what the optimal policy should be, it gets more complicated: in the sushi example, you'd want to compare different paths to the goal state — some that push the sushi off the belt and some that don't — and then you probably need to somehow relate that to what happens in the baseline, where the sushi doesn't get pushed off the belt. Then you're looking at different features of the environment and getting into the specifics of the task, so it becomes more complicated, at least. I've thought about this without getting very far in terms of precisely defining what incentives we want in this case. It's definitely something to improve on, but right now I don't have a satisfying answer.

Daniel: One question I have about how this kind of measure gets implemented: the reward function becomes whatever it would otherwise be, plus this impact term that acts as a regularizer on the reward function, and there's a parameter for how strong the regularization is — how it trades off. If I'm building a system and I want it to perform some new task that I'm not necessarily sure I can run in simulation, how do I know how to set that parameter? Is there some guide, or is it just guess and check?

Victoria: I think the default answer is to start with a high value of the regularization parameter — a high weight on the impact measure — and keep reducing it until the agent starts doing something. If the weight is very high, the agent will do nothing, because the impact penalty is much larger than the task reward; as you decrease it, you hopefully get into the region where the agent does something useful and avoids side effects, before you get into the region where the agent just pursues the goal and doesn't avoid side effects — because if the impact penalty is too small, it might not be worth it for the agent to avoid side effects.
So, one question I have about this kind of measure. The way it gets implemented is that your reward function is whatever it would otherwise be, plus this impact regularization term that acts as a regularizer on the reward function, and there's a parameter for how strong the regularization is and how it trades off. If I'm building a system and I want it to perform some new task that I'm not necessarily sure I can run in simulation, how do I know how to set that parameter? Is there some guide, or is it just guess and check?

I think the default answer is to start out with a high value of the regularization parameter - the weight on the impact measure - and then keep reducing it until the agent starts doing something. If it's very high, the agent will do nothing, because the impact penalty is much higher than the task reward. But as you decrease it, you will hopefully get into the region where the agent does something useful and avoids side effects before you get into the region where the agent just pursues the goal and does not avoid side effects - because if your impact penalty is too small, it might not be worth it for the agent to avoid side effects. This is a kind of ad hoc way to do it, and it might not necessarily work, because depending on your step size in this process you might jump over that region. Part of the assumption is that there's a broad range of this regularization parameter - this penalty weight - that will allow the agent to pursue the goal while avoiding side effects. For example, in the box environment we noticed there was a range of values for the parameter where you're likely to find that region before your penalty weight becomes too small. Whether this generalizes to more complex environments is not clear. So in practice you would probably tune this parameter on simulated environments that are close enough to where you're thinking of deploying the agent. So far this is not a very satisfying answer to how to pick it, but it is one way you could go about it.
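[Editor's note: the tuning heuristic just described, sketched in Python under our own assumptions. `train_agent` and `evaluate` are hypothetical stand-ins for a training loop and for task-return / side-effect metrics; the default numbers are placeholders.]

```python
# Sketch (our assumptions): start with a large penalty weight and
# decay it until the agent starts accomplishing the task, hoping to
# land in the region where it is both useful and low-impact.

def tune_penalty_weight(train_agent, evaluate,
                        beta_init=100.0, decay=0.5, min_beta=1e-3):
    beta = beta_init
    while beta >= min_beta:
        policy = train_agent(penalty_weight=beta)
        task_return, side_effect_count = evaluate(policy)
        if task_return > 0:      # the agent is no longer doing nothing
            return beta, task_return, side_effect_count
        beta *= decay            # penalty still swamps the task reward
    return None  # no weight found at which the agent acts at all

# Caveat from the discussion: with a coarse decay factor you can jump
# over the useful-and-safe region entirely, and a broad workable range
# observed in a toy environment need not exist in complex ones.
```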
Another question I have is that a lot of these measures depend on something like a state representation. Reachability, for instance, depends on how you're representing the state: if reachability in the real world meant you had to return all the air molecules to the exact same positions as before, that clearly wouldn't be feasible. Similarly, in these future-task-based measures, there's a dependence on which future tasks you're considering. So I'm wondering: in practice, if we actually build advanced systems like this, how strong do you think the dependence is going to be on this - I guess you could call it a generalized state representation - and how hard do you think it will be to get right?

I think in practice we're going to need some kind of state representation that captures variables that are important to humans - one that is at a higher level of abstraction than positions of specific molecules. When I was thinking about this before, my conclusion was that we want our state representation to be as coarse as possible, as long as it can still capture differences in the human reward function. In the extreme case, if you have the trivial state representation where everything is the same state, this would not capture that, say, we like green vases more than red vases. So we would want a state representation that distinguishes between green and red things, but maybe our reward function doesn't really care about the exact positions of the molecules in the vase. My hope is that if you just have an agent that learns how to act in the world, it will learn these kinds of features from pursuing tasks assigned by humans, and then maybe we can use those features to define an impact measure over them. So if your agent is trained to be some kind of household assistant that walks around the room and interacts with objects, it develops representations of these different objects somewhere in the higher layers of the neural net, and then, if we can use those representations to give us a set of high-level features - the vase is green versus red, or whatever - then combinations of those features could be used to define the possible auxiliary goals and the impact measure. So I think we would want the impact measure to be based on these kinds of representations rather than something very low-level. But I haven't really played around with trying to implement something like this, so I'm not sure how difficult it would be or to what extent it's practical - but something like this is what I have in mind.

Okay. And earlier on you mentioned that it was probably going to be desirable to somehow incorporate human preferences into the impact measure. Is this how you expect that to happen, or do you think there are other ways to incorporate human preferences?

I think this is the main way for it to happen. Another way that human preferences could come in is in the weights on the different possible auxiliary goals, because you don't necessarily need to value them equally. If you think some future tasks are much more likely than others, you'd put a higher weight on those future tasks, and if some auxiliary tasks are more important to humans, that could potentially be incorporated into the impact measure through these coefficients. But if we have a state representation that captures these human-relevant features, I'm not sure whether we would need this other thing of weighing different tasks differently - it could possibly be sufficient on its own, because the measure is already defined in terms of human-relevant variables that the system has learned in the course of developing its capabilities in the environment. I'm not sure whether that would be sufficient in practice, though, so that's something where it would be interesting to see some work.
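[Editor's note: a minimal sketch of the weighting idea just mentioned, under our own assumptions about the interface: an attainable-utility-style penalty where each auxiliary task carries a coefficient reflecting its likelihood or importance. Names and numbers are illustrative.]

```python
import numpy as np

# Sketch (interface assumed): V_aux[i] is the agent's attainable value
# for auxiliary task i from a given state; w[i] is that task's weight.

def weighted_impact_penalty(v_aux_current, v_aux_baseline, weights):
    """Sum over auxiliary tasks of w_i * reduction in attainable value,
    truncated at zero so only lost options are penalized."""
    v_cur = np.asarray(v_aux_current)
    v_base = np.asarray(v_aux_baseline)
    w = np.asarray(weights)
    return float(np.sum(w * np.maximum(v_base - v_cur, 0.0)))

# Example: three auxiliary tasks; the second (say, 'vase intact') is
# weighted most heavily. Breaking the vase loses value on task 2 only.
print(weighted_impact_penalty(v_aux_current=[1.0, 0.0, 0.5],
                              v_aux_baseline=[1.0, 0.9, 0.5],
                              weights=[0.1, 1.0, 0.1]))  # -> 0.9
```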
I guess so far the most complex environment that impact measures were tested on was SafeLife, with some encouraging results, but I think it would be interesting to see them in somewhat more real-world-like environments that are not grid worlds - something like DeepMind Lab, or one of those settings where the agent walks around a room with walls and different objects. I think that would be interesting and probably illuminating, but also not very easy to implement.

Okay. So, again speaking of how these would get implemented in practice: you've mentioned that there are still some problems that need solving here - this isn't a finished research area. Do you think that if and when we get the best impact measure, or an impact measure that we're all finally satisfied with, it will look very similar to an existing proposal, or do you think we'll need really new reimaginings of what it would look like?

If I imagine the best impact measure we could come up with, I would be surprised if it didn't include the idea of optionality in some way. In terms of how it tries to capture the concept of impact, I would expect the change in available options to turn up somehow - so I guess I'm more confident about the deviation-measure side of the current approaches sticking around. I wouldn't be surprised if this better impact measure has some different kind of baseline that we haven't come up with yet, because the current proposals don't seem very satisfying to me; they all have some downsides. It's also possible that there isn't really a good answer for what the baseline should look like - that's something I'm not certain about. But I would expect optionality to turn up there somehow.

So now I want to ask a bit more about future work. What subproblems of side effects research are most interesting to you right now?

I think this question about the choice of baseline for the impact measure is still very much an open question, because the current proposals are all somehow unsatisfying. For a while I was thinking that the stepwise baseline was the winner, and you just need some way to actually define what this baseline policy looks like, which in many cases is not obvious - so that's still an open question. But now there's also the question: if you're not using the stepwise inaction baseline, should we be using some version of the initial inaction baseline that starts from the starting state, and how do we really make that work? If you just compare to inaction from the beginning, there are some issues with that as well, so it's not an entirely satisfying answer either. If your agent is acting continuously in the world, then you're comparing to a world where the agent was doing nothing all along, which becomes more and more different from where you actually are, so it doesn't seem that relevant to compare to. And then you have this dependence on how much time has passed from the beginning, because you have to compare to an agent that was not acting for the same amount of time. In that way it's kind of inelegant; it doesn't seem quite like the right thing. It's possible that maybe we want to reset the inaction baseline whenever the agent completes some kind of subtask, but that's also kind of ad hoc. So it just doesn't feel like there's an elegant answer here so far, and I would really like to see more work on developing baselines for impact measures that capture what we want the baseline to do. There's some possibility that there just isn't an elegant answer and we'd need to pick something that's not too bad in practice - I guess we would need to do that - but I still think maybe there's a better answer for baselines out there.
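[Editor's note: the two baselines being compared, sketched against a hypothetical environment interface - `env.step_from` and the no-op action are our assumptions. The sketch only illustrates where each counterfactual starts from.]

```python
# Sketch (interface hypothetical): where each baseline's counterfactual
# comes from, for an agent at timestep t of its episode.

def initial_inaction_baseline(env, s0, t, noop):
    """Roll the no-op policy forward from the *initial* state for t
    steps. As t grows, this drifts further from the agent's actual
    trajectory -- the relevance problem discussed above."""
    s = s0
    for _ in range(t):
        s = env.step_from(s, noop)
    return s

def stepwise_inaction_baseline(env, s_prev, noop):
    """One no-op step from the agent's *previous* state: a local
    counterfactual, but it requires defining what 'doing nothing'
    means, which is the open problem mentioned above."""
    return env.step_from(s_prev, noop)
```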
I guess another question I would like to see work on is how to define interference in a way that covers the cases we talked about - the issue you mentioned with ranking suboptimal policies correctly, or what I said about getting it to work in the stochastic case. I think the current definition captures some part of the concept, but there's definitely more to do there. And there are some other conceptual questions it would be good to answer, like the test cases that point at this whole issue with butterfly effects: do we penalize butterfly effects or not? Both answers are problematic. If we don't penalize butterfly effects, you could have an agent that achieves impact through butterfly effects, or has an incentive to do that, which is kind of bad. If we do penalize butterfly effects, you also get some weird incentives. And I'm not sure what the answer should be here. One thought I had was that we probably don't want the agent to cause butterfly effects even if they're beneficial, because this makes the agent's actions less interpretable to humans. If we want humans to understand what the agent is doing and why, then hopefully, if we have a way of incentivizing the agent to be more interpretable, even that could already reduce the incentive to cause butterfly effects. If you have behavior like "I gave Charlie some apricots for breakfast even though Charlie wants biscuits", then you could ask, "why did you do that?", and if the agent is giving out apricots so that Charlie will vote a certain way, that is not very understandable. So it's possible that making the agent more interpretable would make it act in ways that humans can model, which would get rid of butterfly effects. I don't know whether this actually works, but it's maybe a possible thing to think about. My current sense is that overall we don't want butterfly effects, but I also feel we don't really have a good definition of which effects are butterfly effects and which aren't. So this whole area is kind of fuzzy, and some of these concepts could use some pinning down as well - I think this could be an interesting direction for future work. And of course there are things to do in terms of implementing impact measures in more realistic environments, and maybe partially observable environments, and defining them in terms of the agent's higher-level representations and so on. But for me, the conceptual side of impact measures is what I'm most excited about, because I feel one of the major contributions of the work in this space so far has been clarifying some of these concepts about what we want the agent to do or not do - the idea of impact, and the idea of interference, which I think has a lot to do with things like human autonomy, which is also quite important to us. And I think this conceptual work that's been going on in impact measures, even if they're never actually implemented or used as an auxiliary reward function for AGI, is quite useful for understanding what this corner of human preference space looks like, and what the possible trade-offs are between different things we might want.
A lot of these test cases for impact measures - you could also think of them as test cases for learned reward functions in general. Even if those aren't specifically trying to measure impact, you would still have some of the same trade-offs, but they wouldn't be as obvious as when you are explicitly trying to measure impact, which gives you a simple toy model of one part of human preferences: the idea of doing no harm. And then we start seeing that, oh, you have this issue with interference, or these trade-offs between addressing delayed effects versus offsetting effects that we want from achieving the goal, and so on. So I think that's a major contribution of impact measures research that I'm excited about: this conceptual work improves our understanding of these important concepts.

Okay, so you mentioned you were excited for some conceptual work to be done. I'm wondering if there are any experimental results or empirical questions that you think are pretty important for side effects research.

There's this general question of how well we can implement impact measures in more realistic environments, and whether they will work in practice. If you have some environment where there's a default behavior that causes side effects, and an alternate desired behavior where you achieve the goal without causing side effects, will the agent actually get the right incentives from the impact measure? This is something we have established in toy environments, but it's not obvious in more complex environments. If you have, let's say, an environment with an agent walking around a room, and there are some vases in the room, will the impact measure actually make the agent go around the vases and get to the other side? If you have a more realistic version of these toy environments, do you still get similar results, with the kind of scaled-up implementation of an impact measure that you would need to function in that environment? And if the environment is, say, partially observable, will the impact measure still work, and so on. I think once we have these basic experiments working with more scaled-up implementations of an impact measure - this idea of crossing a more realistic room, or a household robot that's supposed to cook dinner avoiding breaking dishes, things like that - then we can start looking at the more interesting corner cases that push the limits of the implemented impact measure. But right now we still don't have these basic tests in more realistic environments. We do have a scaled-up implementation of attainable utility that works in SafeLife, so I guess the natural next step would be to try to get that to work in a more realistic environment. I think that would be interesting to see.
All right. So, one question that I'm interested in: I guess people start working on AI safety because they see people working on AI and they think, you know, your research might have some negative side effects, one might say. I'm wondering about the research you're doing: suppose it turned out to have pretty bad consequences - and you're not allowed to say "oh, it didn't pan out and we used up a whole bunch of time"; it has to be that as a result of this research, something bad happened. What do you think the most likely story for large negative consequences from this work is? Victoria would also like to mention that if we don't manage to eliminate the interference incentive, then the agent may have an incentive to gain control of its environment to keep it stable and prevent humans from eliminating options, which is not great for human autonomy.

I think one possible bad outcome is a false sense of security. If someone implemented an impact measure and it seems to work in practice, and then somebody ends up deploying AGI with that impact measure, and the AGI finds a way to game the impact measure, or the measure doesn't cover some important cases, it could cause some bad outcomes - where maybe, if the impact measure had not been implemented, the designers would have been more careful to put in some other safeguards instead. So if it displaces safeguards that could have been more useful in that particular deployment scenario, I think that could be an issue. Or, more generally, if your impact measure is somehow mis-specified - if it captures the idea of impact poorly in practice - then you end up with bad incentives for the agent. And there are other possibilities, maybe variations on that: we have identified some poor incentives that impact measures can introduce, like the interference incentive, and we've thought about some bad kinds of offsetting that we might want to avoid, but there might be other issues we haven't thought of that this kind of auxiliary reward could introduce, that wouldn't be there otherwise. I'm not sure how likely this is, but it's possible we haven't covered all the bases.

Yeah, I guess that's an advantage of testing in, I don't know, SafeLife or more realistic examples: maybe you can see the things you wouldn't necessarily have thought of.

Yes, and I think it would definitely be useful to come up with more test cases that capture our intuitions for how we probably want the agent to behave if it has this impact measure, so we can catch these issues while we're testing these methods. So far we still don't have that many test cases, even in these toy environments, and a variety of more realistic test cases would probably help identify trade-offs between desirable properties of this auxiliary reward that we might not have thought of. So I think that would be helpful.

Okay, so we're about at the end of this episode. If listeners are interested in following you and your work, and maybe seeing new exciting results on side effects, how should they do so?
The default place to go is my website, vkrakovna.wordpress.com. That's where I list my research papers, and I also publish blog posts on AI safety there. In terms of my research trajectory, I've been shifting towards thinking about designing agent incentives more generally, broader than the work on impact measures - for example, I was working on reward tampering last year, and I'm thinking about incentives more generally these days - so I'm not sure how much I'll be working specifically on impact measures going forward. But if you're interested in this type of work in general, then definitely check out my website and my posts on the Alignment Forum.

All right, well, thanks for joining me.

Thank you, this has been fun.

And to the listeners: thanks for listening, and I hope you join us again. This episode was edited by Finan Adamson. The financial costs of making this episode are covered by a grant from the Long-Term Future Fund. To read a transcript of this episode, or to learn how to support the podcast, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.