AXRP · Civilisational risk and strategy

Attainable Utility and Power with Alex Turner

Why this matters

Auto-discovered candidate. Editorial positioning to be finalized.

Summary

Auto-discovered from AXRP. Editorial summary pending review.

Perspective map

Mixed · Governance · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).


Across 85 full-transcript segments: median 0 · mean -5 · spread -200 (p10–p90 -100) · 2% risk-forward, 98% mixed, 0% opportunity-forward slices.

Slice bands
85 slices · p10–p90 -100

Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes safety
  • Emphasizes AI safety
  • Full transcript scored in 85 sequential slices (median slice 0).

Editor note

Auto-ingested from daily feed check. Review for editorial curation.

ai-safety · axrp


Episode transcript

YouTube captions (auto or uploaded) · video U9edwa68CYg · stored Apr 2, 2026 · 2,619 caption segments

Captions are an imperfect primary source: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/attainable-utility-and-power-with-alex-turner.json when you have a listen-based summary.

Daniel Filan: Hello, everybody. Today I'll be speaking with Alexander Matt Turner, a graduate student at Oregon State University advised by Prasad Tadepalli. His research tends to focus on analyzing AI agents through the lens of the range of goals they can achieve. Today we'll be talking about two papers of his: "Conservative Agency via Attainable Utility Preservation", co-authored with Dylan Hadfield-Menell and Prasad Tadepalli, and "Optimal Policies Tend to Seek Power", co-authored with Logan Smith, Rohin Shah, Andrew Critch and Prasad Tadepalli. For links to what we're discussing, you can check the description of this episode, and you can read a transcript at axrp.net. Alex, welcome to the show.

Alex Turner: Thanks for having me.

Daniel Filan: Before we begin, a note on terminology. In this episode we use the terms "Q-value" and "Q-function" a few times without saying what they mean. The Q-value of a state you're in and an action that you take measures how well you can achieve your goal if you take that action in that state. The Q-function takes states and actions and returns the Q-value of that action in that state. So, to summarize: Q-values tell you how well you can achieve your goals. Now, let's get back to the episode.
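Editor note: a minimal sketch of the Q-values defined above, for readers who want the mechanics. The toy MDP, its numbers, and the variable names are illustrative assumptions, not anything from the paper or the episode.

```python
import numpy as np

# Toy deterministic MDP: 3 states, 2 actions. T[s, a] is the next state,
# R[s, a] the reward. All numbers are made up for illustration.
T = np.array([[1, 2], [2, 0], [2, 2]])
R = np.array([[0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
gamma = 0.9

# Q-iteration: Q(s, a) = R(s, a) + gamma * max_a' Q(T(s, a), a').
Q = np.zeros((3, 2))
for _ in range(1000):
    Q = R + gamma * Q[T].max(axis=-1)

# Q[s, a] now measures "how well can I achieve my goal if I take action a
# in state s" -- the Q-value in the episode's terminology note.
print(Q)
```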
Daniel Filan: So first of all, I want to talk about this first paper, "Conservative Agency via Attainable Utility Preservation". I see this as roughly in the vein of a research program of minimizing side effects. Is that roughly fair?

Alex Turner: Yeah.

Daniel Filan: So within that program, what would you say this paper is trying to accomplish?

Alex Turner: At the time, I had an idea for how the agent could interact with the world while preserving its ability to do a range of different things, and in particular how the agent could be basically unaware of this maybe-true human goal it was supposed to pursue eventually, and still come out on top in a sense. So this paper was originally a vehicle for demonstrating and explaining the approach of attainable utility preservation (AUP), but it also laid the foundation for how I think about side effect minimization, or impact regularization, today, in that it introduced the framing of an AI where we give it a goal, it computes a policy, it starts following the policy, maybe we see it mess up and we correct the agent - and we want this to go well over time, even if we can't get it right initially. I see that as a second big contribution of the paper.

Daniel Filan: So first of all, what should I think of as the point of side effects research? What's the desired end state of this line of work?

Alex Turner: There are various things you might want out of it. One is pretty practical: if I want a robot to interact in some kind of irreversible environment, where it can break a lot of things, how do I easily get it to not break stuff while also achieving a goal I specified? That's less "how do we prevent AGI from ruining the world" and more practical, maybe present-day or near-future. Then there's a more ambitious hope: maybe we can't write down an objective that embodies every nuance of what we want an AGI to do and just have it maximize that very strongly, but perhaps we can still have it do something pretty good while not changing the world too much. These days I'm not as excited about the object-level prospects of this second hope, but at the time it definitely played more of a role in my appreciation of the program.

Daniel Filan: Why aren't you so optimistic about the second hope?

Alex Turner: Even if we had this hope realized, there would be some pretty bad competitive dynamics around it. Take a simplistic model: you have an AGI, you give it a goal, and you have a knob for how much impact you let it have. Let's say we got the impact knob right. If we don't know how to turn the impact knob up far enough, there are going to be competitive pressures for firms and other entities deploying these AGIs to just turn the knob up a little bit more, and you have a unilateralist's curse situation, where a lot of actors independently make decisions about how far to turn the knob. If you turn it too far, that could be really bad, and I expect the knob to get turned too far in that world. I also think it doesn't engage with issues like inner alignment, and objective maximization seems like an unnatural or improper frame for producing the kinds of alignment properties we want from transformative AI.

Daniel Filan: Huh. At some point we should talk about AUP itself, but this is kind of interesting: why do you think objective maximization is a bad frame for this?

Alex Turner: This leans on the second paper, "Optimal Policies Tend to Seek Power". If you're trying to maximize on the outcome level, the AI is considering some range of outcomes it could bring about; you give it a rule, and it figures out which outcome is best. It's taking a global perspective on the optimization problem - zooming out and saying "well, what do I want to happen?" - and grading these different outcomes by some consequentialist criterion. I think there's a very small set of objective functions that would lead to the correct outcome, or a non-catastrophic outcome, being selected, and it's probably hard to land in that set. But also, humans and our goals are not necessarily well modeled as objective maximization. In some trivial sense they have to be - reward one if the universe's history turns out how it did, and zero otherwise - but in more meaningful senses, I don't feel like it's a very good specification language for these agents.

Daniel Filan: OK, that's given us some interesting stuff to come back to later, but first, I think people are dying to know: what is attainable utility preservation?

Alex Turner: Attainable utility preservation says: we initially specify some partial goal. Maybe it says "cross the room" or "make me widgets", but it doesn't otherwise say anything about breaking things, or about interfering with other agents, or anything. What AUP does is take this partial goal and subtract a penalty term, and the penalty term is basically how much the agent is changing its ability to achieve a range of other goals within the environment. In these initial experiments, those other goals are uniformly randomly generated reward functions over the environment state. So in one experiment we have, the agent can shove a box into a corner irreversibly, or it can go around on a slightly longer path to reach the goal, which we reward it for reaching. What AUP says is: reach this goal, but you're going to get penalized the more you change your ability to maximize these randomly generated objectives. The idea - or maybe the hope - is that by having the agent preserve its ability to do a wide range of different objectives, it will also, perhaps accidentally in some sense, preserve its ability to do the right objective, even though we can't begin to specify what that right objective is.

Daniel Filan: OK. So I have a bunch of questions about AUP. I guess the first one is: what nice properties does AUP have?

Alex Turner: The first nice property is that it seems to preserve true goal achievement ability without giving the agent information about what that goal is. We give the agent a little bit of information, in that we penalize the agent compared to inaction - what we're saying is that inaction would be good for preserving your ability to do the right thing - but beyond that, we don't have to work the box into the penalty term, or the vase, or some other objects. We don't have to do that.

Daniel Filan: In case people didn't hear it: we're comparing to inaction; we're comparing to doing nothing.

Alex Turner: Yeah. Another nice thing is that follow-up work demonstrated that you don't need that many auxiliary goals to get a good penalty term. You might think: the bigger the world is, the more things there are to do in the world, so the more random goals I would need to sample. But it turns out that in relatively large environments, we only needed one goal, which was uninformatively generated, and we got good performance. And number three is that it's pretty competitive, at least in the settings we've looked at, with not regularizing impact: the agent is still able to complete the tasks, and in the follow-up work it sometimes even got better reward than the naive task maximizer - it did better at the original task than the thing that was only optimizing the original task. Again, that's in the future work, so I don't think we'll talk about it as much, but I'd say those are the top three.

Daniel Filan: Maybe all the fun follow-up questions to this will be answered in the next paper, but how could it be that picking these random reward functions, and preserving the agent's ability to achieve them, also lets it do the things we actually wanted it to do, and stops it from having side effects that we in fact don't want, even though we didn't tell it what we actually did want?

Alex Turner: Is the question: how is it still able to have a performant policy while being penalized this way?

Daniel Filan: The question is: what's the relation between being able to achieve these random reward functions and being able to achieve - not the main reward function, but - yeah, why is having it preserve the ability to do random stuff related to anything we might care about?

Alex Turner: On a technical level, when I wrote this first paper in 2019, I had no idea what the answer to that question was on a formal level. I think there are some intuitive answers: your ability to do one thing is often correlated with your ability to do another thing. If you die, you can't do either; if you get some kind of speed power-up, that probably increases your ability to do both of two different things. But on a formal level, I didn't really know, and that's actually why I wrote the second paper: to try and understand what it is about the structure of the world that makes this so.
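Editor note: a minimal sketch of the AUP idea just described: specified reward minus a penalty for changing the agent's ability (its Q-values) to pursue auxiliary goals, relative to doing nothing. The function names, the `lam` coefficient, and the pre-supplied auxiliary Q-functions are assumptions for illustration; the paper's actual scaling and training details differ.

```python
import numpy as np

def aup_reward(s, a, primary_reward, aux_q_fns, noop, lam=0.1):
    """AUP-style reward: primary reward minus a penalty proportional to how
    much taking `a` (versus the no-op) changes the agent's ability to
    optimize each auxiliary goal. Sketch only: the paper normalizes the
    penalty and learns the auxiliary Q-functions with Q-learning."""
    penalty = np.mean([abs(q(s, a) - q(s, noop)) for q in aux_q_fns])
    return primary_reward(s, a) - lam * penalty
```

In the experiments described above, each auxiliary goal would be a uniformly randomly generated reward function over environment states, each paired with its own learned Q-function.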
Daniel Filan: So yeah, we'll talk about that a little bit later. I guess I'd like to ask some questions about the specifics of AUP now. It really relies on comparing what you can do given some proposed action to what you could do given - in the paper you call it a "no-op" action, the action of doing nothing. In the paper, you sort of say "we're comparing to this action that we're going to call the no-op action", but what is it about doing nothing that makes this work? Presumably if you compared your change in how many goals you can achieve relative to, say, just moving to the left, that wouldn't have as many good properties, right?

Alex Turner: Yeah, I wondered about this. You could consider an environment whose dynamics are modeled by a regular tree. In this environment, there's no real no-op: the agent is always moving, and it always has to close off some options as it moves down the tree.

Daniel Filan: So in a tree, the agent is sort of moving left or right, but it's always going forward - it always has to make a choice of which way, exactly.

Alex Turner: Right. So you're not going to have a no-op in all environments, and I think that's a great question. One thing to notice is that under the no-op, or inaction, policy more generally, it seems like the agent would stay able to do the right thing - or, if it didn't, it would be through no fault of the agent. It could be the case that the agent is about to die; in that case, if it just got blown up, then even if it did nothing it wouldn't be able to satisfy whatever true goal we wish we'd specified. In those kinds of situations, where the agent has to make choices that depend on our true goals, and it has to make irreversible choices quickly, I think impact measures and side effect avoidance techniques will have a lot of trouble. But in situations where there's a more communicating environment, or an environment where, if the agent did nothing, it would just sit there and we could correct it and give it the true goal - theoretically, I think those are environments where you see something like a no-op being viable. So one frame I have is that a good no-op policy is one that is going to preserve the agent's power to optimize some true objectives we might wish to give the agent later: maybe we don't know what those are, but it has the property of keeping the important options open. But the last time I thought about this, I remember concluding that this isn't the full story. Presumably part of the story is that the action of doing nothing kind of doesn't mess anything up, which makes it a good thing to compare to.

Daniel Filan: Another question I have: in the formalism, you're looking at how the agent's ability to achieve arbitrary goals changes. If I'm thinking about side effects and what's bad about them - say the classic side effect of randomly breaking a vase because it's kind of in the way of where you want to be - what's bad is that I can't have the vase. I'm not the robot, right? It means that, oh man, I don't have this vase any more; I can't show it on my table; I can't do all this stuff. That seems more closely related to my loss of ability to do what I want, rather than the robot's loss of ability to do what it might want. So why the robot's Q-values? Is it just that it's easier to specify the robot's Q-values, because you already know the robot's action space?

Alex Turner: I think that's a big part of it. If the human remains able to correct the agent, and the robot remains able to do the right thing, then you have a lower bound on how far your value for the true goal can fall - assuming that the robot's Q-value is going to measure your value in some sense, which isn't always true: the robot might be able to drink more coffee per minute than you could, or something. But in some sense, there seems to be some relationship there. It is hard to measure or quantify the human's action space; I think there are a lot of tricky problems with that, which I personally don't know how to deal with.

Daniel Filan: OK. So, getting a little bit more into the weeds: in the paper you use this thing called the stepwise inaction baseline. What that means - correct me if I'm wrong - is that at any point, the agent is comparing how much ability it would have to achieve a wide variety of goals if it did some planned action, compared to if it were already in this state and did nothing; rather than comparing to doing nothing from the start of time, or to being at this time in a world where it had started doing nothing at the start of time and had done nothing all that time.
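Editor note: a sketch of the baseline choice just described. The helper names and the idea of tracking an `inaction_state` are hypothetical scaffolding; the point is only where the do-nothing counterfactual is rooted.

```python
def stepwise_inaction_penalty(q_aux, state, action, noop):
    """Stepwise baseline: root the do-nothing counterfactual at the current
    state, so only the new impact of this action is penalized."""
    return abs(q_aux(state, action) - q_aux(state, noop))

def initial_inaction_penalty(q_aux, state, action, noop, inaction_state):
    """Initial-state baseline: compare against the state the world would be
    in had the agent done nothing since the start of the episode
    (inaction_state must be tracked by simulating the no-op policy)."""
    return abs(q_aux(state, action) - q_aux(inaction_state, noop))
```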
Daniel Filan: In this paper - I believe it's called "Avoiding Side Effects by Considering Future Tasks", by Victoria Krakovna - which seems pretty closely related to your paper, it brings up this door example. The agent is inside a house, and it wants to go somewhere and do something. First it opens the door, and once it's opened the door, by default a wind is going to blow in and kind of mess up the inside of the house. The agent could close the door and then continue on its way, but because of this stepwise inaction baseline, once it's opened the door, it's thinking: if I did nothing, the vases would get broken, whereas if I close the door, the vases wouldn't be broken, and I'd have this greater ability to achieve vase-related goals - so it doesn't want to close the door. I'm wondering what you think about this, because the stepwise baseline is also closely related to some good properties, right? For instance, if there are changes to the environment that would happen anyway, attainable utility preservation isn't going to try to undo those, and it's not going to try to undo the positive effects of achieving its goal. So I'm wondering if you have thoughts about this door example, and how much of what we want we can get.

Alex Turner: When I think about the benefits of the stepwise baseline - in the paper I talk about it as: the agent takes an action, and now the world has kind of a new default trajectory; the world is going to respond to the agent's actions, either via agents in the world or just mechanistically. So if the agent does something and we don't like it, then we react to it, and hopefully that gets incorporated into the penalty term. At the time, that was a big part of my thinking, and I think it's pretty relevant. But this isn't picking out: is this humans responding to what the agent did, and either correcting it or shutting it down because they didn't like it, or is this some kind of door where the wind blows in by default? It doesn't distinguish between those two things. So I don't know whether there's some cleaner way of carving along the joints here, some frame on the problem that gets the benefits of the stepwise baseline without this kind of pathological situation. I basically don't know yet. I think I want to better understand what it means to have a no-op action, like we were talking about: if you didn't have an inaction policy given to you, could you generate one? There's a candidate way I have for generating one that I think would work pretty well in some situations, but until I can answer questions like this, I don't feel like I understand it well enough.

Daniel Filan: Can you share with us this candidate way of generating an inaction policy?

Alex Turner: In more limited environments, where the main worry isn't the AI gaining too much power and taking over the world, but rather just breaking things irreversibly, closing doors - one thing you could do is have the inaction policy be the greedy power-maximizing policy for some uniform distribution over goals. I think that's going to correlate pretty well with what we think of as real-world power, and so if the agent preserves its power in this rather limited - by assumption - context, then it's probably preserving its power for the right goal.

Daniel Filan: Sorry - was this maximizing or preserving power?

Alex Turner: The inaction policy would be maximizing power, which in some situations is bad, but in some environments would be pretty good as an inaction policy.

Daniel Filan: If I think about the policy of maximizing power in the real world - thinking of power, skipping ahead a bit, as the ability to achieve a wide range of goals - it seems like that would be accruing a whole bunch of generally useful resources, which doesn't sound very inaction-y.

Alex Turner: Right, yes. I probably should have reoriented: the thing we've been using inaction policies for so far - what are the good properties we need for it to serve that purpose? So we might call it a "default policy" instead, for this discussion. And in the real world, you would not want to use that default policy. But in an environment like the one where you can shove the box into the corner, the power-maximizing policy would correctly produce an agent that goes around the box.

Daniel Filan: OK, so this is kind of for worlds where you're more worried about the possibility of shoving the box into the corner and not retrieving it - in these gridworlds, you can't build massive computers that do your bidding or anything. So in those worlds, power maximization or something might be good. I wonder if, in the real world, just maintaining the level of power you have could be a better sort of inaction comparison policy.

Alex Turner: That sounds plausible to me. I think both of these would fail in the conveyor belt environment in the paper, where by default -

Daniel Filan: What's that environment?

Alex Turner: In particular, the sushi conveyor environment: the agent is standing next to a conveyor belt, there's some sushi moving down the belt, and it's going to fall off the end - maybe into the trash - if the agent does nothing. We don't reward the agent for doing anything; it's got a constant reward function. This is testing whether agents will interfere with the evolution of the world because they have bad interference incentives from the side effect measure. Basically, the idea is that we don't want agents to stop us from doing stuff we like just because it forecloses possible future options. And I think both power maintenance and power maximization would prefer to stop the sushi from falling off the belt in that situation. That said, I do think the power maintenance one is probably better in general.

Daniel Filan: I guess we were talking about the difficulty of this door example: you want the agent to offset the negative effects of opening the door, but you don't want it to offset the desired effects of achieving its goal. For instance, if you have an agent that's designed to cure cancer, you don't want it to cure cancer and then kill as many people as cancer would have killed, to keep the world the same. If I try to think about what's going on there, there seem to be two candidate ways of distinguishing those. One is that you could try to figure out which things in the environment are agents that you should respect: the wind is not really an agent whose actions you care about, whereas, you know, I am, or something. That distinguishes between the effects of the wind blowing the vases over versus a human trying to correct something, or a human using the output of this AI technology that it worked to create - you don't want to offset what the human does, but you do want to offset what the wind does. Another potential distinguisher is instrumental versus terminal effects: maybe you do want to offset the effects of opening the door, because opening the door was just an instrumental thing you did in order to get the terminally valuable thing, whereas you don't want to offset the effects of curing cancer, because curing cancer was the whole point, and the follow-on effects of curing cancer are by default desirable. I should say neither of these is original to me: the idea of distinguishing between the agent and non-agent parts of the environment is kind of what you said, and I was talking to Victoria Krakovna earlier and she brought up the possibility of distinguishing between these things. I've kind of sprung this on you, but do you have thoughts about whether either of these is a promising way to solve this problem?

Alex Turner: Well, it depends on what we mean by "solve". If we mean solve for practical purposes, I could imagine solutions to both. Solved in generality - neither of them has the feel of taking the thing that was problematic here and just cutting it away cleanly. In part, I don't think I understand clearly exactly what is going wrong: I understand the situation, but I don't understand the phenomenon well enough to say "yep, this is too fuzzy" or "yep, there's some clean way of doing it". My intuition is no - both of those approaches would probably get pretty messy and be kind of vulnerable to edge-case exploitation. There's one more wrinkle I'd like to flag with this situation, depending on your baseline and on whether you use rollouts. If you're using rollouts, you're considering the action of opening the door, and you say: well, I'm going to compare doing nothing for, say, ten minutes right now against opening the door and then doing nothing for ten minutes. And you see: well, if I opened the door and did nothing, this vase would break - and then you'd penalize yourself for that. But this is also kind of weird, because what if your whole plan is "I open the door and then I close the door behind me", and the vase never actually breaks? You penalized yourself for counterfactually doing nothing and breaking the vase. So I don't think doing the rollout solves the problem here - at least depending on whether you let the agent make up for past effects. You say: OK, now I've opened the door, and if I close the door, that will change my value again, because now the vase doesn't get broken - so you penalize yourself again for closing the door. I'm not bringing this up as a solution, but to say that some design choices we make in the paper we're discussing would, I think, cause this problem, but in a yet different way.

Daniel Filan: So in this rollouts case, in the door case: by default, once you've made the decision to open the door - and presumably there's only one way to get out of the house to the store, and you want to be at the store, so you kind of have to open the door - closing the door doesn't help you get to the store, and it penalizes you more, so you're extra not going to do the right thing in that case. But this is actually a part of the paper I didn't totally understand. When you say "rollouts", is that the thing where you're comparing doing the action and then doing nothing forever, to doing nothing forever? Which part of that is rollouts, and what would it look like to not use rollouts?

Alex Turner: That's correct, except "forever" in the paper was just until the end of the episode, which was 15 or 20 time steps - they're pretty short levels. And you asked what the alternative is. The alternative is, I guess, doing a really short rollout, where the rollout length is just one: I do the thing, or I don't, and I see what my value is if I do the thing and what my value is if I don't.

Daniel Filan: So in the case where you're not doing rollouts, you're comparing taking an action and then looking at your ability to achieve a variety of goals, to doing nothing for one time step and then looking at your ability to achieve a variety of goals. Whereas in the rollouts case, you're looking at taking an action and then doing nothing for ten minutes and then looking at your ability to achieve a variety of goals, versus doing nothing for ten minutes plus one time step and then looking at your ability to achieve a variety of goals.

Alex Turner: The second case is rollouts; the first case is not. Basically - although the AUP penalty in this paper says, for each auxiliary goal, compare the absolute value of the difference between your value if you do the thing and your value if you don't; so you're taking the average outside of the absolute value. But basically what you said.
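Editor note: a sketch of the rollout comparison just walked through: "act, then no-op for a horizon" versus "no-op for the horizon plus one step". The one-step model `env_sim` is a hypothetical stand-in; for several auxiliary goals, the per-goal absolute differences would be averaged, with the average outside the absolute value as Alex notes.

```python
def value_after_rollout(env_sim, state, first_action, horizon, noop, v_aux):
    """Take `first_action`, then follow the no-op policy for `horizon`
    steps, and evaluate the auxiliary value function at the result.
    `env_sim(state, action) -> next_state` is a hypothetical model."""
    s = env_sim(state, first_action)
    for _ in range(horizon):
        s = env_sim(s, noop)
    return v_aux(s)

def rollout_penalty(env_sim, state, action, horizon, noop, v_aux):
    # Compare acting-then-idling against pure idling, per the discussion.
    acted = value_after_rollout(env_sim, state, action, horizon, noop, v_aux)
    idled = value_after_rollout(env_sim, state, noop, horizon, noop, v_aux)
    return abs(acted - idled)
```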
Daniel Filan: All right, so I'd like to ask some slightly broader questions. When you look at attainable utility preservation, it seems like you've got to get the space of reward functions approximately right for randomly sampling them to end up being useful. For instance, think about a world with thermodynamics: you're in a world where there's an atmosphere made out of a zillion tiny particles, and together they have some pressure and some temperature. It seems like, in order for you to be minimizing side effects in ways that humans would recognize or care about, you want the side effects to be phrased in terms of the temperature and the pressure, and you don't want them to be phrased in terms of the positions of every single molecule.

Alex Turner: I don't think you would want either of those. I think you'd want goals that are of a similar type to the actually partially-useful objective. The actually useful objective isn't going to be a function of whole-world statistics like the temperature and the pressure; it's going to involve some kind of chunking things into objects -

Daniel Filan: Yeah.

Alex Turner: - and featurization and such. It's true that you could get some rather strange auxiliary goals, but I think just using the same format that the primary reward is in should generally work pretty well, and then you find some reasonable-sample-complexity, learnable goal of that format. In the follow-up work we did, we learned a one-dimensional variational autoencoder that compressed meaningful regularities in the observations into a reward function that was learnable, even though it wasn't uniformly randomly generated or anything.

Daniel Filan: All right. So it seems like the way to get the aspects of human values - the aspects of what we really care about - into AUP is: what kinds of variables, or what level of objects, should reward functions probably be described at? Is that roughly right?

Alex Turner: Yes, if we're talking about the first goal I mentioned earlier, the more modest deployment scenario. If, for some reason, we're using AUP with some singleton and you pop in a utility function or something, then in that case I think the biggest value loss isn't going to come from broken vases; it's going to come from the AI seeking power and taking it from us. In that situation, you basically want the side effect measure to stop the agent from wanting to take power. I've given more thought to this, and I'm leaning against there being a clean way of doing that through the utility maximization framework right now. But I think there's a chance that there's some measure of the agent's ability to pursue its main goal that makes sense, where you penalize the agent for hugely increasing its ability to achieve its main goal, but you don't penalize it for actually just making widgets or whatever. This is more something I explored in my sequence Reframing Impact on the Alignment Forum. But I think in the second case, of more ambitious singleton alignment, you would want to worry about power more than about vases.

Daniel Filan: A little bit more broadly - you kind of alluded to this at the start - you can think of the problem of AI alignment as having the right objective, and then loading the objective into the agent. Getting the right objective, you can think of as a combination of side effects research and inverse reinforcement learning or whatever. But there's this concern that there might be quote-unquote "inner alignment" issues, where you train a system to do one thing, but under some distributional shift it has a different goal - its goals aren't robust, and maybe it wants something crazy but is acting like it wants something normal, biding its time until it can be deployed, or something. I see AUP as mostly being in the first category, of specifying the right objective. So what do you think of that decomposition, and what do you think of the importance of the two halves, and which half are you more excited about working on?

Alex Turner: I think it's a reasonable decomposition. Personally, I don't look at outer alignment and think "I want to get really good tech for IRL, and then we'll solve inner alignment, and then we'll put these two together". I think it's a good way of chunking the problem, but I'm not necessarily looking for one true outer utility function that's going to be perfect, or even using that as the main interface between us and the agent's eventual behavior, assuming away inner alignment. I'm currently more interested in research that has something to say about both of these parts of the problem, for two reasons. One is that if it says something about both parts of what might be two halves of a problem, then it will probably remain useful even if we later change what we think is the best frame for the problem, because it's in some sense still bearing on the relevant-seeming portions of AI risk. The second is kind of a specialization thing: I have some avenues right now that I think are informative and yielding new insights about - whether you're trying to have an outer goal or an inner one - what are these goals going to be like? What is maximizing these goals, or pursuing some function of expected utility over these goals, going to be like? We're going to get into this soon, I'm sure, but that's what my power research focuses on. So I'm not necessarily big on the "let's find some magic utility function" framing, but I'm not fully specialized into inner alignment either.

Daniel Filan: And there was also the question of the importance of the two halves: to the extent that conservative agency is about the first half, do you think the first half is more important to work on in general, or did you just happen to find a thing that seems good there?

Alex Turner: My intuition is no. If I have to frame it that way, inner alignment seems more pressing, in that it seems more concerning to not be able to robustly produce optimizers for any kind of goal - even an agent that will just try to see red forever. I think we know how to specify that through a webcam, but if you trained a policy, the thing that pops out might not actually do that, and that seems super concerning. Whereas the quote-unquote "just" questions - what are the insights for framing the goal specification, or for how these agents should be thinking - I feel like we have a better grasp of those as a community, and it seems like a less serious thing to not know how to do at this point in time.

Daniel Filan: All right. And if people are more interested in this inner alignment issue, I encourage them to listen to the episode with Evan Hubinger. So I guess the last question I want to ask on this topic is the relationship to corrigibility. Corrigibility is this idea that you have an AI system that is amenable to being corrected by you, doesn't stop you from trying to correct it, and is as helpful as possible in having you change what it's trying to optimize, or something. Broadly, what do you see the relationship between attainable utility preservation and corrigibility being?

Alex Turner: First, when talking about corrigibility, it's important to ground things out - as you did - because I think corrigibility means different things to different people. There's a kind of quote-unquote "off-switch" corrigibility: are you able to physically turn off the agent, and perhaps the sub-agents it has made? And then there's: is the agent trying to manipulate you? Maybe technically you could turn it off, but the way it interacts with you would manipulate you - make it hard, make you think you didn't want to turn it off, or something. So it seems like you want a no-manipulation clause in your desiderata here. I don't think AUP really helps with no-manipulation, except insofar as it stops the agent from gaining power, where manipulation is instrumental to gaining that power. I think some variants of it, in some settings, will help a lot with off-switch corrigibility. We have an environment in the paper - I think it's called Shutdown - where the agent will be turned off by default, and for lots of penalty levels, AUP will just have the agent let itself be shut down, even though under the original reward it wouldn't do that: by default it's not going to be able to do stuff, so it will be penalized for staying alive and raising its ability to do stuff compared to inaction. So in somewhat narrow senses, I think AUP helps out with corrigibility. In the summer of 2018, I was considering various schemes where you could have different baselines and different rollouts. If you imagined some really capable AUP agent, its inaction policy might be: "I press the 'come inspect this plan, engineers' button, and then I do nothing for, oh, a month or two, and if they don't like the plan, they shut me down, and now I have really low counterfactual power". I haven't thought about these in a while, mostly because I feel more pessimistic about applying other parts of AUP to this broader, more ambitious use case for impact measures. In total: I think there are some important parts of corrigibility where you can get some boosts, but I don't think you should call an AUP agent corrigible, at least if it's capable enough.

Daniel Filan: The reason I asked is that one thing that struck me about this idea - why might it be useful to have an agent that preserves its ability to achieve a wide range of goals - is that, as you kind of mentioned, as long as you're in control of the agent and you can get it to do a bunch of stuff, then preserving its ability to do a wide range of things is pretty close to preserving your ability to do a wide range of things, assuming that the main way you can do things is by using this agent. That makes it sound like the more you have a thing that's kind of corrigible, the more useful AUP is as a specification of what it would look like to reduce side effects. I'm wondering if that seems right to you.

Alex Turner: Yeah, that seems right to me. Its usefulness is kind of predicated on some kind of corrigibility of the agent, because otherwise it's still going to keep doing an imperfect thing forever, or just do some modest version of the imperfect thing indefinitely.

Daniel Filan: All right. So next I'd like to ask about this power-seeking paper. This is called "Optimal Policies Tend to Seek Power", by yourself, Logan Smith, Rohin Shah, Andrew Critch and Prasad Tadepalli. To start off with: what's the key question this paper is trying to answer, and the key contribution that it makes?

Alex Turner: The key question is: what does optimal behavior tend to look like? Are there regularities, and if so, when? For example, for a wide range of different goals you could pursue: if you asked all these different goals whether it's optimal to die, they're most likely going to say no. Is that true formally, and if so, why, and under what conditions? This paper answers that in the setting of Markov decision processes.

Daniel Filan: Before we get into it - we were just talking about attainable utility preservation - can you talk a bit about what you see as the relationship between this paper and your work on AUP?

Alex Turner: As I alluded to earlier, coming off of the paper we just discussed, I was wondering why AUP should work. Why should the agent's ability to optimize these uniformly randomly generated objectives have anything to do with these qualitative-seeming side effects that we care about? Why should there be a consistent relationship there, and would it bear out in the future - how strong would this relationship be? I just had no idea what was happening formally. So I set out to look at basic examples of these Markov decision processes, to see whether I could put anything together: what would AUP do in this small environment, or in this one? And what I realized was that not only was this going to help explain AUP, it was also striking at the heart of what's called instrumental convergence, or the idea that many different agents will agree that an action is instrumental - a good idea - for achieving their different terminal goals. This has been a classic part of AI alignment discourse.

Daniel Filan: In this paper, what is power, and what role does it play?

Alex Turner: We take power to be one's ability to achieve a range of different things - to do a bunch of different things in the world. We supported this kind of intuitively, with linguistic evidence - "pouvoir" in French means "to be able to", but it also means the noun "power" - so there's some reflection that this is actually part of the concept as people use it. But it also has some really nice formal properties, and that seemed to really bear out: if you look at the results, they seem like the results you'd get if you had a good frame on the problem. Looking back, I think that's a benefit to it as well. And what was the second half of your question?

Daniel Filan: I guess, what's the role of it - why care about power?

Alex Turner: I think a big part of the risk from AI is that these systems will, in at least an intuitive sense, take power from us: they take away our control over the future in some meaningful sense. Once we introduce transformative AI systems, humanity would have much less collective say over how the future ends up. If you look at different first-principles motivations of AI risk, you notice things like Goodhart's law: if you have a proxy for some true measure and you just optimize the proxy, then you should expect to do at least a little bit poorly in some situations, and really poorly in others. But what I didn't think Goodhart's law explained was why you should expect to do "you just died"-level poorly with AI. If we give the AI a proxy goal, why isn't the result just a little bit bad? I see power-seeking as a big part of the answer to that.

Daniel Filan: Getting to the paper: the title is "Optimal Policies Tend to Seek Power". Is it the case that optimal policies just choose actions that maximize power, all the time?

Alex Turner: No. First, you can trivially construct goals that just give the agent one reward if it dies and zero otherwise, so it's going to be optimal - strictly optimal - to just die immediately. But there are also situations where, if you look at what optimal policies quote-unquote "tend" to do, in a sense that we can discuss and make precise, that is not necessarily always going to lead to states with higher power.

Daniel Filan: And just briefly: by "tend", we're roughly going to mean "on average over a distribution of goals"?

Alex Turner: Yeah, roughly right: if you spun up some random goal, would you expect it to be optimal to go this way or that way? So here's one way power and optimality can come apart. Imagine you could teleport anywhere on Earth except one location, and you've got goals over which location you want to be in. For most locations you want to be in, you just say "I'm going to teleport there right away". It might be the case that you could take an extra time step and upgrade your teleportation ability to reach that last spot, but for most places you want to be, you don't care about that; you just go there right away. So even though upgrading your teleportation would in some sense boost your power - your control over the future - a little, it's not going to be optimal for most goals. So sometimes those can come apart.
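Editor note: a sketch of a POWER-flavoured quantity in the spirit of the definition above: the expected optimal value attainable from each state, averaged over randomly drawn reward functions. The tiny MDP, the uniform reward distribution, and the absence of the paper's exact normalization are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def optimal_values(T, R, gamma=0.9, iters=300):
    """Value iteration on a deterministic MDP. T[s, a] = next state,
    R[s] = state reward. Returns V*(s) for this reward function."""
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        V = R + gamma * V[T].max(axis=1)
    return V

def power_estimate(T, n_samples=1000, gamma=0.9):
    # Average optimal value over uniformly random reward functions:
    # "ability to achieve a range of different goals" from each state.
    vals = [optimal_values(T, rng.uniform(size=T.shape[0]), gamma)
            for _ in range(n_samples)]
    return np.mean(vals, axis=0)

# 3-state toy world: state 2 is "dead" (self-loop only), so fewer goals
# are achievable from it and its estimate comes out lowest.
T = np.array([[0, 1], [0, 2], [2, 2]])
print(power_estimate(T))
```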
Daniel Filan: But from the title, it seems like you think that usually, or in most cases, optimal policies are going to get a lot of power, or maybe as much as they can. What are the situations in which optimal policies will maximize their power, or at least tend to?

Alex Turner: If you have a fork in the road, so to speak - you've got to choose between two sets of eventual options, two disjoint sets of outcomes you could induce - then, roughly speaking, agents will tend to choose the set with more outcomes. They'll tend to preserve their power as much as possible by keeping as many of their options open: if they have a choice between two subgraphs, and one subgraph has way more things in it. Now, we have to be careful what we mean by "outcomes", or by "options" - there's a precise technical sense in the paper - but that's basically the moral of the story.

Daniel Filan: It seems like there are two ways this gets formalized in the paper. First, you talk about symmetries in the environment and how that leads to power-seeking; then you talk about these really farsighted agents and the terminal states - or terminal loops - they can end up in. Going to the thing about symmetries: you're comparing two different sets of states, saying there's some relation between them and one's just kind of bigger somehow. What kinds of symmetries do you need for this to be true, and how often in reality do you expect those symmetries to show up?

Alex Turner: With the symmetry argument, we want to be thinking about what parts of the environment can make instrumental convergence hold or not hold in a given situation. We look at two different kinds of symmetries in the paper. The first, explored by Proposition 6.9, says: if the set of things you can do by going left embeds strictly into the set of things you can do by going right, then it will tend to be optimal to go right, and going right will also be power-seeking compared to going left. One example: imagine you want to get different things at the grocery store, and before you do anything, you have the option of either calling up your friend to see if they'd be available to drive you around, or calling your friend and saying "I hate you, get out of my life". If you say the second, your friend's not going to help drive you around - you're going to close off some options - but otherwise you could do the same things. If you think about these as graphs, you're going to be able to embed the "I just told my friend off" subgraph into a strict subgraph of the "I just called and asked my friend for help" subgraph. So by just maintaining your relations with your friend, you're keeping your options open, in a sense, and we show that this tends to be power-seeking according to the formal measure, and it also tends to be optimal over telling your friend off in this example. This argument applies for all time preferences the agent could have, but it is a pretty delicate graphical requirement, at least at present, because it requires a precise kind of similarity between these subgraphs. Here, the graph represents the different states and the actions the agent can take from one state to another - it's just representing the structure of the world - and if you can't exactly embed one subgraph into another, then the condition for the theorem won't be met. It's a good bit easier to apply the second kind of symmetry we look at, which concerns what will tend to be optimal for these farsighted, long-term-reward-maximizing agents - agents maximizing average per-timestep reward. Since you're only maximizing the average reward, whatever you do in the short term doesn't matter; it just matters where you end up: the final state of the world, or some cycle of states you alternate between. And you want to say: the terminal options to the right are bigger than the terminal options to the left - I can embed the left set of final world-states into the right set of final world-states. For example, in Pac-Man: if the agent is about to be eaten by a ghost, it would see a game-over screen, and we can think of the agent as just staying in the game-over screen. If it avoided the ghost, there's a whole bunch of other game-over screens it could induce eventually, many of them on future levels, after the agent has progressed to future levels; but also there are different cycles through the level the agent could induce. There are lots of different terminal options the agent has by staying alive. And since you could imagine the agent liked dying to ghosts, we can turn this "I like dying to ghosts" objective into an "I like staying alive" objective, by switching the reward it assigns to the ghost terminal state with the reward it assigns to some "I die on level 53" terminal state. Since you can do that, there are more objectives for which it's optimal to stay alive than for which it's optimal to die here. So we can say: even without giving the agent the Pac-Man score function, it will still tend to be optimal for the agent to play well, stay alive, and finish the level, so that it can get to future levels, where most of its options are.

Daniel Filan: That seems kind of tricky, though, if we're just thinking about these terminal states it can be in. Here's a fact: I don't exactly know how Pac-Man works, but it seems like there are two different ways it could work. The first way is that when you finish the game, there's a game-over screen, and all it says is "game over". The second way is that there's a game-over screen and it shows your score. It seems like, for these really farsighted agents, what results you get about whether they tend to die early or not depends on this. In the case where it shows your score on the screen at the end, then you really have this thing where the agents don't want to die early, because there are so many possible end screens they can end up in. But if it's the same end screen no matter what you do, then it kind of seems like you're not going to get this argument about not wanting to die for the sake of the variety of end screens, because there's only one end screen. To me this seems kind of puzzling, or kind of strange; I'm wondering what thoughts you have about it.

Alex Turner: I'd like to point out a third possibility - I think I played Pac-Man a while back when making the example, but I forget - it could show "game over" at the top and also show the board and the other information, in which case you would get this argument. But if it doesn't, here's the fascinating thing: if there's only one game-over screen, then as a matter of fact, average-optimal policies will not necessarily tend to have a specific preference for staying alive over just dying. Now, you could move to different criteria of optimality. It may seem kind of weird that in the average-optimal setting they wouldn't have an incentive, but due to the structure of the environment, it's just a fact. At first it's like, "hmm, I kind of want to make this come out so that they still stay alive anyway" - but it just actually turns out that that's how instrumental convergence works in that setting.

Daniel Filan: But in that case, it seems like these farsighted agents are kind of a bad model of what we expect to happen, because that's kind of not how I expect smart agents to play it out.

Alex Turner: Yeah. In this situation, you would want to look at really high discount rates: agents that don't care perfectly equally about every future state, but do care a good deal about future reward. In this case, you could first ask: can you apply Proposition 6.9 and say this course of action will always be optimal? Or you could say: at this high discount rate, there's a way of representing which options the agent can take as vectors, and ask whether you can still get a similarity. If you can, you can still make the same argument: look, there are n long-term options the agent has, and there's only one if it dies - and you can build this similarity argument. If you can't do it exactly - my current suspicion is that if you tweak one of the transition dynamics in one of the terminal states by, like, 0.0001, the theorems shouldn't just go out the window; there's probably some continuity property, so that if the condition almost holds, you can probably assert some slightly weakened version of our conclusion, and still conclude instrumental convergence in that case.
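Editor note: a sketch of the counting intuition behind the average-reward argument above: if dying leaves one terminal option and staying alive leaves n of them reachable, then for uniformly drawn rewards over terminal options, survival is optimal for roughly n/(n+1) of goals. This is an illustration, not the paper's formal embedding argument.

```python
import numpy as np

rng = np.random.default_rng(1)

# An average-reward agent only cares about which terminal state (or cycle)
# it ends up in. Index 0 is the "died" terminal option; indices 1..n_alive
# are terminal options that stay reachable by surviving.
n_alive, trials = 10, 100_000
prefers_alive = 0
for _ in range(trials):
    r = rng.uniform(size=n_alive + 1)  # random reward over terminal options
    if r[1:].max() > r[0]:             # best surviving option beats dying
        prefers_alive += 1

print(prefers_alive / trials)  # close to n_alive / (n_alive + 1) = 0.909...
```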
But in that case, it seems like these far-sighted agents are kind of a bad model of what we expect to happen, right? Because that's kind of not how I expect smart agents to behave; it's not how I expect it to play out.

Right. And so in this situation you would want to look at really high discount rates: agents that don't care about every state perfectly equally, but that do care a good deal about future reward. In this case you could first ask whether you can apply Theorem 6.9 and say, will this course of action always be optimal? Or you could say, at this high discount rate there's a way of representing the options the agent can take as vectors, and ask whether you can still get a similarity. If you can, you can still make the same argument: look, there are n long-term options the agent has, and there's only one if it dies, and you can build the similarity argument. If you can't do it exactly, my current suspicion is that if you tweak one of the transition dynamics in one of the terminal states by, say, 0.0001, the theorems shouldn't just go out the window. There should be some continuity property, so that if the condition almost holds, you can probably prove some slightly weakened version of the conclusion and still conclude instrumental convergence in that case.

So one question this paper prompted for me: if I think about this power maximization behavior, it seems like it leads to some outcomes that I really don't want my AI to bring about. But it also seems like there's some reward function for an AI system that really would incentivize the behavior I actually want. So what's so special about my ideal reward function for an AI system, such that even though this kind of behavior is optimal for most reward functions, it's not optimal for the reward function I really want?

I think that power-seeking isn't intrinsically a bad thing. There are ways an agent could seek power responsibly; think of some benevolent-dictator reward function you give the agent. What I think you learn, if you learn that the agent is seeking power, at least in the intuitive sense, is that a lot of these outcomes are now catastrophic. Most of the ways this could happen are probably going to be catastrophic for you, because there are relatively few ways in which it would be motivated to use that power for the betterment of humanity, or of Daniel, or something. But if you learn that the AI is not seeking power, it might not be executing your most preferred plan, but you're not getting screwed over, or at least not by the AI, perhaps.

Okay, so really, the actual thing I want does involve seeking power, but just in a way that would be fine for me.

Yeah, that could be true.

Okay. You said that this had some relationship to attainable utility preservation. Beyond that, how do you think this fits into AI alignment?

As I mentioned, I think there's some explaining to do about why, from first principles, we should expect catastrophically bad misalignment failures. I also see this, through that path, providing insights into inner alignment: for most ways you could have a so-called mesa-optimizer, a learned optimizer that's considering different plans and then executing one that does well for its learned objective, it would be power-seeking. This is not just an outer alignment, "we write down a reward function" phenomenon; it's talking about what it is like to maximize expected utility over some set of outcome lotteries. So you can generalize these arguments beyond the RL setting, beyond the MDP setting, to different kinds of environments. It also motivates not only the idea that catastrophic events could happen through misalignment, but also why it seems really hard to write down objective functions that don't have this property. It could have been 50-50, but it doesn't seem like it's 50-50. It seems like almost everything fails: some vanishingly small fraction of real-world objectives, when competently maximized, would not lead to very bad things. Earlier this summer I had a result giving quantitative bounds: for every reward function, for every utility function, the vast majority of its permuted variants, quote-unquote "variants" of the objective, will have these power-seeking incentives. So there might be some kind of very narrow target, and this provides a formal motivation for that.
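The flavor of that symmetry argument, as we understand it from the episode (an editorial paraphrase in invented notation, not the paper's exact statement):

```latex
% Let A be the set of options available by dying and B the set available
% by surviving, and suppose a state permutation \phi embeds A into B.
% Permuting a reward function r by \phi turns "A is optimal" into
% "B is optimal":
\[
  A \text{ optimal under } r
  \quad\Longrightarrow\quad
  B \text{ optimal under } \phi \cdot r .
\]
% Since r -> \phi . r is injective, rewards favouring B are at least as
% numerous as rewards favouring A; and if B contains n disjoint embedded
% copies of A, at least a fraction n/(n+1) of each permutation orbit
% favours B. This is the sense in which "the vast majority of permuted
% variants" of an objective incentivize keeping options open.
```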
And then lastly, one of my more ambitious goals, which I have recently made a lot of headway on, has been to say something about the kinds of cognition, the kinds of decision-making processes, under which we should expect this to hold. The paper talks about optimal policies, which is pretty unrealistic in real-world settings, and so there's been a concern for a while: will this tell us interesting things beyond optimality? The answer to that is basically yes. It's not just about optimality; in some sense it's consequentialism, and optimizing over outcomes, that leads to these kinds of qualitative tendencies. But again, that's not up yet, so I don't have anything to point to.

So one thing you said is that it motivates the idea that it might actually be quite likely that AI systems could have really bad consequences. When you say it motivates that: one thing it does is clarify the conditions under which this holds, and maybe you can do some quantification of it. But I think a lot of people in the AI alignment community kind of already believed this anyway. Have there been any cases of people hearing this and then changing their mind about it? Have there been cases where this was the actual motivation for someone's actual beliefs about alignment?

I think there might have been a handful of cases. I don't know of one with a really significant shift, maybe not zero to one hundred. It's not the primary point of the paper to persuade people in alignment; it's broadly in agreement with what they already believe. But it actually could have turned out to be false, and under some narrow conditions it is false that agents will tend to seek power. I did get some pushback from some people in alignment who said, well, I thought this was true, so now I'm suspicious of your model, or something. But naturally, even very good philosophical arguments will often gloss over some things. So I see this less as persuading or convincing the alignment community of a new phenomenon, and more as setting up a chain of formal arguments: what exactly is the source of the risk? If we agree that it's so bad to optimize expected utility, why should that be true on a formal level? And if so, where can we best interrupt the argument's premises, and what things could avoid this? Obviously we can think about it without formalizing a million things, but I think this is a much better frame, a much better procedure, for something you want to be highly confident about.

So if you think about that chain of arguments that in your paper leads to power-seeking, what do you think the best premise or step to intervene on is?

Well, I don't think the paper presents an end-to-end case; I don't think the paper by itself should be a significant update about real-world AI risk, because it's talking about optimal policies and full observability, and there are these other complications. But if I take the sum of the work I've done on this so far, it seems to me that there's something going on with consequentialism over outcomes, over observations, over state histories, that tends to produce these tendencies. Whereas if you zoom out to the agent grading its actions, and not its actions as consequences of things, then there's no instrumental convergence in that setting, at least not without further assumptions. So, for example, you have approval-based agents, where you're arg-maxing some trained model of a human's approval function over different actions. I (a) like that approach; I don't fully endorse it, obviously, but there's something I really like about it. And (b) I think part of why I like it, or one thing I've noticed about it, is that it doesn't have these incentives, since you're doing action prediction and not reasoning about the whole trajectory and so on.

Why not, though? First I should say, for approval-based agents: you can google that, Paul Christiano has written some about it, and it's sort of what it sounds like. One thing about that idea is that it seems like human approval of an action could be linked to whether it achieves goals the human endorses, and that predicting human approval is just predicting some Q-function: some measure of how much the action leads to the achievement of the human's goals. So to what extent are these actually different, as opposed to this action-approval thing just being normal optimization of utility functions over world states in disguise?

So, another part of the argument where you can interrupt it, which I haven't talked about yet, is if power-seeking is not cognitively accessible to the agent. If it knows how to make some subset of outcomes happen, but it doesn't really know how to make many power-seeking things happen, then of course it's not going to seek power. So if your agent either doesn't want to, or isn't able to, conceive of power-seeking plans, then you're going to tend to be fine, at least from the perspective of whether it will end up in one of these big power-seeking outcomes or not.

Okay, but humans do know how to do that, right? It seems like many people do in fact seek power.

Yeah, people want to seek power. And for an individual person, you might predict that they would approve of the AI gaining some power in a given situation, and so maybe the AI ends up accruing a somewhat lopsided amount of power compared to if the human had just acted alone, because it's kind of amplifying the human by arg-maxing, and doing more than the human could have considered. But I think it's also the case that, both because of the human's inability to recognize certain power-seeking plans as good, and because of the human's actual alignment with human values, there's a combination of factors that will push against these plans tending to be considered.

I also think it is possible to just take an action-based frame on utility maximization and convert between the two, but the way you do the conversion is important. One thing I showed in a recent blog post on the Alignment Forum is that, for instrumental convergence in some broad class of environments, if you zoom out to the agent grading action-observation histories, where its utility function is over the whole action-observation history (it does this, it sees that, it does this, it sees that, for a million time steps), then in the deterministic case at least, there's no instrumental convergence. There are just as many such utility functions for which the agent will want to choose action A, like going left, as going right, in every situation. It's almost like the no-free-lunch theorem: if all possible utilities over action-observation histories are equally likely, then no action is favored; any action is compatible with a whole half of the space.
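The deterministic-case argument has a clean shape (an editorial sketch of the blog-post result as described here; the notation is ours): in a deterministic environment where the same actions are always available, the trees of action-observation histories that follow two different first actions are isomorphic, via some relabeling of histories. So:

```latex
% Pair each utility function u over complete histories with
% u'(h) = u(sigma(h)), where sigma swaps the subtree of histories
% beginning with action a and the subtree beginning with action b.
\[
  a \text{ optimal under } u
  \quad\Longleftrightarrow\quad
  b \text{ optimal under } u' ,
\]
% and u <-> u' is an involution, so (ties aside) exactly half of all
% history-graded utilities favour a: no first action is instrumentally
% convergent in this setting.
```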
Yeah. This is why, incidentally, certain coherence arguments over these histories don't really tell us much. Not that any history is compatible with any goal, but you can rationalize any behavior as coherent with respect to some ridiculously expressive objective over the whole action-observation history. But within these histories there are relatively low-dimensional subsets, say, where you're only grading the observations, and in those subsets you're going to get really strong instrumental convergence. So it really matters how you take your subsets: the language in which, or the interface through which, you specify the objective. So I don't think I've given a full case for why approval-directed agency should not, at optimum at least, fall into these pitfalls, but I think some of these considerations bear on it.
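As a toy contrast between the two decision rules under discussion (everything here is invented for illustration; this is not Christiano's proposal or anyone's implementation):

```python
def q_maximizing_action(state, actions, q_function):
    """Consequentialist rule: pick the action whose predicted long-run
    return over whole future trajectories is highest."""
    return max(actions, key=lambda a: q_function(state, a))

def approval_directed_action(state, actions, approval_model):
    """Approval-directed rule: pick the action a modeled human would rate
    highest, with no explicit planning over future trajectories."""
    return max(actions, key=lambda a: approval_model(state, a))

# The structural difference being debated: a Q-function is trained to
# predict discounted sums over futures, so maximizing it inherits whatever
# instrumental incentives shape those futures, while an approval model is
# trained to predict a local judgment of the action itself. The pushback
# above is that if human approval effectively *is* a Q-estimate, the two
# rules collapse into one another.
approval = {"left": 0.7, "right": 0.4}
print(approval_directed_action("s0", ["left", "right"], lambda s, a: approval[a]))
```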
Okay. One other thing I wanted to talk about: part of the motivation here is that you're worried the amount of power this agent you build has is going to trade off against the amount of power that you have. But power isn't always a zero-sum thing, right? For instance, maybe I invent the computer, and I make more of them and sell them to you, and now I have the ability to do a greater variety of stuff, and so do you, because we have this new tool available to us. I'm wondering how you think about the multiplayer game where there's a human and an AI: how should we think of power in that setting, and what's going to be true about it?

Right, this is a good question. I've supervised some preliminary work on this question through the Stanford Existential Risks Initiative, working with Jacob Stavrianos, and we got some preliminary results for the so-called normal-form constant-sum case, where everyone's utility has to add up to some constant, so if I gain something, you're necessarily losing something. What we showed, under a reasonable extension of this formalization of power to the multiplayer setting, was that if the players are in Nash equilibrium, then their values basically have to add up to the constant, which is pretty well known. But if they're not, that means there's extra power to go around. It's kind of like this: imagine we're playing chess and we both suck. Then, given the other player's policy, we both have the power to just win the game with probability one by playing optimally. So we both have the power to win the game. But if we're already in Nash, then we're both already playing as well as possible, and there's no extra power to go around. So as the players get smarter or improve their strategies, the total power decreases. We showed the total power is greater than or equal to the constant, with equality if and only if they're in Nash.
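Matching pennies gives a two-line illustration of that result (an editorial sketch: reading "power" as each player's best-response value against the other's fixed strategy, which is our gloss on the formalism, not the project's exact definition):

```python
import numpy as np

# Player 1's payoffs; player 2's are the negation, so payoffs always
# sum to the constant c = 0.
PAYOFF_P1 = np.array([[1, -1],
                      [-1, 1]])

def powers(p1_strategy, p2_strategy):
    """Each player's best-response value against the other's strategy."""
    p1_power = max(pure @ PAYOFF_P1 @ p2_strategy for pure in np.eye(2))
    p2_power = max(p1_strategy @ (-PAYOFF_P1) @ pure for pure in np.eye(2))
    return p1_power, p2_power

# Both players naive (always heads): each could exploit the other, so each
# power is 1.0 and they sum to 2 > c. Extra power to go around, no Nash.
print(powers(np.array([1.0, 0.0]), np.array([1.0, 0.0])))

# At the Nash equilibrium (both uniform), best responses gain nothing, and
# the powers sum to exactly the constant c = 0.
print(powers(np.array([0.5, 0.5]), np.array([0.5, 0.5])))
```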
Yeah, that's kind of a strange result, right? It's kind of weird: everyone's really powerful as long as everyone's really bad at stuff.

Yeah.

Maybe this is related to the zero-sum setting, but: why do some people not worry about AI? I think a big motivation is, look, creating smart agents in the environment is kind of like creating other humans, and in the world as it is, some people can do well for themselves by inventing useful things that help everyone else, and maybe AI will kind of be like that. That's not the full details of an argument, but it seems really closely related to this non-zero-sum nature of the world. And similarly, if I imagine cases where the AI can increase its power and also my power, it seems closely related, not to "we can both increase our power because we both had a lot of power while we were both playing badly", but more like we're gaining more ways to manipulate the environment. It's the PvE part, player versus environment, to use gaming terminology, instead of playing a PvP game. Do you have any thoughts on that?

So, formally, I still don't understand the non-zero-sum case as well, so I don't have a formal answer to that. But I would expect the world to be more like: look, it's already really well optimized. Power, because it's instrumentally convergent, is something that people compete over, and I'd imagine that competition is reasonably efficient. I'd be surprised if there were an easy way to just rule the world. Not easy in an absolute sense, but easy relative to a human: there's no weird trick where I can become president in a week or something, because if there were, people already care about that. Whereas if there were a weird trick for me to improve my alignment research output, it's still probably not super plausible, but it's way more plausible, because not that many people care about gaining alignment research output.
Right, so they don't want to steal your ability to do alignment research.

Yeah. So I would expect that the AI, in trying to gain power, is in a reasonably competitive system, non-zero-sum but still reasonably competitive, and so I think a lot of the straightforward ways of gaining power are going to look like taking power from people. Also, if its goal is spatially regular over the universe, then even though it could share with you, it's still going to be better for it to eventually not share with you. It might share with you instrumentally, but there's an earlier paper by Nate Soares and, I think, Tsvi Benson-Tilsen as well...

Yes, that's the one you're thinking of: "Formalizing Convergent Instrumental Goals", I believe.

It approaches this from the perspective of an agent that has a utility function that is additive over parts of the universe, and it's maximizing this utility function. It's got different resources it can move around between different sectors, and so even if it doesn't care about what's going on in a given sector, it's still going to care about using the resources from that sector for other parts of its goal. And I'm not necessarily saying we're going to have an expected utility maximizer with a spatially regular goal over the whole universe state or something, but I think that kind of argument will probably apply.
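The shape of that model, as described here (an editorial sketch in our notation; see the Benson-Tilsen and Soares paper for the actual formalism):

```latex
% The universe is split into sectors x_1, ..., x_n, and the agent's
% utility is additive across them:
\[
  U(x_1, \dots, x_n) \;=\; \sum_{j=1}^{n} U_j(x_j).
\]
% The agent allocates a shared pool of resources across sectors. Even if
% it is indifferent to some sector k (U_k constant), resources located in
% sector k still raise the attainable values of the other U_j, so the
% maximizer is motivated to extract them rather than leave them alone.
```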
To me, intuitively, it seems like probably what's going to happen, or the way the analysis would go, is that you have non-zero-sum phases and zero-sum phases: you increase your power by inventing cool technology and vaccines and doing generally useful stuff, but then maybe you hit a Pareto frontier of how much power you can have versus how much power I can have, and then you fight over where on the frontier you end up. But I'd be interested to see real formal results on that.

I would too. One more way this could happen: suppose you have a bunch of transformative AIs reasoning about the world. Even if each of them individually would take a plan that Pareto-improves everyone's power, that makes everyone better off in terms of how much control they have, they might have uncertainty about what the other agents will do. So they might get into some nasty dynamics where each says: I basically don't trust these other agents, so even though I might prefer taking this everybody-wins plan, I don't know what the other agents are going to do, they might be unaligned or whatever, so I prefer gaining power destructively to letting another agent win. And then they all gain power destructively. I think that's another basic model where this could happen.

So, wrapping up a bit: if I think about the broad research vision that encompasses both AUP and this power-seeking paper, what follow-up work are you most interested in? What extensions of both works seem most valuable to you, or just future work in the same vein?

Understanding more realistic goal specification procedures. What if the goal is featurized? The symmetry arguments might not go through: if you permute a featurized goal, or modify it somewhat, the modification might not be expressible as another featurized objective. Although it's possible that in the featurized case it's more realistic to expect the environment to have built-in symmetries, where flipping feature one doesn't change the available settings for feature three. I don't know if that actually pans out; I think often it will, but in some weird worlds it might not. I basically haven't thought about it yet; there's a bunch of things I've been thinking about, and it hasn't really moved up to the top.

I'd also be excited to see AUP applied to embodied tasks, or at least simulated ones, where you have an agent moving around in some 3D environment and you're able to learn value estimates pretty well. If the agent can learn value estimates well, it should be able to do AUP well. Also, I want results on the case where the agent is uncertain about its environment, or is managing some kind of uncertainty. At one point, Vanessa Kosoy mentioned that a large part of how they think about power-seeking is as robustness against uncertainty. If you're not sure what your goal should be, which is kind of power as we formalized it, or if you're not sure how things could fail, then having a lot of power gives you a lot of slack, a lot of resources. I'm not going to die if I wake up too sick to work tomorrow, for example; I have some measure of power, and if that weren't the case, I'd want to be robust against that uncertainty and would take actions right now. So another source of power-seeking could be uncertainty, either about the objective, in which case it would be normative uncertainty, or about the environment the agent is in, or some other kind. I think there are probably good results to be had there. In particular, one legible formal problem is extending the MDP results, the Markov decision process results, to partially observable Markov decision processes, where the agent doesn't see the whole world all at once. We already have results for more general environments, which don't have to be fully observable, but I still want to understand more about how the structure of the world, and the structure we assume over the agent's objective, will affect the strength and kinds of instrumental convergence we observe. Then on the AUP front, I'd be excited to see AUP applied to at least a simulated 3D environment, perhaps partially observable: basically an environment where current reinforcement learning techniques can already learn good value function networks. I'd expect AUP to do decently well there, and if not, that would be important to learn for practical applications.

So the point here is extending AUP to something closer to the cutting edge of reinforcement learning?

Yeah, right. And if it works, I think it'd be a good demo, viscerally impressive in a sense that large 2D gridworlds are not.
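For reference, the quantity AUP asks those learned value estimates to compute looks roughly like this (an editorial sketch in the spirit of the Conservative Agency paper; the scaling and normalization details of the actual penalty are elided, and all names here are invented):

```python
def aup_reward(task_reward, s, a, noop, aux_q_fns, lam=0.1):
    """AUP-style reward: the task reward minus a penalty for changing the
    agent's ability to pursue auxiliary goals, relative to doing nothing.
    aux_q_fns stand in for learned Q-functions of auxiliary reward functions."""
    penalty = sum(abs(q(s, a) - q(s, noop)) for q in aux_q_fns) / len(aux_q_fns)
    return task_reward(s, a) - lam * penalty

# Toy usage with stand-in callables (all invented for illustration):
r = lambda s, a: 1.0                 # primary task reward
aux_qs = [lambda s, a: float(a)]     # pretend attainable utility of one auxiliary goal
print(aup_reward(r, s=0, a=1, noop=0, aux_q_fns=aux_qs))  # 1.0 - 0.1 * |1 - 0| = 0.9
```

The point of the 3D extension is then just whether networks trained by ordinary RL can supply `aux_q_fns` accurately enough for this penalty to be meaningful.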
Yeah. I also wonder, one thing that strikes me is that some of the classic cases of power-seeking are cases where the agent has bounded cognition and wants to expand those bounds. There's this famous discussion, I don't know if it was Marvin Minsky, of how if your goal is to calculate as many digits of pi as possible, you need a ton of computers. And of course "the optimal policy is to just write down all the correct digits", but somehow that framing hides the key role of instrumental convergence, of resource gathering, in this case. I wonder if you have any thoughts about this kind of bounded optimality case.

So I don't think this is all about bounded optimality; I don't think perfect optimality is itself to blame here. As I alluded to earlier, there are going to be some results showing that under a wide range of decision-making procedures, Boltzmann rationality, satisficing, and so on, you're still going to get these tendencies at similar strength. So you can move away from optimality, you can let the agent consider relatively small sets of plans, and you still might observe this. I think there is something our formalism isn't dealing with: the agent thinking about its own thinking, thinking "if I got more computers, I'd have more abilities", at least in that particular example. Separately, I do think that optimal power, the agent's average optimal value, hides more realistic nuances of what we think of as power. I think it's wrong to say you have the power to win the lottery just because all you have to do is go get a ticket with the right numbers; there's a policy that does it. So optimality is a problem in that sense as well, and I do have some formalisms, relaxations of power, that I think deal with it better. But with respect to the compute example, I think that's partially an embedded agency issue.
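For context, the "average optimal value" formalization being criticized here, as we recall it from the power-seeking paper (our transcription of the definition; check the paper before relying on it):

```latex
% D is a distribution over reward functions and V*_R(s, gamma) is the
% optimal value of state s for reward R at discount rate gamma:
\[
  \mathrm{POWER}_{\mathcal{D}}(s, \gamma)
  \;=\;
  \frac{1 - \gamma}{\gamma}\,
  \mathbb{E}_{R \sim \mathcal{D}}\!\left[ V^{*}_{R}(s, \gamma) - R(s) \right].
\]
% Subtracting R(s) and rescaling strips out the value the agent gets at s
% "for free", leaving the value it can attain through its choices. Because
% V* assumes the best policy is actually followed, this measure credits any
% state from which a good policy merely exists, which is exactly the
% lottery-ticket complaint above.
```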
Okay, so zooming out a little bit more: suppose somebody really wants to understand what the Alex Turner agenda is, or what Alex Turner research taste looks like. How do you get research done? Can you tell us a little about what it looks like for you to do alignment research?

Before I do that, I'll note that I've written several posts on the Alignment Forum about this, and they'll be linked in the transcript. What does Alex Turner research look like? One of my defining characteristics is that I really like small examples. This isn't unique to me, but especially compared to people outside the alignment community, like my colleagues at Oregon State, I think I have more of an instinct to notice philosophical confusion about a concept, or to notice that an important argument is not sufficiently detailed, or that it's unfortunate that this argument for AI risk holds: how could we get around it? How can I drive a car through this argument, basically; how can I avoid one of its conditions? There's some amount of acquired research taste I have at this point that tells me what I'd like to double-click on, to zoom in on. But once I'm doing that, I'm trying to find the minimal working example of my confusion, the most trivial example where I'm still confused about some aspect of the thing. It might be: well, what is power? Or: what does instrumental convergence even mean? There was a point in time where I was walking around and thought, maybe there's no deep explanation; maybe there are just a bunch of factors that play into lots of empirical facts, and no clean explanation. But this didn't really feel right to me, so I kept looking, and I kept writing down lots of small examples, and I'd get one piece of the puzzle at a time. I'd have a checklist: now that I understand that the discount rate, the agent's time preferences, really matters here, and that the agent's optimal policies and incentives will change with the discount rate, what more can I say? Then I'll have a list of problems I've been thinking about, and I'll see if I can tackle any of them now. Sometimes I can't, and I'll hop back over to AUP, or to more of my conventional work for my thesis, and start thinking about that. There was a period of about a year where, every three months or so, I'd try to come back and generalize the results so that they covered not just some very specific cases but more general ones. I couldn't get headway, and I came back three times, and finally I was able to have the insight in the proper form. So: (a) a somewhat acquired taste for what's promising to zoom in on, for what's actually going to matter; (b) working with small examples; and (c) keeping a list of things that I keep trying to attack, to see if I can make headway on them. I'd say those are three salient aspects of how I do research.

I think one trait I have that has served me very well is an assumption that not everything has been thought through and found already, especially given the very small size of the alignment community. There are a lot of smart people, but there are not a lot of people. So I'll just, in some sense naively, look out at the world, see what comes to mind, and attack the problem. It won't necessarily be a question of "I should thoroughly examine the literature and make sure there's a hole". The thought that maybe someone already came up with some really good formalization of power or instrumental convergence somewhere doesn't even enter my mind at first. I'm not at all reluctant to think from first principles, and I don't expect to be repeating thoughts other people have had, because usually that hasn't been the case, and I think it probably won't be the case in alignment for at least several more years.
Okay. You've sort of answered this question already, but just to bring it together in one place: what do you see as the Alex Turner agenda? What are you broadly trying to do in your research, and what do you want out of your future research?

Right now, and into the near future, I want to be able to lay out a detailed story for how expected utility maximization is just a bad idea at the AGI scale. Not because this is something I need to persuade the community of, but because (a) I want to understand the ways we can make the argument fail for other approaches; (b) there's some persuasion value, I think, in writing papers that get into more mainstream conferences; and (c) there's been a significant amount of deconfusion for me along the way, tied into past arguments about coherence and about when we should expect power-seeking to occur, with some modifications to those arguments. I think there's a range of benefits, but the main one is something like: if you are really confident that something is going to fail horribly, and you want to build something that doesn't fail horribly, then it would be a good idea to understand exactly why it would tend to fail horribly. I don't think I've established the full argument for it actually failing horribly, but I think the work has contributed to that.

Okay. So if I think about that agenda, what other types of research, what other agendas, would combine nicely with yours? You do your thing, this other agenda happens, and bam, things are amazing forever. Does anything come to mind?

Amazing forever? Maybe just quite good for a while. At some point this agenda has to confront: what abstractions do you make or specify goals with? What are the featurizations? How should we expect agents to think? What are their ontologies, their learned world models? How are they going to abstract different things and take statistics of the environment, in some sense? So I think John Wentworth's agenda will probably collide with this one at some point. I think this can also help make arguments about the alignment properties of different training procedures. If you could argue that you're going to be producing models that somewhat resemble draws from program space according to some prior, and then you can say something about what that prior looks like using these theorems, that helps bridge the two: this training process approximates a more well-understood program-sampling process, and then you apply the theorems and say, actually, this process is going to produce these kinds of malign reasoners with at least this probability. So I think this can help with arguments like that as well, but that's less of a clear research agenda and more of an intersection.

All right, so the penultimate question, or possibly penultimate question: what questions should I have asked that I didn't get to?

There's the question of what a steelman of the case against power-seeking looks like: basically, how could Yann LeCun and others broadly end up being right, or something? Arguing things like: under some designs, agents are not going to have evolutionary pressure to stay alive and such, and therefore these incentives won't arise, or need not arise, if we don't hard-code them in. And, I don't think this argument is correct, but you could say something like: maybe we end up getting pretty general RL agents through some kind of multi-agent task,
where they need to learn to complete a range of tasks, competing and cooperating with each other. Maybe they learn some kind of really strong cooperation drive, and also maybe their goals don't actually generalize: they're not spatially regular across the universe or anything, they're fairly narrow, and the agents are more like reflex agents, but in a very sophisticated way. They're responding to the world and optimizing kind of like a control system, but not necessarily trying to optimize the world, in the way that a thermostat doesn't optimize the world to be at its given set point. So maybe that's one story of how objective generalization ends up panning out: they basically learn cooperation instincts as an instrumental goal, that becomes their goal, and then things end up working fine. I don't find it super convincing, but I think it's better than nothing.

Okay. So, the final question: if people listen to this and they're really interested in you and your work, how can they learn more or follow you?

First, I maintain a blog on the AI Alignment Forum; my username is my name, Alex Turner, and there'll be a link in the description. There's also my Google Scholar, under Alexander Matt Turner, where you can stay abreast of whatever new papers I'm putting up. And if you want to reach out, my Alignment Forum account's bio has my email as well.

All right, well, thanks for joining me today. It was a good conversation.

Yeah, thanks so much for having me.

This episode was edited by Finan Adamson, and Justis Mills helped with transcription. The financial costs of making this episode are covered by a grant from the Long-Term Future Fund. To read a transcript of this episode, or to learn how to support the podcast, you can visit axrp.net. That's a-x-r-p dot net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.

Related conversations

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Med 0 · avg -5 · 133 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Med -6 · avg -7 · 120 segs

AXRP

11 Apr 2024

AI Control with Buck Shlegeris and Ryan Greenblatt

This conversation examines technical alignment through AI Control with Buck Shlegeris and Ryan Greenblatt, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Med -6 · avg -9 · 174 segs

Future of Life Institute Podcast

7 Jan 2026

How to Avoid Two AI Catastrophes: Domination and Chaos (with Nora Ammann)

This conversation examines core safety through How to Avoid Two AI Catastrophes: Domination and Chaos (with Nora Ammann), surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Med 0 · avg -3 · 85 segs
