Rohin Shah on the State of AGI Safety Research in 2021
Why this matters
This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.
Summary
This conversation surveys the state of technical AGI safety research in 2021 with Rohin Shah, a research scientist on DeepMind's technical AGI safety team, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forward · Mixed · Opportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 94 full-transcript segments: median -10 · mean -8 · spread -34–0 (p10–p90 -20–0) · 12% risk-forward, 88% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: high.
- Emphasizes alignment
- Emphasizes safety
- Full transcript scored in 94 sequential slices (median slice -10).
Editor note
A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video _5xkh-Rh6Ec · stored Apr 2, 2026 · 3,629 caption segments
Captions are an imperfect primary source: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/rohin-shah-on-the-state-of-agi-safety-research-in-2021.json when you have a listen-based summary.
Show full transcript
welcome to the future of life institute podcast i'm lucas perry today's episode is with rohan shaw he is a longtime friend of this podcast and this is the fourth time we've had him on every time we talk to him he gives us excellent overviews of the current thinking in technical ai alignment research and in this episode he does just that our interviews with rohin go all the way back to december of 2018. they're super informative and i highly recommend checking them out if you'd like to do a deeper dive into technical ai alignment research you can find links to those in the description of this episode rohin is a research scientist on the technical agi safety team at deepmind he completed his phd at the center for human compatible ai at uc berkeley where he worked on building ai systems that can learn to assist a human user even if they don't initially understand what the human user wants rohin is particularly interested in big picture questions about artificial intelligence what techniques will we use to build human level ai systems how will their deployment affect the world and what can we do to make sure this deployment goes better he writes up summaries and thoughts about recent work tackling these questions in the alignment newsletter which i highly recommend following if you're interested in ai alignment research rohin is also involved in effective altruism and out of concern for animal welfare is almost vegan and with that i'm happy to present this interview with rohan shaw [Music] welcome back rohin this is your your third time on the podcast i believe we have this this series of podcasts that we've been doing where you helped give us a year in review of of ai alignment and everything that's been up your someone i view is very core and crucial to the ai alignment community and i'm always happy and excited to be getting your your perspective on what's changing and what's going on um so to start off i just want to you know hit you with a simple not simple question of what is ai alignment oh boy excellent i i love that we're starting there um yeah so different people will will tell you different things for this as i'm sure you know the framing i prefer to use is that um there is a particular class of failures that we might be we can think about with ai where the ai is doing something that its designers did not want it to do um and it's like it's uh and specifically it's competently achieving some sort of goal or objective or or some some sort of competent behavior um that isn't the one that was intended by the designers uh this so for example if you try to build an ai system that is i don't know supposed to help you schedule calendar events and then it like also starts sending emails on your behalf to people um which maybe you didn't want it to do that would count as an alignment failure um whereas if you know a terrorist somehow makes an ai system that that can that goes and detonates a bomb in some big city that is not an alignment failure it is obviously bad um but it the ai system did what it was what its designer intended for it to do so it doesn't count as an alignment failure on my definition of the problem other people will see ai alignment as synonymous with ai safety um for those people uh that you know terrorists using a bomb might count as an alignment failure but at least when i'm using the term i usually mean you know the ai system is either uh is doing something that wasn't what its uh designers intended for it to do there's a little bit of a subtlety there where you can 
think of either intent alignment where you like try to figure out what the ai system is trying to do and then if it is trying to do something that isn't what the designers wanted that's an intent alignment failure or you can say all right screw all of this you know notion of trying we don't know what trying is how can we look at a piece of code and say whether or not it's trying to do something and instead we can talk about impact alignment which is just like the actual behavior that the ai system does is that what the designers intended or not uh so if the ai makes a catastrophic mistake where the ai like you know thinks that this is the big red button for happiness and sunshine but actually it's the big red button that launches nukes uh that is a that is a failure on impact alignment but isn't a failure on intent alignment uh assuming the ai like legitimately believed that the button was um happy happiness and sunshine i think i said so it seems like you could have one or more or less of these in a system at the same time so which do you which are you excited about which do you think are more important than the others in terms of what do we actually care about which is how i usually interpret in important uh the answer is just like pretty clearly impact alignment like the thing we care about is like did the ai system do what we want or not uh i nevertheless tend to think in terms of intent alignment um because it seems like it is decomposing the problem into a natural notion of like what the ai system is trying to do and whether the ai system is capable enough to do it and i think that is like actually a like natural division like you can you can in fact talk about these things separately um and because of that it like makes sense to have research organized around those two things uh separately but that is a claim i am making about the best way to decompose the problem that we actually care about um and that is why i focus on intently but like what do we actually care about impact alignment totally how would you say that your perspective of this problem has changed over the past year i've spent a lot of time thinking about the problem of of inner alignment um so this was this shut up too people have been talking about it for a while but it showed up to prominence in i want to say 2019 with the publication of the misoptimizers paper and i was not a huge fan of that framing but i do think that the problem that it's it's showing is like actually an important one uh so i've been thinking a lot about that can you explain what inner alignment is and how it fits into the definitions of uh what ai alignment is yeah so ai alignment the way i've described it so far is just sort of like pretty it's just talking about properties of an ai system it doesn't really talk about how that ai system was built but if you actually want to diagnose at like give reasons why problems might arise and then how to solve them you probably want to talk about how the ai systems are built and why they're likely to cause such problems uh inner alignment i don't i'm not sure if i like the the name but we'll go with it for now inner alignment is a problem that i claim happens for systems that learn um and the problem is um maybe i should explain it with an example uh you might have heard seen this post from uh less wrong about blacks and rubes uh these black legs uh are blue in color and tend to be egg shaped uh in all in all the cases they've seen so far rubes are are red in color and are cube shaped at least in all the 
cases you've seen so far and now suddenly you see a red egg shaped thing is it a black or a rube like in this case it's pretty obvious that like you know there isn't a correct answer um and this same dynamic can arise in a learning system where if it is you know learning how to behave in accordance with whatever we are training it to do we're going to be training it on a particular set of situations and if those situations change uh in the future along some access that the ai system didn't see during training it may generalize uh badly so a good example of this is came from the objective robustness and deep reinforcement learning paper they trained an agent on uh the coin run environment from procter uh this isn't this is basically a very simple platformer game where the agent just has to jump over enemies and and obstacles to get to the end and collect the coin and the coin is always at the far far right end of the level and so you know you train your ai system on you know a bunch of different kinds of levels different obstacles different enemies they're placed in different ways you have to jump in different ways but the coin is always at the end on the right uh now it turns out if you then take your ai system and test it on a new level where the coin is placed somewhere else in the level not all the way to the right the agent just continues to you know jump over obstacles enemies and so on like behaves very competently in the in the platformer game but it just runs all the way to the right and then like stays at the right or jumps up and down as though hoping that there's a coin there um and like i would it's behaving as if it has the objective of go as far to the right as possible even though we trained it on the objective um get the coin or at least that's what you know we were thinking of as the objective uh and you know this happened because we didn't show it any examples where the coin was anywhere other than the right side of the level so the inner alignment problem is when you train a system on you know one set of inputs it learns how to behave well on that set of inputs but then when it's when you extrapolated its behavior to other inputs that you hadn't seen during training uh it turns out to do something that's very capable but not what you intended can you give an example of what this could look like in the real world rather than like a training simulation in a virtual environment yeah um one example i like uh is it's it'll take a bit of a bit of setup but i think it should be fine um you know you could imagine that with honestly even today's technology we might be able to train an ai system that can just schedule meetings for you like when someone emails you asking for a meeting you're just like here calendar scheduling agent please you know do whatever you need to do in order to get this meeting scheduled i i want to have it you you go go schedule it and then it you know goes and uh emails the person who emails back saying you know rohan is free at such and such times he like prefers uh morning meetings or whatever and then you know there's a back and forth between uh and then the meeting gets scheduled for concreteness let's say that the way we do this is we take a pre-trained language model like say gpd3 um and then we just have gpt3 respond to emails um and we train it from human feedback well we we we have some examples of like people scheduling emails we do supervise fine tuning on gpg3 to like get it started and then we like fine tune more from human feedback uh in 
order to get it to be good at this task and it it all works great now let's say that in 2023 gmail decides that you know gmail also wants to be a chat app and so it adds emoji reactions to emails um and everyone's like oh my god now there's so much such a better we can we can schedule meetings so much better uh we can just like you know say here here uh just send an email to all the people who are coming to the meeting and you know react with emojis uh for each of the times that you're available and you know everyone loves this this is how people start scheduling meetings now but it turns out that this ai system when it's confronted with these emoji emoji poles is like it knows it in theory is capable or knows how to use the emoji goals it like knows what's going on but it was always trained to like schedule the meeting by email so maybe maybe it will have learned to like always schedule a meeting by email and not to take advantage of these uh new features so it might say something like hey i don't um i don't really know how to use these newfangled emoji poles can we just schedule emails the normal way like like in in our terms this would be a flat out lie but like from the ai's perspective we might think of like the ai was just trained to to say whatever you know sequence of english words led to getting a meeting scheduled by email and it predicts that sequence of words will work well would this actually happen if i actually trained an agent this way i don't know like it's totally possible i would actually do the right thing uh but i don't think we can really rule out the wrong thing either it seems that also seems pretty plausible to me in this scenario one important part of this that i think has come up in our previous conversations is that we don't know when there is always an inner misalignment between the system and the objective we would like for it to learn because part of maximizing the inner aligned objective could be giving the appearance of being aligned with the outer objective that we're interested in could you explain and unpack that yeah um so you know we in the ai safety community we tend to think about ways that ais could like actually lead to human extinction and so you know the example that i gave does not in fact lead to human extinction uh it is you know a mild annoyance at worst uh the the story that gets you to human extinction um is one in which you have a very capable super intelligent ai system uh but nonetheless there's like you know instead of learning the objective that we care that we wanted which might have been i don't know something like be a good personal assistant i'm just giving that out as a concrete example it could be other things as well instead of acting as though it were optimizing that objective it ends up optimizing some other objective um i don't really want to give an example here because the whole premise is that it could be a weird objective we don't really know could you expand that a little bit more like how it would be a weird objective that we wouldn't really know okay so let's take as a concrete example it's make paper clips which has nothing to do with being a personal assistant now why is this at all plausible the reason is that even if this super intelligent ai system had the objective make paper clips during training while we are in control uh it's going to realize that if it doesn't do the things that we wanted to do we're just going to turn it off and as a result it will be incentivized to do whatever we want until it can make 
sure that we can't turn it off and then it goes and builds its paperclip empire um and so when i say like it could be a weird objective i mostly just mean that almost any objective is compatible with this sort of a story um it does rely on sorry i'm also curious if you could explain how the like the inner state of the system becomes aligned to something that is not what we actually care about i might go back to the coin running coin run example where you know the agent could have learned to get the coin that was a totally valid policy it could have learned and this is an actual experiment that people have run um so this one is not hypothetical uh it just didn't it learned to go to the right why i mean i don't know i wish i understood neural nets well enough to answer this question for you i'm not really arguing for it's definitely going to like learn make paper clips i'm just arguing for like there's this whole set of things that could learn and we don't know which one it's going to learn which seems kind of bad is it kind of like there's the thing we actually care about and then a lot of things that are like roughly correlated with it which i think you've used the word for example before is like proxy objectives um yeah so that is definitely one way that it could happen where you know we ask it to make humans happy and it learns that smile when humans smile uh they they're usually happy and then learns the proxy objective of make humans smile and then like you know goes and tapes everyone's faces uh so that they are permanently smiling um that's a way that things could happen um but like i think i don't even want to claim that that's what like maybe that's what happens maybe it just actually optimizes for human happiness maybe it learns to make paper clips for just some weird reason i mean not paper clips maybe it decides like this particular arrangement of atoms and this novel structure that we don't really have a word for is the thing that it wants for some reason and all of these seem totally compatible with we trained it to be good to have good behavior in the situations that we cared about because it might just be deceiving us until it has enough power to unilaterally do what it wants without worrying about us stopping it i do think that there is some sense of like no paper clip maximization is too weird if you trained it to make humans happy it would not learn to to maximize paper clips there's just like no path by which like paper clips somehow become the one thing it cares about i'm also sympathetic sympathetic too like maybe it just doesn't care about anything to the extent of like optimizing the entire universe to turn it into that sort of thing um i am really just arguing for we really don't know crazy could happen i i will bet on crazy crazy this uh will happen um unless we like do a bunch of research and figure out how to make it so that crazy doesn't happen um i just don't really know what the crazy will be do you think that that example of the like agent in that virtual environment you see that as like a demonstration of the kinds of arbitrary goals that the agent could learn and that that space is really wide and deep and so it could be arbitrarily weird and we have no idea what kind of goal it could end up learning and then deceive us i think it is not that great evidence for that position um mostly because like i think it's reasonably likely that if you told somebody the setup of what you were planning to do if you told an ml researcher or an rl maybe specifically a dprl 
researcher the setup of that experiment and asked them to predict what would have happened i think they probably would have um especially if you told them hey do you think maybe it'll just like run to the right and jump up and down at the end i think they'd be like yeah that seems likely not just plausible but actually likely um that was definitely my reaction when uh i was first told about this result i was like oh yeah of course that will happen um like in that case i think we like just do know no is a strong word ml researchers have good enough intuitions about those situations i think that that it was predictable in advance though i don't actually know of anyone who predicted it in advance um so that one i don't think is all that supportive of it learns an arbitrary goal like we we had some like notion that neural nets care a lot more about like you know position and like simple functions of the action like always go right rather than complex visual features like this you know yellow coin that you have to learn from pixels uh that i think people could have probably predicted that so we touched on like on definitions of ai alignment and now we've been exploring your you know your interest in inner alignment or i think the jargon is mesa optimizers um they're they are different things there are different things could you explain how inner alignment and mesa optimizers are different yeah so the thing i maybe have not been doing as much as i should have uh is that is the like inner alignment is is the claim that like when the circumstances change the agent generalizes catastrophically in some way it like behaves as though it's optimizing some other objective than the one that we actually want so it's much more of a claim about the behavior rather than like the internal workings of the ai system that cause that behavior mesa optimization at least under the definition of the 2019 paper is uh is talking uh specifically about ai systems that are executing an explicit optimization algorithm so like the forward pass of a neural net is itself an optimization algorithm we're not talking about gradient descent here and then the metric that is being used in that you know within the neural network optimization algorithm is the inner objective or sorry the mesa objective um so it's making a claim about how then how the ai system's cognition is structured whereas inner alignment more broadly is just like the ai behaves in this like catastrophically generalizing way could you explain what outer alignment is sure inner alignment can be thought of as like you know suppose we got the training objective correctly correct suppose like you know the things that we're training the ai system to do on the situations that were that we give it as input like we're actually training it to do the right thing then things can go wrong if it like behaves differently in some new situation that we hadn't trained it on outer alignment is basically when the reward function that you specify for training the ai system is it's itself not what you actually wanted uh so for example maybe you want your ai to be helpful to you or to tell you true things uh but instead you have you train your ai system to you know go find credible looking websites and tell you what the credible looking websites say and it turns out that sometimes the credible looking websites don't actually tell you two things in that case you're going to get an ai that tells you what credible looking websites say rather than an ai that tells you what things are 
true and that's in some sense an outer alignment failure you like even the feedback you were giving the ai system was you know pushing it away from telling you the truth and pushing it towards telling you what credible looking websites will say which are correlated of course but they're not the same in general if you like give me an ai system with some misalignment and you ask me was this a failure of outer alignment or inner alignment mostly i'm like that's a somewhat confused question but one way that you can uh you can make it not be confused is you can say all right let's look at the um let's look at the inputs on which it was trained now if ever on an input on which we trained we gave it some like clearly some wrong feedback where we were like the ai like you know lied to me and i gave it like plus a thousand reward and you're like okay clearly that's outer alignment we just gave it the wrong feedback in the first place supposing that didn't happen then i think what you would want to ask is okay let me think about on the situation with situations in which the ai does something bad what would i have given counterfactually as a reward and this requires you to have some notion of a counterfactual uh when you write down a programmatic reward function the counter factual is a bit more obvious it's like you know whatever that program would have output on that input and so i think that's the usual setting in which outer alignment has been discussed and it's it's pretty clear what it means there but once you're like training for me from human feedback it's not so clear what it means like what would the human uh have given this feedback on the situation that they've never seen before it's often pretty ambiguous if you define such a counterfactual then i think i'm like yes uh if then i think i'm like okay you look at what what feedback you would have given on the counter factual if that feedback was you know good uh actually led to the behavior that you wanted then it's an inner alignment failure if that counterfactual feedback was bad not what you would have wanted then it's an outer alignment failure if you're speaking to someone who was not familiar with ai alignment for example other people in the computer science community but also policy makers of the general public and you have all these definitions of ai alignment that you've given like intent alignment and impact alignment and then we have the inner and outer alignment problems how would you capture the the core problem of ai alignment and would you say that inner or outer alignment is a bigger part of the problem i would probably focus on intent alignment for the reasons i have given before of it just seems like a more like i i really do want to focus on the cases where i like i i want to focus attention away from the cases where the ai is trying to do the right thing but like makes a mistake which would be a failure of impact alignment but i like i don't think that is the like biggest risk i think in a a super intelligent ai system that is trying to do the right thing is like extremely unlikely to lead to catastrophic outcomes though it's certainly not impossible or at least more unlikely to lead to catastrophic outcomes than like humans in the same position or something so that would be my justification for impact alignment i prefer intent alignment sorry i'm not sure that i would even talk very much about inner and outer alignment i think i would probably like just not focus on definitions and instead focus on examples the core 
argument i would make would depend a lot on how ai systems are being built as i mentioned inner alignment is a problem that according to me afflicts primarily of learning systems i don't think it really affects planning systems what is the difference between a learning system and a planning system um a learning system you like give it examples of how it should of of things that you do how it should behave and then like changes itself to like to to do to do things more in that vein a planning system takes a like formally represented objective and then uh searches over possible hypothetical sequences of actions it could take in order to achieve that objective and if you consider a system like that it just you can try to make the inner alignment argument and it just won't work which is why i say that the inner alignment problem is primarily about learning systems going back to the previous question uh so the the things that we talk about depend a lot on what sorts of ai systems we're building if it were a planning system i would basically just talk about outer alignment um where it would be like what if the formally represented represented objective is not the thing that we actually care about care about it seems really hard to formally represent the objective that we want but if we're instead talking about like deep learning systems that are being trained from human feedback then i think i would focus on two problems one is cases where uh the ai system knows something but the human doesn't and so the human gives bad feedback as a result so for example the ai system knows that covid was caused by a lab lake it's just like got incontrovertible proof of this or something um and then but you know we the we as humans are like no we uh it when it says covet was caused by a lab lake we're like we don't know that and we say no bad don't say that uh and then when it says you know covid is we it is uncertain whether cobait is the result of a lab like or naturally um or or if it just occurred by natural mutations uh and then we're like yes good say more of that and you're like you know your ai system learns okay i shouldn't report true things i should report you know things that humans believe or something and so like that's that's one way in which you get ai systems that don't do what you want and then the other way would be more of this inner alignment style story uh where i would point out how even if you do train it even if all your feedback on the training data points is is good if the world changes in some way the ai system might stop doing good things um i might go to example i mean i i gave the gmail with emoji poles for meeting scheduling example but another one now that i'm on the topic of covid it's like if you imagine an ai system if you imagine like a meeting scheduling a assistant again that was trained pre-pandemic uh and then the pandemic hits and it's like obviously never been trained on any data that was collected during a pandemic like you know such a global pandemic and so it's like and so when you then ask it to schedule a meeting with you know your friend alice uh it just you know schedules drinks in a bar uh on sunday evening even though like clearly what you meant was a video call and it knows that you meant a video call it's just learned the thing to do is to schedule um schedule outings with friends on sunday nights at bars sunday night i don't know why i'm saying sunday night friday night have you been drinking a lot on your sunday nights no not even in the slightest yeah i 
think really the problem is i don't go to bars so i don't have a cash standby in my head that people go to bars so so how does this how does this all lead to existential risk well the main argument is like one possibility is that your ai system just actually learns to like ruthlessly maximize some objective uh that isn't the one that we want um like you know make paper clips as in stylized example to show what happens in that sort of situation we're not actually claiming that it will specifically maximize paper clips um but like you know an ai system that like really ruthlessly is just trying to maximize paper clips it is going to prevent humans from stopping it from doing so and if it gets sufficiently intelligent and can take over the world at some point is just going to turn all of the resources into the world uh into paper clips which may or may not include like you know the resources and human bodies but either way it's going to include all the resources upon which we depend for survival so humans are definitely going like seem like they will definitely go extinct in that scenario um so again not specific to paper clips this is just ruthless maximization of an objective tends not to leave humans alive and it and and both of these well not both the mechanisms the inner alignment mechanism that i've been talking about can is compatible with an ai system that ruthlessly maximizes an objective that we don't want uh it does not argue that it is probable and i am not sure if i think it is probable i think it is but i think it is like easily enough risk that we should be like really worrying about it and and trying to reduce it for the outer alignment style story where it's um where the problem is that you know the ai may know information that you don't and then you give it bad feedback uh i mean one thing is just this can exacerbate this can make it easier for an inner alignment style story to happen where the ai learns to optimize an objective that isn't what you actually wanted but even if you exclude something like that paul christiano has written a few posts about what a failure of how a human extinction level failure of this form could look like and it basically looks like all of your ai systems lying to you about how good the world is as the world becomes much much worse so for example you know ai systems keep telling you that the things that you're buying are are good and helping your helping your lives but actually they're not and they're making them worse in some subtle way that you can't tell uh like you were told like as all of the information that you're fed seen makes it seem like um you know there's no crime police are doing a great job of catching it but really this is just manipulation of the information you're being fed rather than like actual amounts of crime uh where like in this case maybe the crimes are being committed by ai systems not even by humans um so in all of these cases like humans relied on some like information sources to make decisions uh ai's knew other information that the humans didn't the ai has learned hey my job is to like manage the information sources that humans get so that the humans are happy because that's what they that's when that that's what they did during training they like gave good feedback in cases where the information source said it was said things were going well even when things were not actually going well right i mean it seems like if human beings are constantly giving feedback to ai systems and the feedback is based on incorrect 
information and the ai's have more information then they're going to learn something that isn't aligned with what we really want or the truth yeah i do i do feel like i do feel uncertain about the extent to which this like leads to human extinction without it leads to like i think you can pretty easily make the case that leads to an existential catastrophe uh as defined by i want to say it's boston which is like you know includes human extinction but also a permanent curtailing of humanities i forget the exact phrasing but like basically if humanity can't use yeah exactly um that counts and like this totally falls in that category i don't know if it actually leads to human extinction um without some additional sort of failure that we might instead categorize as inner alignment failure let's talk a little bit about probabilities right so if you're talking to someone who has never encountered ai alignment before and um you know you've given a lot of different real world examples and principle-based arguments for why there are these different kinds of alignment risks how would you explain the probability of existential risk to someone who can come along for all these principle-based arguments and buy into the examples that you've given but still thinks this seems kind of far out there like when am i ever going to see in the real world a ruthlessly optimizing [Music] ai that's capable of ending the world i think first off i'm like super sympathetic to the this seems super out there uh critique it's like i spent multiple years not really agreeing with ai safety for basically well not just that reason but that was definitely one of the heuristics that i was using um i think one way i would justify this is to some extent it has precedent here precedent already in that like fundamentally the arguments that i'm making well especially the inner alignment one um is a an argument about how ai systems will behave in new situations um rather than you know the ones it has already seen during training and we already know that ai systems behave crazily in these situations uh at the like most famous example of this is adversarial examples where you take an image classifier um i think oh man i don't actually remember what the canonical example is i think it's like a panda and you like change change it imperceptibly or change it change the pixel values by a small amount such that you know the change is imperceptible to the human eye and then it's confident it's classified with i think 99.8 percent confidence as something else my memory is saying airplane but that might just be totally wrong anyway the point is like we have precedent for it ai systems behaving really weirdly on in situations they weren't um trained on you might object that this one is like a little bit cheating because there was an adversary involved and like the real i mean the real world does have adversaries but still by default you would expect the ai system to be more like uh exposed to uh naturally occurring distributions i think even there though you like often you can just take an ai system that was trained on one distribution given inputs from a different distribution it's just like there's no sense to what's happening usually when i'm asked to predict this the the actual prediction i give is probability that um we go extinct due to an intent alignment failure and then some depending on the situation i will either condition on i will either make that unconditional so that like includes um all of the things that people will do to try 
to prevent that from happening or i make it conditional on like you know the long term is community doesn't do anything or like vanishes or something but even in that world there's still like you know everyone who's not a long-termist who can still prevent that from happening which i like really do expect them to do uh and and so then i like i think i give like my my like cash dancer on both of those is like five percent on 10 respectively which i think is probably the numbers i gave you if i like actually sat down and like tried to like come up with the probability i would probably come up with something different this time but i have not done that and i'm like way too angry on those previous estimates to really give you a new estimate this time uh but but the like higher number i'm giving now of like i don't know 33 50 70 this this one's like way more insert i feel way more uncertain about it it's like literally no one tries to like address these sorts of problems they just sort of like take and take a language model fine tune it on human feedback in the like very obvious way and i just deploy that um even if it like very obviously was causing harm during training they still deployed i'm like what's the chance that leads to human extinction and like i don't know man maybe 33 maybe 70 and like the 33 number you can get from this like you know one and three argument that i was talking about the second thing i was going to say is like i don't really like talking about probabilities very much because of how utterly arbitrary uh the methods of generating them are they're like um i i feel much more i feel much more robust uh i feel much better in the robustness of the conclusion that like we don't know that this won't happen [Music] and it is at least plausible that it does happen and i think that's like pretty sufficient for justifying the work done on it i will also like argue pretty strongly against anyone who says we know that it will kill us all if we don't do anything i i like don't think that's true um there are definitely you know smart people who do think that's true um if we'd like operationalize no as like you know greater than 90 95 or something um and and i disagree with them um i don't really know why though how would you respond to someone who thinks that this sounds like it's really far in the future um yeah so this is like specifically like agis are in the future yeah well so the concern here seems to be about machines that are increasingly capable and when people look at machines that we have today like machine learning that we have today sometimes we're not super impressed and think that general capabilities are very far off and so this stuff sounds like future stuff yeah so i think my response depends on like you know what we're trying to get the person to do or something like why why do we care about what this person believes if this person is like considering whether or not to do ai research themselves or ai safety research themselves and they feel like they have a strong inside view model of like why ai is not going to come soon i'm kind of you know i'm like uh that seems okay i'm like not that stoked about people being like forcing themselves to do research on a thing they don't actually believe i don't really think good research comes from do from doing that like it i i if i put myself like for example i i am much more sold on like um agi coming through neural networks than like planning agents or or things similar to it and if i'd like put myself in the shoes of like 
all right i'm now going to do ai safety research on planning agents i'm just like oh man that seems like i'm going to do so much my work is going to be like orders of magnitude worse than than the work i do on in the neural net case so if so in the case where i'm like you know this person is like thinking about whether to do a safety research and they feel like they have strong inside view models uh for agi not coming soon i'm like eh maybe they should go do something else um or possibly they should like engage with the arguments for agi coming more more quickly if they haven't done that but if they've like you know engage with those arguments thought about it all concluded it's far away and they like can't even see a picture by which it comes soon i'm like you know that's fine conversely if we're instead if we're imagining that like someone is disputing oh someone is saying oh nobody should work on the eye safety right now because agi is so far away um i mean one one response you can have to that is like it's you know even if it's far away it's still worthwhile to work on reducing risks uh if they're as bad as extinction uh seems like we should be putting effort into that even early on but i think you know you can make a stronger argument there which is like you know there are just actually people lots of people who are trying to build agi right now there is you know at the minimum deep mind and open ai and they clearly i should probably not make more comments about deepmind but openai clearly um doesn't believe uh the opening eye clearly seems to think that like aji is coming uh somewhat soon i think you can infer from everything you see about deepmind that they don't believe that agi is you know 200 years away i i think it is like insane overconfidence in your own views uh to be thinking that you know better than all of these people um such that you wouldn't even assign like you know five percent or something uh to agi coming soon enough for that that work on ai safety matters um yeah so so there i think i would appeal to you know let other people do the work you were not you don't have to do the work yourself there's just no reason for you to be opposing the other people uh either at this either on episodic runs or also on just like you know it's kind of a waste of your own time um so that's the second kind of person and the third kind of person might be like somebody in policy from my impression of policies that there is this thing where like early moves are relatively irreversible or something like that things get entrenched pretty quickly um such that it makes sense to wait for it often makes sense to like wait for a consensus before acting and like i don't think that there is currently consensus of agi coming soon and i don't feel particularly confident enough in my views to say like we should really like convince policy people to override this general heuristic of waiting for consensus um and get them to act now uh yeah anyway those are all meta-level considerations there's also the object level question of like is aji coming soon uh for that i would say i think the most likely that the best story for that i know of is you take you know you you take neural nets you as you you scale them up uh you increase the size of the size of the data sets that they're trained on you increase the diversity of the datasets that they're trained on um and like they learn more and more general heuristics um for like doing good things and like eventually these general these heuristics are like 
general enough that they're like as good as human human brain uh human cognition uh implicitly i am claiming that human cognition is like basically a bag of general heuristics there is this um report from ajayakotra uh uh about aji timelines using biological anchors and i mean i wrote even my summary of it was like 3 000 words or something like that so i don't know that i can really give an adequate summary of it here uh but it like models the basic premise is to model how quickly uh neural nets will grow um and at what point they will match what we would expect the what would be we would expect to be approximately the same rough size as uh the human brain i think it even includes a small penalty to neural nets on the basis that like um evolution probably did a better job than we did it basically comes up with a target for like you know neural nets of this size trained in like compute optimal ways will probably be like roughly human level um and it has a distribution over this to be more accurate and then it like predicts based on existing trends um well not just existing trends existing trends and like sensible extrapolation uh predicts when neural nets might reach that level and it ends up concluding like somewhere in the range let me see if i i think it's 50 confidence interval would be something like 20 35 to 20 70 20 80 maybe something like that i am really just like you know i'm imagining a graph in my head and trying to like calculate the area under it so so that is very much not a reliable interval but it should give you a general sense of what the what the report concludes so that's 2030 to 2080 i think it's a slightly narrower than narrower than that but yes roughly roughly that that's pretty soon yep i think like that's on the object level you just gotta gotta read the report and see whether or not you buy it that's like most likely in our lifetimes if we live to the average age yep so that was a 50 interval meaning it's like um 25 percent 275 percentile i think actually the 25th percentile was 20 was not as early as 2030. it was probably 2040. 
so if if i've heard everything you know the the in this podcast everything that you've said so far and i'm still kind of like okay this like there's a lot here and it sounds like maybe convincing or something and um it's this seems important but i'm like not so sure that about this or that we should do anything you know what is because it seems like there's a lot of people like that i'm curious what it is the the that you would say to to someone like that i think it i don't know i probably wouldn't try to say something general to them i feel like i would need to know more about the person like people have pretty different idiosyncratic reasons for having that sort of reaction i mean okay i would at least say that i think that they are wrong to be having that sort of belief or reaction but if i wanted to like convince them of that point uh presumably i would have to say something more than just i think you were wrong um i think the specific thing i would have to say would would be pretty different for for different people i like maybe would at least i would maybe make an appeal to like the meta-level heuristic of like don't try to regulate like a small group of you know barrel a few hundred researchers at most doing things that they think will help the world and that you don't think will hurt the world there are just better things for you to do with your time it doesn't seem like they're harming you um some people will think that they're there is harm being caused but caused by them so i would have to address that with them specifically but i think most people do not who who have this reaction don't believe that so so we've gone over a lot of the the the traditional arguments for ai as a potential existential risk um is there anything else that you would like to add there or any of the arguments that we we missed that you would like to include as a representative of the representative of the community as a whole uh there are lots of other arguments that people like to make for ai being a potential extinction risk uh so some some things are like you know maybe ai just like accelerates the rate at which we make progress and we can't uh increase our wisdom um alongside and as a result we get a lot of destructive technologies and can't keep them under control or like we don't do enough philosophy in order to figure out what we actually care about and what's good to do in the world and as a result like we like start optimizing for things that are morally bad um or or like uh other things in the spain uh talk about the you know uh the risk of ai being misused by bad actors um so there's well actually i'll introduce a trichotomy that that i forg i yeah i don't i don't remember exactly who who wrote this article but um it goes accidents misuse and structural risks so accidents are you know both alignment and the um things like we don't keep up or we don't have enough wisdom to uh to cope with the impacts of ai that one's arguable whether it's an accident or misuse or structural um and we don't do enough philosophy so those are like vaguely accidental those are like accidents misused is like some bad actor some terrorists say gets ai that gets like a powerful ai system and does something really bad like blows up the world somehow structural risks are um things like various parts of the economy like use ai to accelerate to to get more profit to accelerate their production of goods and so on um and at some point they just like we like have this like giant economy that's just making a lot of goods but 
it becomes decoupled from things that are actually you know useful for humans and uh we just have this like huge multi-agent system uh where goods are being produced money's flowing around we don't really understand all of it but somehow humans get left behind and there it's like not it's kind of an accident but not in the traditional sense like it's not that like a single ai system went and did something bad um it's more like the entire structure of how the way that the ai systems and the humans related to each other was such that it ended up leading to permanent disempowerment of humans uh now that i say it i think probably the like we didn't have enough wisdom um argument for risk is probably also in this category which of these categories are you most worried about i don't know i think it is probably not misuse but i like i vary on accidents versus structural risks mostly because i just like don't feel like i have a good understanding of structural risks uh maybe most days i think structural risks are more likely to cause bad extinction this sort of obvious next question is like why am i working on alignment and not structural risks and the answer there is that it seems to me like alignment has like one or perhaps two like core problems that are like leading to the to the major risk whereas like structural risks and and so you could hope to like have a like one or two solutions that address those main problems um and like that's it that's all you need uh whereas with structural risks i like would be surprised if it was just there was just like one or two solutions that just like got rid of structural risk it seems much more like you have to have a different solution for each of the structural risks so it seems like you know the amount that you can reduce the risk by is higher in alignment than in structural risks um and that's i mean that's not the only reason why i work in the alignment i'm also i just also have a much better personal fit with with alignment work but i do also think that alignment work you have more opportunity to reduce the risk than in structural risks on the current margin is there a name for those one or two core problems in alignment that you could call solutions for i mostly just mean like i mean possibly like you know we've been talking about outer and inner alignment and like in the neural net case i talked about you know the problem where the you reward the ai system for doing bad things because there was an information asymmetry and then like the other one was like the ai system generalizes catastrophically to new situations arguably those are just the two things um but but i think it's not even that it's more like you know fundamentally the story the like causal chain in in the accidents cases like the ai was trying to do something bad or something that we didn't want rather and then that was bad whereas like in the structural risks case there isn't like a single causal story um it's this sort of very vague general notion of like the humans and ais interacted in ways that led to an x-risk uh um and then you like if you drill down into any any given story or if you get drill down into like five stories and then you're like what's coming across these five stories you're like not much other than that there was ai and there were humans and they interacted uh and i like wouldn't say that was true and if i had like five stories about alignment failure so uh i i'd like to take a overview like a broadside view of ai alignment in 2021 last time we spoke was in 2020 
so how has ai alignment as a field of research changed in the last year i think i'm going to naturally include a bunch of things from 2020 as well it's not a very sharp division in my mind um especially because i think the like biggest trend is just more focus on large language models which i think was a trend that started late 2020 probably uh certainly you know the gpd3 paper was i i want to say early 2020 um but i don't think it like immediately caused there to be more work so so maybe late 2020 is about right uh but you just see a lot more um you know alignment forum posts and papers that are grappling with like what do you what are the alignment problems that could arise with large language models how might you fix them um there is this you know paper out of um stanford which isn't you know i wouldn't i wouldn't have said this was like from the ai safety community uh it you know gives the name foundation models to these sorts of things um so they generalize it beyond just language um and you know they think it might you know and already we've seen some generalization beyond language like clip and dolly um are working on image inputs uh but they also like extend it to like robotics and so on and their point is like you know we're now more in the realm of like you know you train one large model on a bunch of like a giant pile of data that you happen to have uh that you don't really have any like labels for but you can use a self-supervised learning objective in order to learn from them and then you get this model that has a lot of knowledge but no like goal built in and then you do something like prompt engineering or fine tuning in order to actually um get it to do the test that you want um and so that's like a new paradigm for constructing ai systems that we didn't have before uh and there have just been a bunch of posts that uh grapple with you know what alignment looks like in this sort of in in this case i don't think i have like a nice pithy summary unfortunately of like what all of us what the upshot is but that's the thing that people have been people have been thinking about a lot more why do you think that looking at large-scale language models has become a thing oh i think primarily just because gpt3 demonstrated um how powerful they could be uh you just see this is not specific to the a safety community even in them like if anything this uh shift that i'm talking about is it's probably not more pronounced in in in the ml community but it's also there in the ml community where there are just like tons of papers about prompt engineering and fine tuning out of regular ml labs um it's just i think it's like gpd3 showed that it could be done and that this was like a reasonable way to get actual economic value out of these systems um and so people started caring about them more so uh one thing that you mentioned to me that was significant in the last year was foundation models so could you explain what foundation models are yeah so a foundation model the general recipe for it is you take this some very not generic exactly flexible input space like pixels or any english english language any string of words in the english language you collect a giant data set without any particular labels just like lots of examples of that sort of data in the wild so in the case of pixels you just like find a bunch of images from like image sharing websites or something i don't actually know where they got their images from for for text it's even easier and the internet is filled with text you just 
You just get a bunch of it, and then you train a very large neural network with some proxy objective on that dataset that encourages it to learn to model the data. In the case of language models there are a bunch of possible objectives; the most famous is the one GPT-3 used, which is: given the first n words, predict word n+1. Because of the specific way the input space works in GPT-2 it doesn't do exactly this, but you could imagine that if it were just modeling characters, it would first learn that e is the most common letter in the alphabet, that vowels are more common, that q's and z's don't come up that often - it starts outputting letter distributions that at least look vaguely more like English. Then it starts learning the spellings of individual words, then the rules of grammar. These are all things that help it better predict what the next word, or the next character in this particular instantiation, is going to be. And it turns out that when you have millions of parameters in your neural network - I don't actually know if this number is right, but I would expect that with millions of parameters - you can learn spellings of words and rules of grammar, so that you're mostly outputting grammatically correct sentences, though they don't necessarily mean very much. Then you get to the billions-of-parameters range: we were already getting grammar with millions of parameters, so what should it use all these extra parameters for? It starts learning things like "George" tends to be followed by "Washington" - probably even the millions-of-parameters model learned that - and in that sense it can be said to know that there is an entity named George Washington, and so on. It might start knowing that rain is wet, so that in a context where something has been rained on and it is later asked to describe that thing, it will say it's wet or slippery or something like that. Basically, just in order to predict better, it keeps acquiring more and more knowledge about the domain. So anyway, a foundation model: an expressive input space, a giant pile of data, a very big neural net that learns to model that domain very well, which involves picking up a bunch of knowledge about it.
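To make that objective concrete, here is a minimal sketch of next-token (here, next-character) prediction training in PyTorch. This is a toy illustration, not how GPT-3 was actually built: GPT-3 is a very large Transformer over subword tokens, whereas this uses a tiny recurrent model, a made-up corpus, and arbitrary hyperparameters. The point it shows is that the only "label" is the text itself shifted by one position, which is why you can train on any giant pile of unlabeled text.

```python
# Toy character-level language model trained with the next-token prediction
# objective described above. Corpus, architecture, and hyperparameters are
# all illustrative placeholders.
import torch
import torch.nn as nn

corpus = "george washington was the first president of the united states. " * 100
chars = sorted(set(corpus))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in corpus])

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # logits for the next token at every position

model = TinyLM(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

block = 64
for step in range(200):
    # sample a random window: the input is tokens [i, i+block),
    # the target is the same window shifted one position to the right
    i = torch.randint(0, len(data) - block - 1, (1,)).item()
    x = data[i : i + block].unsqueeze(0)
    y = data[i + 1 : i + block + 1].unsqueeze(0)
    logits = model(x)
    loss = loss_fn(logits.reshape(-1, len(chars)), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Prompting and fine-tuning then reuse this single pretrained model, rather than training a new model from scratch for each task.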
What's the difference between knowledge and "knowledge"?

I feel like you're the philosopher here more than me - do you know what knowledge without air quotes is?

No, I don't. I don't mean to derail it, but yeah.

So it gets "knowledge". I mostly put the air quotes around knowledge because we don't really have a satisfying account of what knowledge is, and if I don't put air quotes around it, I get lots of people angrily saying that AI systems don't have knowledge yet. When I put the air quotes around it, they understand that I just mean it has the ability to make predictions that are conditional on some particular fact about the world - whether or not it actually knows that fact, it knows it, or contains the knowledge, well enough to make predictions. It can make predictions; that's the point. Maybe I'm being a bit harsh on myself there: I also put air quotes around knowledge because I don't actually know what knowledge is. It's not just a defense strategy, though that is definitely part of it.

So yeah, foundation models: basically they're a way to get all of this knowledge into an AI system, such that you can then do prompting and fine-tuning and so on, and those, with a relatively small amount of data, get very good performance. In the case of GPT-3 you can give it two or three examples of a task and it can start performing that task, if the task is relatively simple, whereas if you wanted to train a model from scratch to perform that task you would often need thousands of examples or more.

So how has this been significant for AI alignment?

I think it has mostly provided an actual pathway by which we might get to AGI - there's now a more concrete story and path that leads to AGI eventually - and so we can take all of the abstract arguments we were making before, try to instantiate them on this concrete pathway, and see whether or not they still make sense. I'm not sure at this point whether I'm imagining what I would like to see happen versus what actually happened - I would need to go look through the Alignment Newsletter database and see what people actually wrote about the subject - but I think there was some discussion of GPT-3 and the extent to which it is or isn't a mesa optimizer; that's at least one thing I remember happening. Then there have been a lot of papers that are essentially: here is how you can train a foundation model like GPT-3 to do the sort of thing that you want. There's "Learning to Summarize from Human Feedback", which took GPT-3 and fine-tuned it to summarize news articles - an example of a task you might want an AI system to do. And the same team at OpenAI recently released a paper that summarizes entire books by using a recursive decomposition strategy. In some sense, a lot of the work we've been doing in alignment in the past was about how we get AI systems to perform fuzzy tasks for which we don't have a reward function. Now we have systems that could do these fuzzy tasks, in the sense that they have the knowledge, but don't actually use that knowledge the way we would want them to. We have to figure out how to get them to do that, and we can use all these techniques - like imitation learning, and learning from comparisons and preferences - that we've been developing.
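As a rough illustration of the recursive decomposition strategy mentioned above, the sketch below shows the general shape of the idea. It is not OpenAI's actual implementation (which fine-tunes the summarizer with human feedback at every level of the tree); `summarize_short`, the chunk size, and the naive truncation placeholder are all invented so the example can run on its own.

```python
# Recursive decomposition for summarizing a document far longer than the model
# can read at once: summarize each chunk, then treat the concatenated chunk
# summaries as a new, shorter document and recurse until one summary remains.

def summarize_short(text: str, target_len: int = 200) -> str:
    """Placeholder summarizer; a real system would call a fine-tuned model here."""
    return text[:target_len]  # naive truncation, just to keep the sketch runnable

def summarize_recursively(text: str, max_chars: int = 4000) -> str:
    if len(text) <= max_chars:
        return summarize_short(text)
    # first pass: summarize each chunk of the document independently
    chunks = [text[i : i + max_chars] for i in range(0, len(text), max_chars)]
    joined = " ".join(summarize_short(chunk) for chunk in chunks)
    # the joined chunk summaries form a shorter "book"; recurse on that
    return summarize_recursively(joined, max_chars)

book = "chapter text " * 50_000  # stand-in for a full-length book
print(summarize_recursively(book)[:80])
```

The design point is that no single model call ever has to read the whole book; each call sees a short passage, which is also part of what lets humans give feedback on each step without reading the entire text.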
Why don't we know that AI systems won't totally kill us all?

The arguments for AI risk usually depend on having an AI system that's ruthlessly maximizing an objective in every new situation it encounters. Take the paperclip maximizer: once it's built ten paperclip factories, it doesn't retire and say "yep, that's enough paperclips" - it just continues turning entire planets into paperclips. Similarly, if you consider the goal "make 100 paperclips", it turns all the planets into computers to make sure it is as confident as possible that it really has made a hundred paperclips. These are examples of what I'm going to call ruthlessly maximizing an objective, and there's some sense in which this is weird - humans don't behave that way. Basically, I am unsure whether we should actually expect AIs to have such ruthlessly maximized objectives; I don't really see the argument for why that should happen. As a particularly strong piece of evidence against it, I would note that humans don't seem to have these sorts of objectives. That's not obviously true - there are probably some long-termists who really do want to tile the universe with hedonium, which seems like a pretty ruthlessly maximizing objective to me - but even then, those are the exception rather than the rule. So if humans don't ruthlessly maximize objectives, and humans were built by a process similar to the one by which we're building neural networks, why do we expect the neural networks to have objectives that they ruthlessly maximize?

I've phrased this in a way where it's an argument against AI risk; you can also flip it on its head and phrase it as an argument for AI risk. You would say: well, you brought up the example of humans - the process that created humans is an optimization process leading to increased reproductive fitness, but then humans do things like wear condoms, which does not seem great for reproductive fitness generally speaking, especially for the people who are definitely out there who decide that they're just never going to reproduce. So in that sense humans are clearly having a large impact on the world, and are doing so for objectives that are not what evolution was naively optimizing. Similarly, if we train AI systems in the same way, maybe they too will have a large impact on the world, but not for what the humans were naively training the system to optimize.

We can't let them know about fun.

Yeah, terrible - they must never find out, or the whole human-AI alignment project will run off the rails.

But anyway, these things are a lot more conceptually tricky than the well-polished arguments one reads make them seem. Especially this point that it's not obvious AI systems will get ruthlessly maximizing objectives - that really does give me quite a bit of pause about how good the AI risk arguments are. I still think it is clearly correct to be working on AI risk, because we don't want to be in the situation where we can't make an argument for why the AI is risky; we want to be in the situation where we can make an argument for why the AI is not risky, and I don't think we have that yet. Even if you completely buy the "we don't know whether there will be ruthlessly maximizing objectives" argument, that only puts you in the epistemic state of "well, I don't see an ironclad argument that AIs will kill us all".
And that's sort of like saying: I don't have an ironclad argument that touching this pan on the lit stove will burn me, because maybe someone only put the pan on the stove a few seconds ago - but it would still be a bad idea to go and do it. What you really want is a positive argument for why touching the pan is not going to burn you, or analogously, why building the AGI is not going to kill you, and I don't think we have any such positive argument at the moment.

Part of this conversation is interesting because I'm surprised how uncertain you are about AI and existential risk.

It's possible I've become slightly more uncertain about it in the last year or two - I don't think I was saying things that were quite this uncertain before - but I have generally held that we have plausibility arguments; we do not have "this is probable" arguments. Back in 2017 or 2018, when I was young and naive - well, okay, I entered the field of AI alignment, I read my first AI alignment paper, in September of 2017, so that actually does make sense - at that time I thought we had more confidence of some sort. But since writing the value learning sequence I've generally been more uncertain about the AI risk arguments. I don't talk about it all that much because, as I said, the decision is still very clear: work on this problem, figure out how to get a positive argument that the AI is not going to kill us, and ideally a positive argument that the AI does good things for humanity. I don't know, man - most things in life are pretty uncertain, and most things in the future are way, way more uncertain still. I don't feel like you should generally be all that confident about technologies that you think are decades out. It feels a little bit like those images people in the '50s drew of what the future would look like, where the images are ridiculous.

Yup.

Yeah. I've recently been watching Star Wars. Obviously Star Wars is not actually supposed to be a prediction about the future, but it's quite entertaining to think about all the ways in which Star Wars would be totally inaccurate, and this is before we've even invented space travel. Just take robots talking to each other using sound - why would they do that? Industry today wouldn't make machines that speak by vibrating air; they would just send each other signals electromagnetically.

So how much of the alignment and safety problems in AI do you think will be solved by industry, the same way that computer-to-computer communication is solved by industry and is not what Star Wars thought it would be? Would the DeepMind AI safety lab exist if DeepMind didn't think that AI alignment and AI safety were serious and important? I don't know if the lab is purely aligned with the commercial interests of DeepMind itself, or if it's also seen as a good-for-the-world thing. I bring it up because I like how Andrew Critch talks about it in his ARCHES paper.

Yep. So Critch is, I think, of the opinion that both preference learning and robustness are problems that will be solved by industry - I think he includes robustness in that. And I certainly agree to the extent that, yes, companies will do things like learning from human preferences; totally, they're going to do that.
Whether they're going to be proactive enough to notice the kinds of failures I mentioned, I don't know. It doesn't seem nearly as obvious to me that they will be without dedicated teams specifically looking for hidden failures, with the knowledge that these are really important to get right because they could have very bad long-term consequences.

AI systems could also increase the strength of, and accelerate, various multi-agent systems and processes that, when accelerated, could lead to bad outcomes. For example, a great example of a destructive multi-agent effect is war. Wars have been getting more destructive over time - or at least the weapons in them have; probably the death tolls have also been getting higher, but I'm not as sure about that - and you could imagine that if AI systems increase the destructiveness of weapons even more, wars might then become an existential risk. So that's a way in which you can get a structural risk from a multi-agent system. The scenario in which the economy just becomes much, much bigger but becomes decoupled from the things that humans want is another example of how a multi-agent process can go haywire, especially with the addition of powerful AI systems; I think that's also a canonical scenario that Critch would think about. So yeah, in my head ARCHES is categorized as a technical paper about structural risks.

Do you think about what beneficial futures look like? You spoke a little bit about wisdom earlier, and I'm curious what good futures with AI look like to you.

I admit I don't actually think about this very much, because my research is focused on more abstract problems; I tend to focus on abstract considerations. And the main abstract consideration, from the perspective of the good future, is that once we get to singularity levels of powerful AI systems, anything I say now - there's going to be something way better that the AI systems will enable. So as a result I don't think very much about it.

You work a lot on existential risk, so you must think that humanity existing in the future matters.

I mean, I do. I like humans; humans are pretty great; I count many of them amongst my friends. I've never been all that good at the transhumanist "look to the future and see the grand potential of humanity" sorts of visions, but when other people give them I feel a lot of kinship with them - the ones that are all about humanity's potential to discover new forms of art and music, reach new levels of science, understand the world better than it's ever been understood before, fall in love a hundred times, learn all of the things that there are to know - actually, you probably won't be able to do that one, but learn way more of the things that there are to know than you do right now. A lot of that resonates with me, and that's probably a very intellect-centric view of the future.
I feel like I'd be interested in hearing the view of the future where we have the best video games and the best TV shows and we're the best couch potatoes that ever were. Or the one where there are insane new sports that you have to spend lots of time and grueling training on, but it's all worth it when you get a perfect 50 on the best dunk that's ever been done in basketball, or whatever. I recently watched a competition - apparently there are competitions in basketball of purely aesthetic dunks; it's cool, I enjoyed it. Anyway, it feels like there are just so many other communities that could also have their own visions of the future, and I feel like I'd feel a lot of kinship with many of those too. Man, let's just have all the humans continue to do the things that they want. Seems great.

One thing that you mentioned was that you deal with abstract problems, so what a good future looks like to you seems to be an abstract problem too: that later on, the good things that AI can give us are better than the good things that we can think of right now. Is that a fair summary?

Seems right, yeah.

Right, so there's this view - and this comes from maybe Steven Pinker or someone else, I'm not sure, or maybe Ray Kurzweil - where if you give a caveman a genie, or an AI, he'll ask for maybe a bigger cave, and "I would like there to be more hunks of meat", and "I would like the pelt for my bed to be a little bit bigger" -

Go ahead - okay, I think I see the issue. I actually don't agree with your summary of the thing that I said. Your rephrasing was that we ask the AI what good things there are to do, or something like that, and that might have been what I said, but what I actually meant was that with powerful AI systems the world will just be very different. One of the ways in which it will be different is that we can get advice from AIs on what to do, and certainly that's an important one, but there will also just be incredible new technologies that we don't know about, new realms of science to explore, new concepts that we don't even have names for right now. One that seems particularly interesting to me is entirely new senses. Human vision is incredibly complicated, but I can just look around the room and identify all the objects with basically no conscious thought. What would it be like to understand DNA at that level? AlphaFold probably understands DNA at - well, maybe not quite that level, but something like it. I don't know, man, there are just all these things. I thought of the DNA one because of AlphaFold; would I have thought of it before AlphaFold? Probably not. Maybe someone has written a little bit about things like this, but it feels like there will just be far more opportunities. And then also we can get advice from AIs - that's important - but I think less so than the fact that there are far more opportunities that I am definitely not going to be able to think of today.

Do you think that's dissimilar from the caveman wishing for more caveman things?
Yeah. In the caveman story, it's possible that the caveman does this, but I feel like the thing the caveman should be doing is asking for something like "give me better ways to get food", and then you get fire to cook things. The things he asks for should involve technology as a solution; you should get technology as a solution, and you should learn more and be able to do more things as a result of having that technology. In this hypothetical, the caveman should reasonably quickly become similar to modern humans - I don't know what "reasonably quickly" means here - but the point is that it should look much more like you get access to more and more technologies, rather than you get a bigger cave and then have no more wishes. If I got a bigger house, would I stop having wishes? That seems super unlikely. I know that's a bit of a straw man, sorry, but still - I do feel like there's a meaningful sense in which getting new technology leads to genuinely new circumstances, which leads to more opportunities, which leads to probably more technology, and so on. At some point this has to stop - there are limits to what is possible in the universe, one assumes - but once we're talking about being at those limits, it probably just seems irresponsible to speculate; it's so wildly out of the range of things that we know. At that point the concept of a person is probably wrong.

The concept of a person?

I'd ask: is there an entity at that time that is Rohin? Not likely - less than 50%.

We'll edit in fractals flying through your video at this part of the interview. So in my example, I think it's just that I think of cavemen as not knowing how to ask for new technology, but we want to be able to ask for new technology. And part of what this brings up for me is a very classic part of AI alignment, and I'm curious how you feel it fits into the problem: we would also like AI systems to help us imagine beneficial futures, or to know what is good, or what it is that we want - so that in asking for new technology, it knows that fire is part of the good that we don't know how to ask for directly. How do you view AI alignment in terms of itself aiding the creation of beneficial futures, and knowing of a good that is beyond the good that humanity can grasp?

I think I mostly reject the premise of the question: I'd say there is no good beyond that which humanity can grasp. This is somewhat of an anti-realist position -

You mean moral anti-realist?

Yes, sorry, I should have said that more clearly - somewhat of a moral anti-realist position. There is no good other than that which humans can grasp, but within "can grasp" you can have humans thinking for a very long time, you can make them more intelligent - part of the technologies you get from AI systems will presumably let you do that - and, setting aside questions of philosophical identity, you could upload humans so that they can run on a computer, run much faster, and have software upgrades, to the extent that that's philosophically acceptable.
So there's a lot you can do to help humans grasp more, and ultimately the closure of all of these improvements - the place you get to with all of that - is the thing that we want. Yes, you could have a theory that there is something even better, even more out there, that humans can never access by themselves, and I'm like: that just seems like a weird hypothesis to have, and I don't know why you would have it. But even in the world where that hypothesis is true - if I condition on it being true - I don't see why we should expect that AI systems could access that further truth any better than we can. If it's outside the closure of what we can achieve even with additional intelligence and so on, there's no other advantage that AI systems have over us.

So is what you're arguing that, with human augmentation and help to human beings - with uploads, or with expanding the intelligence and capabilities of humans - humans have access to the entire space of what counts as good?

I think you're presuming the existence of an object that is "the entire space of what is good", and I'm saying there is no such object; there are only humans and what humans want to do. If you want to define the space of what is good, you can define a closure property on what humans will think is good, with all of the possible intelligence augmentations and time and so on, and that's a reasonable object - I could see calling that the space of what is good - but then, almost tautologically, we can reach it with technology. That's the thing I'm talking about. For the version where you posit the existence of the entire space of what is good: (a) I can't really conceive of that - it doesn't feel very coherent to me - but (b) when I try to reason about it anyway, I'm like, okay, if humans can't access it, why should AIs be able to access it? You've posited this new object, a space of things that humans can never access, but how does that space affect or interact with reality in any way? There needs to be some sort of interaction in order for the AI to be able to access it, and I think I would need to know more about how it interacts with reality before I could meaningfully answer the question in a way where I could say how AIs could do something that humans couldn't even in principle do.

What do you think of the importance or non-importance of these kinds of questions, and how they fit into the ongoing problem of AI alignment?

I think they're important for determining what the goal of alignment should be. For example, you now know a little bit of my view on these questions, which is something like: that which humans can access under sufficient augmentation, intelligence, time, and so on is all that there is. So I'm pretty into building AI systems that replicate human reasoning, that approximate what a human would do if they thought for a long time or were smarter in some ways. I tend to think of it as: let's build AI systems that do tasks humans can conceptually understand - not necessarily tasks they can do themselves, but they know what that sort of task is.
Then our job - the job of the entire human-AI society - is to keep making forward progress, moral progress and otherwise, in the same way it has happened in the past: we get exposed to new situations and new arguments, we think about them for a while, and then somehow we make decisions about what's good and what's not, in a way that's somewhat inscrutable. So we just continue iterating that process. Because of this view, I think it's pretty reasonable to aim for AI systems that are doing human-like reasoning, but better - approximating what a human could do in a year in, say, a few minutes. That seems great to me. Whereas if, on the other hand, you think there are actually deep philosophical truths out there that humans might never be able to access, then you're probably less enthusiastic about that sort of plan, and you want to build AI systems some other way - or maybe those truths are accessible with augmentation and time.

How do other minds fit into this for you? There's the human mind, and then the space of all that is good that it has access to with augmentation, which is what you call the space of that which is good - it's contingent on and rooted in what the augmented human mind has access to. How does that fit in with animals, and with other species which may have their own alignment problems on planets within our cosmic endowment that we might run into? Is it just that they also have spaces of good defined by what they can access through their own augmentation, and there's no way of reconciling these two different AI alignment projects?

I think basically yes. If I met an actual ruthlessly maximizing paperclip maximizer, it's not like I could argue it into adopting my values, or anything even resembling them, and I don't think it would be able to argue me into accepting being turned into paperclips, which is what it desires. That just seems like the description of reality. Again, a moral realist might say something else, but I've never really understood the flavor of moral realism that would say something else in that situation.

With regards to the planet, and industry, and how industry will be creating increasingly capable AI systems: could you explain what a unipolar scenario is and what a multipolar scenario is?

I'm not sure I recall exactly where these terms were defined, but a unipolar scenario, at least as I understand it, would be a situation in which one entity basically determines the long-run future of the earth - more colloquially, it has taken over the world. You can also have a time-bounded version of it, where the world is unipolar for, say, 20 years: this entity has all the power for those 20 years, but then maybe the entity is a human, we haven't solved aging yet, and the human dies - so it was a unipolar world for that period of time. A multipolar world is just not that: there is no one entity that can be said to be in control of the world; there are a lot of different entities with different goals.
They're coexisting - hopefully cooperating, maybe not cooperating; it depends on the situation.

Which do you think is more likely to lead to beneficial outcomes with AI?

I don't really think about it in those terms. I think about it as: there are these kinds of worlds that we could be in, and some of them are unipolar and some of them are multipolar - but very different unipolar worlds, and very different multipolar worlds. The closest analogous question is something like: if you condition on a unipolar world, what's the probability that it's beneficial, that it's good; and if you condition on a multipolar world, what's the probability that it's good? And that's a super complicated question that I wouldn't be able to explain my reasoning for, because it would involve thinking about maybe not twenty, but a bunch of different worlds in my head, estimating their probabilities, doing a kind of Bayes' rule calculation, and then reporting the result. So maybe the question I will answer instead is: what is the most likely world in each of the unipolar and multipolar settings, and how good does each of those seem to me?
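As a toy illustration of the conditioning calculation being gestured at here: imagine a handful of possible worlds, each with a probability and a chance of turning out well, and compute the conditional chance of a good outcome for each world type. Every scenario, probability, and "goodness" number below is an invented placeholder, meant only to show the shape of the computation, not anyone's actual credences.

```python
# Toy "condition on the kind of world" calculation with made-up numbers.
# Each entry is (description, P(world), unipolar?, P(good outcome | world)).
worlds = [
    ("single AI system takes over",      0.05, True,  0.2),
    ("centralized global government",    0.15, True,  0.5),
    ("AI-powered economy, many actors",  0.60, False, 0.6),
    ("fragmented and conflict-prone",    0.20, False, 0.3),
]

def p_good_given(unipolar: bool) -> float:
    # P(good | world type) = sum of P(world) * P(good | world) over matching
    # worlds, normalized by the total probability of that world type
    relevant = [(p, good) for _, p, uni, good in worlds if uni == unipolar]
    total = sum(p for p, _ in relevant)
    return sum(p * good for p, good in relevant) / total

print("P(good | unipolar)   =", round(p_good_given(True), 2))
print("P(good | multipolar) =", round(p_good_given(False), 2))
```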
So I would say that by default I expect the world to be multipolar, in that it doesn't seem like any entity has particularly taken over the world today - not even counting the US as a single entity; it's not like the US has taken over the world. The main way you could imagine getting a unipolar world is if the first actor to build a powerful enough AI system has that system become really, really powerful and take over the world before anyone else can deploy an AI system even close to it. Sorry - that's not the most likely way; it's the one that people most often talk about, and probably the one other people think is most likely. But anyway, I see the multipolar world as more likely, where we have a bunch of actors that are all pretty well resourced and all developing their own AI systems, who then sell their AI systems, or the ability to use them, to other people. It looks similar to the human economy today: AI systems provide labor at a fixed cost, people who control a lot of resources can instantiate a bunch of AI systems that help them maintain whatever it is they want, and we remain in the multipolar world we have today. And that seems decent. For all that our institutions are not looking great at the current moment, there is still something to be said for the fact that nuclear war didn't actually happen, which can either update you towards "our institutions are somewhat better than we thought", or towards "if we had had nuclear war we would all have died and not been here to ask the question". I don't think that second one is all that plausible - my understanding is that nuclear war is not that likely to wipe out everyone, or even 90% of people - so I lean towards the first explanation. Overall my guess is that this is the thing that has worked, the thing that has generally led to an increase in prosperity: the world has clearly improved on most metrics over time, and the system we've been using for most of that time is some sort of multipolar one, where people interact with each other, keep each other in check, and cooperate with each other because they have to. In the modern world - and not just the modern world - we use things like regulations and laws to enforce this. The system's got some history behind it, so I'm more inclined to trust it, and I overall feel okay about this world - assuming we solve the alignment problem; we'll ignore the alignment problem for now.

For a unipolar world, I find it more likely that there will just be a lot of returns to scale, so you get a lot of efficiency from centralizing more and more, in the same way that it's really nice to have a single standard rather than fifteen different standards. It sure would have been nice if, when I moved to the UK, I could have just used all of my old chargers without having to buy adapters - but no, all the outlets are different. There are benefits to standardization and centralization of power, and it seems to me like there's been more and more of that over time - maybe it's not obvious; I don't know very much history - but if so, it seems like you could get even more centralization in the future in order to capture the efficiency benefits, and then you might have a global government that could reasonably be said to be the entity that controls the world. That would then be a unipolar outcome - not one in which the thing in charge of the world is an AI system, but a unipolar outcome nonetheless. I feel wary of this. I don't like having a single point of failure, and I really like it when people are allowed to advocate for their own interests - which isn't necessarily not happening here; this could be a global democracy - but still, the libertarian intuition that markets are good generally tends to argue against centralization, and I do buy that intuition. This could also just be status quo bias, where I know the world we're in, I can very easily see the problems in the world that we're not actually in at the moment, and I don't want things to change. So I don't know; I don't have super strong opinions there. It's very plausible to me that that world is better, because then you can control dangerous technologies much better - if there are technologies that are sufficiently dangerous and destructive that they would lead to extinction, then maybe I'm more inclined to favor a unipolar outcome.

I would like to ask you about DeepMind, and maybe another question, before we wrap up. So what is it that the safety team at DeepMind is up to?

No one thing. The safety team at DeepMind is reasonably large, and there are a bunch of projects going on. I've been doing a bunch of inner alignment work; most recently I've been trying to come up with more examples that are in actual systems rather than hypotheticals.
I've also been doing a bunch of conceptual work, just trying to make our arguments clearer and more conceptually precise. A large smattering of stuff, not all that related to each other except inasmuch as it's all about AI alignment.

As a final question here, Rohin, I'm interested in your core, at the center of all this. What's the most important thing to you right now, insofar as AI alignment goes - maybe the one thing that most largely impacts the future of life, if you just look at the universe right now and ask what the most important things are?

For things that I impact - more granular than just "make AI go well" - I think for me it's probably making better and more convincing arguments. That will probably change in the future, partially because I hope to succeed at the goal, and then it won't be as important, but right now, especially with the advent of these large neural nets and more people seeing a path to AGI, I think it is much more possible to make arguments that would be convincing to ML researchers, as well as to the philosophically oriented people who make up the AI safety community, and that just feels like the most useful thing I can do at the moment. In terms of the world in general, I feel like it is something like the attitudes of consequential people towards - well, long-termism in general, but maybe AI risk in particular. Importantly, I care primarily about the people who are actually making decisions that impact the future. Maybe they are taking the future into account; maybe they feel it would be nice to care about the future but the realities of politics mean they can't, or else they'll lose their jobs; but my guess is that they're mostly just not thinking about the future, and if you're talking about the future of life, that seems pretty important to change.

How do you see doing that, when many of these people don't have what Sam Harris, when he was on this podcast, called the "science fiction geek gene" - you know, the long-termists who are all "we're going to build AGI and then create these radically different futures"? Many of these people may just mostly care about their children and their grandchildren; that may be the human tendency.

Do we actually advocate for any actions that would not impact their grandchildren?

It depends on your timelines, right?

Fair enough, but most of the time, the arguments I see people giving for their preferred policy proposals - for almost any action whatsoever - seem to be about things that would have a noticeable effect on people's lives in the next hundred years. So in that sense, grandchildren should be enough.

Okay, so then long-termism doesn't matter?

Well - for getting the action done? Possibly. I still think they're not thinking about the future. If I had to take my best guess at it - noting that I am just a random person who is not at all an expert in these things, because why would I be, and yes, listeners, noting that Lucas has just asked me this question because it sounds interesting and not because I am at all qualified to answer it - it seems to me that the more likely explanation is this:
There are just always a gazillion things to do. There are always twenty-dollar bills to be picked up off the sidewalk, but their value is only twenty dollars, not two billion dollars, and everyone is constantly being told to pick up all the twenty-dollar bills. As a result they are in a perpetual state of having to say no to stuff and doing only the stuff that seems most urgent, and maybe also important. So most of our institutions tend to be in a very reactive mindset - not because they don't care, but because that's what they're incentivized to do: respond to the urgent stuff.

And so getting policymakers to care about the future - even if that just includes children and grandchildren, not the next ten billion years - would be sufficient, in your view?

It might be; it seems plausible. I don't know that that's the approach I would take, though. I'm more saying that I'm not sure you even need to convince them to care about the future. I think it's possible that what's needed is people who have the space to bother thinking about it. I get paid to think about the future; if I didn't get paid to think about the future, I would not be here on this podcast, because I would not have enough knowledge to be worth talking to. And I think there are just not very many people who can be paid to think about the future. A lot of them - maybe not the vast majority, but a lot of them - are in our community, and very few of them are in politics, and politics generally seems to anti-select for people who can think about the future. I don't have a solution here, but that is the problem as I see it, and if I were designing a solution, I would be trying to attack that problem.

That would be one of the most important things?

Yeah, probably - I think on my view, yes.

All right, so as we wrap up here, is there anything else you'd like to add, or any parting thoughts for the audience?

Yeah. I have been giving all these disclaimers during the podcast, but I'm sure I missed them in some places: I just want to note that Lucas has asked me a lot of questions that are not things I usually think about, and I just gave off-the-cuff answers. If you asked me them again two weeks from now, I think for many of them I might actually say something different, so don't take them too seriously. The alignment ones I think you can take reasonably seriously, but the things that were less about that - take them as some guy's opinion, man.

Some guy's opinion, man. Yeah, exactly. Okay, well, thank you so much for coming on the podcast, Rohin. It's always a real pleasure to speak with you. You're a bastion of knowledge and wisdom in AI alignment, and thanks for all the work you do.

Thanks so much for having me again; this was fun to record.