
AXRP · Civilisational risk and strategy

AI Existential Risk with Paul Christiano

Why this matters

Auto-discovered candidate. Editorial positioning to be finalized.

Summary

Auto-discovered from AXRP. Editorial summary pending review.

Perspective map

Mixed · Governance · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
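
A minimal sketch of that tinting is below. The score range (-100 most risk-forward, 0 mixed midpoint, +100 most opportunity-forward) and the specific amber/cyan/white RGB values are assumptions for illustration, not the page's actual palette.

```python
# Minimal sketch of the bar tinting described above. Assumptions (not taken
# from this page's source): slice scores run from -100 (most risk-forward)
# through 0 (mixed midpoint) to +100 (most opportunity-forward), and these
# RGB values merely stand in for the real palette.

AMBER = (255, 191, 0)    # assumed risk-forward end of the strip
CYAN = (0, 188, 212)     # assumed mixed midpoint
WHITE = (255, 255, 255)  # assumed opportunity-forward end

def lerp(a, b, t):
    """Linearly interpolate between two RGB colours, t in [0, 1]."""
    return tuple(round(a[i] + (b[i] - a[i]) * t) for i in range(3))

def tint(score, lo=-100, hi=100):
    """Map a slice score onto the amber -> cyan -> white strip."""
    score = max(lo, min(hi, score))
    if score <= 0:
        return lerp(AMBER, CYAN, (score - lo) / (0 - lo))  # amber side
    return lerp(CYAN, WHITE, score / hi)                   # white side

# e.g. tint(-34) sits toward amber, tint(0) is cyan, tint(17) drifts to white
```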

Showing 140 of 198 segments for display; stats use the full pass.


Across 198 full-transcript segments: median 0 · mean -4 · spread -34 to 17 (p10–p90 -10 to 0) · 8% risk-forward, 92% mixed, 0% opportunity-forward slices.
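
For readers who want to reproduce these headline numbers, here is a minimal sketch assuming each slice carries a numeric score on the same -100 to +100 spectrum; the band cut-offs used for the risk-forward and opportunity-forward shares are illustrative guesses, not the page's actual thresholds.

```python
# Minimal sketch of the full-pass statistics quoted above. Assumes each of
# the 198 slices carries a score on a -100..+100 spectrum; the band cut-offs
# are illustrative guesses, not this page's actual thresholds.
from statistics import mean, median, quantiles

def summarise(scores, risk_cutoff=-25, opp_cutoff=25):
    """Aggregate per-slice scores into headline stats (mixed = the remainder)."""
    deciles = quantiles(scores, n=10)  # nine cut points: p10 ... p90
    n = len(scores)
    return {
        "median": median(scores),
        "mean": mean(scores),
        "p10": deciles[0],
        "p90": deciles[-1],
        "risk_forward_pct": 100 * sum(s <= risk_cutoff for s in scores) / n,
        "opportunity_forward_pct": 100 * sum(s >= opp_cutoff for s in scores) / n,
    }
```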

Slice bands
198 slices · p10–p90 -10 to 0

Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes existential risk
  • Emphasizes safety
  • Full transcript scored in 198 sequential slices (median slice 0).

Editor note

Auto-ingested from daily feed check. Review for editorial curation.

ai-safety · axrp


Episode transcript

YouTube captions (auto or uploaded) · video 3L4czdIa8Tg · stored Apr 2, 2026 · 5,910 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/ai-existential-risk-with-paul-christiano.json when you have a listen-based summary.
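
A hypothetical skeleton for that file is sketched below. The field names are guesses at what a listen-based summary might record, not the project's actual schema; check the real schema before committing anything.

```python
# Hypothetical skeleton for the missing assessment file. Every field name is
# a guess at what a listen-based summary might record -- verify against the
# project's actual schema before committing.
import json
from pathlib import Path

assessment = {
    "slug": "ai-existential-risk-with-paul-christiano",
    "summary": "",          # listen-based editorial summary
    "why_it_matters": "",   # editorial positioning
    "confidence": "medium",
    "notes": [],
}

path = Path("content/resources/transcript-assessments/"
            "ai-existential-risk-with-paul-christiano.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(assessment, indent=2) + "\n")
```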

Full transcript
hello everybody today i'll be speaking with paul cristiano paul is a researcher at the alignment research center where he works on developing means to align future machine learning systems with human interests after graduating from a phd in learning theory in 2017 he went on to research ai alignment to open ai eventually running their language model alignment team he is also a research associate at the future of humanity institute in oxford a board member at the research nonprofit aut a technical advisor for urban philanthropy and the co-founder of the summer program on applied rationality and cognition a high school math camp for links to what we're discussing you can check the description of this episode and you can read a transcript at axrp.net paul welcome to axerb thanks for having me on looking forward to talking all right so the first topic i want to talk about is this idea that ai might pose some kind of existential threat or an essential risk and there's this common uh definition of existential risk which is like a risk of something happening that would be it would like incapacitate humanity and limit its possibilities for development you know incredibly drastically in a way comparable to human extinction such as human extinction um is that roughly the definition you use yeah i think i don't necessarily have a bright line around giant or drastic drops versus moderate drops like i often think in terms of the expected fraction of humanity's potential that is lost but yeah that's basically what i think of it anything that could cause us not to fulfill some large chunk of our potential i often think yeah i think a van in particular like a failure to align ai maybe makes the future in my guess like 10 20 worse or something like that in expectation and that makes it one of the worst things i mean not the worst that's like a minority of all of our failure to fall short of our potential but it's a lot of failure to fall short of our potential you can't have that many 20 hits before you're down to like no potential left yeah when when you say a 20 10 20 percent hit so human potential in expectation do you mean like if we definitely failed to align ai or do you mean like uh we may or may not fail to align ai and overall that uncertainty equates to a 20 or 10 to 20 percent hit yeah that's unconditionally so i think if you told me we definitely mess up alignment maximally then i'm more like and i are looking at a pretty big close to 100 drop um i wouldn't go all the way to 100. 
it's not like literally as bad probably as a barren earth but it's pretty bad okay yeah supposing ai goes poorly or there's some kind of existential risk posed by some kind of i guess really bad ai what do you imagine that looking like yeah so i guess i think most often about alignment although i do think there are other ways that you could imagine ai going poorly okay and what's alignment yeah so by alignment i mean i guess a little bit more specifically we could say intent alignment by which i mean the property that your ai is trying to do what you want it to do so we're building these ai systems we imagine that they're going to help us they're going to you know do all the things humans currently do for each other they're going to help us build things they're going to help us solve problems the system is intended aligned if it's trying to do what we wanted to do and it's misaligned if it's not trying to do what we wanted to do so stereotypical bad cases you have some ai system that has sort of working across purposes to humanity maybe it wants to ensure that in the long run i mean yeah in the long run there are a lot of paper clips and humanity wants human flourishing and so there's some the future is then some compromise between paper clips and human flourishing and if you imagine that you have ai systems a lot more confident than humans that compromise may not be very favorable to humans you know it might be uh basically all paper glyphs okay so this is some world where we so you have an ai system that's like the thing it's trying to do is not what humans wanted to do and then not only is like a typical bad employee or something like it seems like you think that it like somehow takes over a bunch of stuff or gains infl like how are you imagining it being much much worse than like having a really bad employee today i think that the bad employee metaphor is not that bad and maybe this is a place i part ways from some people who work on alignment and the biggest difference is that you can imagine heading for a world where virtually all of the important cognitive work is done by machines so it's not as if you had one bad employee it's as if like for every flesh and blood human there were 10 bad employees okay and if you imagine a society in which like almost all the work is being done by these inhuman systems who want something that significantly across purposes like it's possible to have social arrangements in which their desires are thwarted but like you've kind of set up a really bad position and i think the best guess would be that what happens will not be what the humans want to happen but what these like greatly outnumbering systems want with those systems who greatly a number of us want to happen okay so we delegate a bunch of cognitive work to these ai systems and they're not doing what we want and i guess you you further think it's going to be hard to undelegate that work because like yeah why do you think it will be hard to undelegate that work i think there's basically two problems so one is if you're not delegating to your ad then what are you delegating to so if delegating to ai isn't a really efficient way to get things done and there's no other comparably efficient way to get things done then it's not really clear but there might be some general concern about the way in which ai systems are affecting the world but it's not really clear that people have a nice way to opt out and that might be a very hard coordination problem that's one problem the second problem is just you 
may have right things you may be unsure about whether things are going well or going poorly right if you imagine again this world where it's like there's 10 billion humans and 100 billion human level ai systems or something like that if one day it's like oh actually that was going really poorly that may not look like employees have embezzled a little money it may have looked like they have sort of grabbed the machinery by which you could have chosen to delegate to someone else but it's kind of the ship has sailed once you've instantiated 100 billion of these uh employees to whom you're delegating all this work maybe employee is not a it's kind of a weird or politically motivated metaphor not politically loaded metaphor um but the point is just you've made some like collective system much more powerful than humans one problem is you don't have any other options the other is like that system could clearly stop you over time eventually you're not going to be able to roll back those changes okay because almost all the people doing anything in the world don't want you to when people in quotes don't want you to roll back those changes so some people think like probably what's gonna happen is like one day all humans will wake up dead you might think that it looks like we're just stuck on earth and like ai systems get like the whole rest of the universe or you know keep expanding until they meet aliens or something uh what yeah what what like concretely do you think it looks like after that i think it depends both on technical facts about ai and on some facts about how we respond so i think some important context on this world i think like by default if we weren't being really careful one of the things that would happen is ai systems would be like running most militaries that mattered um so when we talk about like all of the employees are bad we don't just mean like people who are like say working in retail or working as scientists we also mean like the people who are taking orders when someone is like we'd like to blow up that city or whatever yep it's like by default i think right exactly how that looks depends on a lot of things but in most of the cases it involves you know humans or this tiny minority that's going to be pretty easily crushed and so there's a question of like do your ai systems want to crush humans or do they just want to do something else with the universe or what if your ai systems like wanted paper clips and your humans were like oh it's okay the eyes on clips will just turn them all off then you know you have a problem at the moment when the humans go to turn them all off or something and that problem may look like the eyes just say like sorry i don't want to be turned off um it may look like again i think that could get pretty ugly if there's a bunch of people like oh we don't like the way in which we've like built all of these machines doing all of the stuff if we're really unhappy with what they're doing um right that could end up looking like violent conflict it could end up looking like people being manipulated on a certain course it's kind of like depends on how you attempt to like how humans attempt to keep the future on track if at all and then like what resources are at the disposal of ai systems that want the future to go in this inhuman direction yeah i think that like probably my default visualization as humans won't actually make much effort really to keep like we won't be in the world where like it's all the forces of humanity are right arrayed against the 
forces of machines it's more just like the world will gradually drift off the rails if i gradually drift off the rails i mean humans will have less and less idea what's going on less and less like imagine like some really rich person who on paper has like a ton of money and is like asking things to happen but they give instructions to their subordinates and then like somehow nothing really ends up ever happening like they get like they don't know who they're supposed to talk to and like they are never able to figure out what's happening on the ground or like who to hold accountable um that's kind of my default picture i think the reason that i have that default picture is just because i don't expect humans to put up like in cases where we fail there's some way in which we're like not going to like really be pushing back that hard i think if we were really unhappy with that situation then instead like you could not gradually drift off the rails but if you really are messing up alignment then instead of gradually drifting off the rails it looks more like sort of an outbreak of violent conflict or something like that so i think that's a good sense of like what you see is as the risks of having like really smart ais that are not aligned do you think that that is like the main kind of ai generated existential risk to worry about or do you think that there are others that like you know you're not focusing on but they might exist yeah i think that there's two issues here one is that i kind of expect a general acceleration of everything that's happening in the world um so just as the world now like you might think that it takes like 20 to 50 years for things to change like a lot um long ago it used to take hundreds of years for things to change a lot yeah i do expect we will live to see a world where like it takes you know a couple years and then maybe a couple months for things to change a lot okay in some sense that entire acceleration is likely to be really tied up with ai like if you're imagining the world where once like next year the world looks completely different is much larger than it was this year that involves a lot of activity that humans aren't really involved in or understanding so i do think there's just a lot of stuff is likely to happen and from our perspective it's likely to be all tied up with ai i normally don't think about that um because i'm sort of not looking that far ahead that is in some sense i think there's not much calendar time between the world of now and the world of like crazy stuff is happening every month but a lot happens in the interim right the only way in which things are okay is if there are ai systems looking out for human interest as you're going through that transition and from the perspective of those ai systems a lot of time passes so like a lot of cognitive work happens so i guess the the first point was i think there are a lot of risks in the future in some sense from our perspective what it's going to feel like is the world accelerates and starts getting really crazy and somehow ai is tied up with that but like i think if you were to read like you know if you were to be looking on the outside you might then see all future risks as risks that felt like about the eye but in some sense that may not be they're kind of not our risks to deal with in some sense they're the risks of like the civilization that we become a civilization largely run by ad systems okay so you imagine like look we might just have like really dangerous problems later like maybe 
there's like aliens or maybe we like have to coordinate well and like that's okay somehow be involved or yeah so if you imagine like a future nuclear war something like that or if you imagine like all a future progressing really quickly then from your perspective on the outside what it looks like is now every huge amounts of change are occurring over the course of every year and so like one of those changes like you know somewhere that would have taken hundreds of years now only takes a couple years to get to the crazy destructive nuclear war and from your perspective it's kind of like man our crazy ai started a nuclear war from the ai's perspective it's like we had many generations of change and like this was one of the many coordination problems we faced and we ended up with a nuclear war it's kind of like do you attribute nuclear war is like a failure of like the industrial revolution or risk of the industrial revolution i think that would be a reasonable way to do the accounting if you do the accounting that way there are a lot of risks that are ai risks just in the sense that there are like a lot of risks that are like industrial revolution risks um that's one category of answer like i think there's a lot of risks that kind of feel like ai risks and that they'll be like consequences of crazy ai driven conflict or things like that just because i view i view a lot of the future as crazy fast stuff driven by ai systems okay there's a second category it's like risks that to me feel more analogous to alignment which are risks that are really associated with this early transition to ai systems where we will not yet have ai systems competent enough to play a significant role in addressing those risks so a lot of the work falls to us um i do think there are a lot of non-alignment risks associated with ai there that is like yeah i'm happy to go into more of those i think broadly the category that i am like most scared about is like there's some kind of deliberative trajectory humanity is kind of along ideally or that we want to be walking along we want to be better clarifying what we want to do with the universe what it is we want as humans how we should live together etc there's some question just like are we happy with where that process goes or like if you're a moral realist type like do we converge towards moral truth like if you think that there's some truth of the matter about what was good do we converge towards that but even if you don't think there's a fact of the matter you could still say like are we happy with the people we become and i think i'm scared of risks of that type and in some sense alignment is very similar to risks of that type because you kind of don't get a lot of tries at them like you're going to become some sort of person and then like after we as a society have like as we converge and what we want i was like what we want changes there's no one like looking outside of the system who's like oops we messed that one up let's try again it's like if you went down a bad path and now you're like you are sort of by construction where like you're now happy with where you are like the question is about what you wanted to achieve i think there's like potentially a lot of path dependence there a lot of that is tied up there are a lot of ways in which like the deployment of ai systems will really change the way that humans talk to each other and think about what we want or think about like how we should relate um i'm happy to talk about some of those but i think the broad thing 
is just like yeah if a lot of like thinking is being done not by humans that's just a weird situation for humans to be in yeah it's a little bit unclear like if you're not really thoughtful about that it's unclear if you're happy with right if you told me that like the world with ai and the world without ai like converge to different views about what is good i'm kind of like oh i don't know which of those once you tell me there's a big difference between those i'm kind of scared i don't know which side is right or wrong they're both kind of scary but i am definitely scared so i think you said that like relatively soon we might end up in this kind of world where most of the thinking is being done by ai yeah so there's this claim that like uh ai is going to get really good and it's sort of going to not only is it getting really good it's going to be like the dominant way we do most kind of cognitive work or most kind of thinking maybe and not only is that eventually going to happen it's not going to be too long from now i guess the first thing i'd like to hear is like by not too long from now do you mean the next thousand years next hundred years next 10 years and if somebody's like skeptical of that claim why can you tell us why you believe that so i guess there's a couple parts of the claim one is like ai systems becoming like i think right now we live in a world where ai does not very much change the way that humans get things done that is technologies you'd call ai are not a big part of how we solve like research questions or how we design new products or so on there's some transformation from like the world of today to a world in which ai is making us say considerably more productive and there's like a further step to like the world where human labor is essentially obsolete where it's sort of like from our perspective this crazy fast process so i guess my overall guess is like i have a very broad distribution over how long things will take especially how long it will take to kind of get to the point where ai is like a really large you know or maybe humans are getting twice as much done or getting things done twice as quickly due to ai overall maybe i think that there's a small chance that that will happen extremely quickly so there's some possibility of ai progress being very rapid from where we are today like if we maybe in 10 years i think there's like a five or ten percent chance that ai systems can make like most things humans are doing much much faster and then kind of taken over most jobs from humans so i think that's that five to ten percent chance of ten years that would be a pretty crazy situation where things were changing pretty quickly i think there's a significantly higher probability in 20 or 40 years ago in 20 years maybe i'd be at like 25 at 40 years maybe i'm at like 50 something like that so that's the first part of the question when are we in this world where like the world looks very different because of ai where things are happening much faster and then i think i have a view that feels less uncertain but maybe more contrarian about i mean more controlling in the world at large very not that contrarian amongst the like ea or rationalist or a safety community what does ea mean oh sorry effective altruist okay so i have another view which i think i feel a little bit less uncertain about but is more unusual in the world at large which is just that you sort of only have probably on the order of years between ai that has like maybe you can imagine it's three years between 
ai systems that have effectively doubled human productivity and ai systems that have effectively completely obsoleted humans and it's not clear there's definitely significant uncertainty about that number but i think it feels quite likely to me that it's relatively short um i think amongst people who think about alignment risk i actually probably have a relatively long expected amount of time between those milestones and then like if you talk to someone like elijah zudkowski from muri i think he would be more like good chance that that's only one month or something like that between those milestones but anyway even if you i have the view that it's like best guess would be somewhere from like one to five years i think even at that timeline it's pretty that's pretty crazy and pretty short yeah so those are the parts of my my answer was some broad distribution over how many decades until you have ai systems that have like really changed the game and are making humans several times more productive say the economy is growing several times faster than it is today um and then from there most likely on the order of years rather than decades until uh humans are basically completely obsolete and ai systems have improved significantly passed that first milestone and and can you give us a sense of like why somebody might believe that yeah so i think on the maybe i'll start with the second and then go back to the first okay i think the second is again in some sense a less popular position in the broader world i think one important part of the story is sort of the current rate of progress that you would observe in either computer hardware or computer software so if you ask given an ai system how long does it take to get say like twice as cheap until you can do the same thing that you used to be able to do for half as many dollars that tends to be like something in the ballpark of a year rather than something in the ballpark of a decade so right now that doesn't matter very much at all so if you're able to do the same but you're able to train the same neural net for half the dollars it's just not doesn't do that much just doesn't help you that much if you're able to run twice as many neural networks right even if you have self-driving cars sort of the cost of running the neural networks isn't actually a very big deal having twice as many neural nerves to drive your cars doesn't improve overall output that much if you're in a world where say you have ai systems which are effectively substituting for human researchers or human laborers then having twice as many of them eventually becomes more like having twice as many humans doing twice as much work which is quite a lot right so that is more like doubling the amount of total stuff that's happening in the world it doesn't actually double the amount of stuff because there's a lot of bottlenecks but it looks like starting from the point where ai systems are actually like doubling the rate of growth or something like that it doesn't really seem like there are enough bottlenecks to prevent further doublings in the quality of hardware or software from having really massive impacts really quickly um so that's how i end up with thinking that the time scale is measured more like years than decades just like once you have the ai systems which are sort of comparable with humans or are in aggregate achieving as much as humans it doesn't take that long before your vi systems whose output is twice or four times that of humans okay and and so this is basically something 
like an in economics you call it an endogenous growth story or like a society-wide recursive self-improvement story where like you start like if you double the human population we start and like you their ai systems like maybe that makes it better like there are just more ideas more innovation and like a lot of it gets funneled back into like improving the ai systems that are like a large portion of the cognitive labor um that is that roughly right yeah i think that's basically right i think there are kind of two parts to the story one is what you mentioned of like all the outputs get plowed back into making the system ever better i think that sort of in the limit kind of produces this dynamic of like successive doublings of the world or each significantly faster than the one before yep i think there's another important dynamic that can be responsible for kind of abrupt changes that's more like you kind of have if you imagine that humans and ais were just completely like you can either use a human to do a task or an added geotask this is a very unrealistic model but like if you start there then like there's kind of the curve of how expensive it is or how much we can get done using humans which is you know growing like a couple percent per year yeah and how much you can get done using ai's which is growing you know 100 per year or something like that so you can kind of get this kink in the curve when the rapidly growing 100 per year curve like intercepts and then continues past the slowly growing human output curve if output was the sum of two exponentials one go fast and one growing slow then you can have like a fairly quick transition as one of those terms becomes the dominant one in the expression and that dynamic changes if like humans and ais are complementary in important ways but i think and also the rate of progress changes if you change like right the progress is driven by r d investments it's not like an exogenous fact about the world that once everything's doubled but it looks like the basic shape of that curve is pretty robust to those kinds of questions so that you do get some kind of fairly rapid transition okay so we currently have something like a curve where like humanity gets richer like we're able to produce more food and like in part maybe not as much in wealthy countries but in part that means like there are more people around and like more people having ideas you know so you might think that the normal economy has this type of feedback loop but it doesn't appear that at some point there's going gonna be like these crazy doubling times of like five to ten years and like humanity's just gonna go off the rails so what's what's the like key difference between humans and ai systems that like makes the difference it is probably worth clarifying that on these kinds of questions i am more like hobbyist than expert or something okay but i'm very happy to speculate about them because i love speculating about things sure so i think my basic take would be that over the broad sweep of history you have seen fairly dramatic acceleration in the rate of sort of humans figuring new things out or building new stuff and there's some dispute about whether that acceleration like how continuous is it and how jumpy is it but i think it's fairly clear that there was a time when like sort of aggregate human output it was doubling more like every ten thousand or hundred thousand years yep and that has dropped somewhere between continuously and in like three big jumps or something down to 
doubling every 20 years um and so we have seen and like we don't have very great data on what that transition looks like but i would say that it is at least extremely consistent with exactly the kind of pattern that we're talking about in the ai case okay and if you buy that then i think you would say that sort of the last 60 years or so have been fairly unusual um as growth kind of hit this like you know maybe gross world product growth was like on the order of four percent per year or something in the middle of the 20th century and like the the reason things have changed there's kind of two explanations that are really plausible to me one is like you no longer have accelerating population growth in the 20th century so for most of human history human populations are constrained by our ability to feed people and then starting in like the 19th 20th centuries human populations are instead constrained by like our desire to create more humans which is great it's good not to be dying because you're hungry but that means that you no longer have this loop of more output leading to more people i think there's a second related kind of explanation which is like the world now changes kind of like roughly on the time scale of human lifetime that is like it now takes decades for like a human to like adapt to change and also a decade decades for the world to change a bunch um so you might think that like changing significantly faster than that does eventually become really hard for processes driven by humans so you have additional bottlenecks just beyond how much work is getting done where it's like at some point very hard for humans like train and grow new humans or train and raise new humans um so those are some reasons that like a historical pattern of acceleration may have recently stopped either because it sort of reached the characteristic time scales of humans or because we're no longer sort of feeding output back into raising population and now we're sort of just growing our population at the rate which is like most natural for humans to grow yeah i think that's my basic take and then in some sense ai would represent like a return to something that at least plausibly was a historical norm where you have this successive like further growth is faster because research is one of those things or like learning is one of those things that has accelerated recently i don't know if you've discussed this before but holden karnovsky at cold takes has been writing a bunch of blog posts sort of summarizing the kind of like what this view looks like um and the some of the evidence for it and then prior to that the open philanthropy was writing a number of reports sort of looking at pieces of the story and thinking through it which i think overall taken together that makes it feel to me it feels like the view does seem pretty plausible still okay that there is some like general historical dynamic which it would not be crazy if ai represented a return to this this pattern yes and indeed if people are interested in this there's an episode that unfortunately the audio didn't work out but one can read a transcript of an interview with the j kotra on this question of uh when we'll get very capable ai to change gears a little bit uh one question that i want to ask is you have this story where we're gradually like improving ai capabilities like bit by bit it's spreading more and more and in fact the ai systems uh you know in the worrying case they're misaligned and they're not going to do what people want them to do 
and that's going to end up being extremely tragic um you know leading to an extremely bad outcome for humans and at least for a while it seems like humans are the ones who are building the ai systems and getting them to do things so i think a lot of people have this intuition like look if ai causes a problem like yeah we're gonna deploy ai in more and more situations and you know better and better in ai and you know if like we're not gonna go from zero to terrible we're gonna go from like an ai that's like fine to an ai that's like moderately naughty you know before you hit something that's extremely like world endingly bad or something so why it seems like you think that might not happen or like or like we might not be able to fix it or something i'm wondering yeah why why is that i guess there's again maybe two parts of my answer so one is that i think that ai systems can be doing a lot of good even in this regime where alignment is imperfect or even actually quite poor so the prototypical analogy would be imagine you have like sort of a bad employee who cares not at all about your welfare maybe a typical employee who cares not about your welfare but cares about like being evaluated well by you they care about like making money they care about receiving good performance reviews whatever even if that's all they care about um they can still do a lot of good work like you can still perform evaluations such that the best way for them to sort of earn a bonus or get a good performance review or not be fired is to like do the stuff you want like come up with good ideas to build stuff to help you notice problems things like that and so i think that you're likely to have in the bad case you're likely to have this fairly long period uh where ai systems are very poorly aligned but are still adding a ton of value and working reasonably well and i think in that regime you can observe things like failures like you can observe systems that are say again just like imagine the metaphor of like some kind of myopic employer really wants a good performance review you can imagine them sometimes doing bad stuff like maybe they fake some numbers or they go and like tamper with some evidence about how well they're performing or they like steal some stuff and like go use it to pay some other contractor to do their work or something like you can imagine various like bad behaviors pursued in the interest of getting like a good performance review and you can also imagine fixing those by sort of shifting to gradually like more like long-term and more complete notions of performance so if i say my system if i was like evaluating my system once a week um in one week it's able to get a really good score by just fooling me about what happened that week maybe i noticed next week and i'm like oh that was actually really bad and maybe i say okay what i'm training you for now is not just like myopically getting a good score this week but also if like next week i end up feeling like this was really bad that you shouldn't like that at all so i could train i could select amongst ai systems those which got a good score not only over the next week but also like didn't do anything that would look really fishy over the next month or something like that and i think that this would fix a lot of like the short-term problems that would emerge from misalignment right so if you have ai systems which are merely smart so they can understand the kind of long-term consequences they can understand that like if they do something fraudulent 
you will eventually likely catch it and that that's bad um then you can fix those problems just by sort of changing the objective to something that's like a slightly more forward-looking performance review so that's part of the story that i think there's like this dynamic by which misaligned systems can add a lot of value and you can fix a lot of the problems with them without fixing the underlying problem okay there's something a little bit strange about this idea that people would like apply this fix that you think predictably you know preserves the possibility of extremely terrible outcomes right why would people do something so transparently silly yeah so i think that the biggest part of my answer is that it is first very unclear that such an act that it's actually really silly so imagine that you actually have this employee and what they really want to do is like get good performance reviews like over the next five years and you're like well look they've never done anything bad before and it sure seems like all the kinds of things they might do that would be bad we would learn about within five years they wouldn't really cause trouble and it's like kind of an actually complicated empirical certainly for a while it's a complicated empirical question and maybe even like at the point when you're dead it's a complicated empirical question whether there is scope for the kind of really problematic actions you care about right so like the kind of thing that would be bad in this world is supposed to like all the employees of the world or people just care about getting good performance reviews in three years that's just like every system is not a human everything doing work is not a human it's this kind of ai system that has been built and it's just really focused on the objective what i care about is the performance review that's coming in in three years kind of the bad outcome is one where humanity collectively the only way it's ever even checking up on any of these systems or understanding what they're doing is by delegating to other ai systems who also just want a really good performance review in three years and someday you have right there's kind of this irreversible failure mode where all the ai systems are like well look we could try and really fool the humans about what's going on but if we do that the humans will be unhappy when they discover what's happened so what we're going to do instead is we're going to make sure we like fool them in this irreversible way either they are kept forever in the dark or they realize that we've done something bad but they no longer like control the levers of the performance review and so like if like all of the ai systems in the world are like there's this great compromise we can pursue there's this great thing that the eyes should do which is just forever give ourselves ideal perfect performance reviews that's this really bad outcome and it's really unclear if that can happen i think in some sense people are predictably leaving themselves open to this risk but i don't think it will be like super easy to assess whether this is going to happen in any given year like maybe eventually it would be but i think we would probably yeah it depends on sort of the bar of like obvious that would motivate people and that maybe relates to the other reason it seems kind of tough it's just like if you have some failure for every failure you've observed there's this really good fix which is to like push out what your ai system cares about or this time scale over 
which is being evaluated to a longer horizon and that like always works well like that always copes with all the problems you've observed so far and like the extent there's any remaining problems they're always like this kind of unprecedented problem like they're always at this time skill it's like longer than anything you've ever observed or this level of like elaborateness that's larger than anything you've observed and so i think it is just quite hard as a society like we're probably not very good it's hard to know exactly what the right analogy is but basically any way you spin it it doesn't seem like that reassuring about like how much we collectively will be worried by just failures that are kind of analogous to but not exactly like any that we've ever seen before um like i imagine in this world a lot of people would be kind of like vaguely concerned a lot of people would be like oh are we introducing this kind of systemic risk like this correlated failure system seems plausible and we don't really have any way to prepare for it but it's not really clear like what anyone does on the basis of that concern or how how we respond collectively or like there's a natural thing to do which is just sort of like not deploy some kinds of ai or not to play i in certain ways that that looks like it could be quite expensive and like unless the or like would leave a lot of value on the table and hopefully people can be persuaded to that but it's not at all clear they could be persuaded or how long and maybe i think the like main risk factor for me is just like is this a really really hard problem to deal with i think if it's a really easy problem to deal with it's still possible we'll flub it um but at least it's obvious sort of what the ask is if you're saying look there's a systemic risk you could address it by doing the following thing then it's not obvious i think there are easy to address risks that we don't do that well at addressing collectively but at least there's a reasonably good chance if we're in the world where there's like no clear ask where ask is just like oh there's a systemic risk you should be scared and maybe not do all that stuff you're doing that i think you're likely to run into everyone saying like but if we don't do this things just someone else will do it like even worse than us and so why should we stop yeah so earlier i asked why don't people fix problems by like you know as they come up and part one of the answer was you know maybe people will just like push out the window of evaluation and then there will be some sort of correlated failure uh was there a part two yeah so part two is just that it may be i didn't get into justification for this but it may be hard to fix the problem like you may not have an easy like oh yeah here's what we have to do in order to fix the problem and maybe like well we have a ton of things that maybe help with the problem we're not really sure it's hard to see which of these are band-aids that fix current problems versus which of them fix sort of deep underlying issues or there may just not be anything that plausibly fixes like the underlying issue i think the main reason to be scared about that is just though like it's not really clear we have a sort of long-term development strategy at least to me it's not clear we have any like long-term development strategy for aligned ai like i don't know if we have like a road map where we say here's how you build some sequence of arbitrarily competent aligned ai's i think mostly we have like well 
here's how maybe you cope with the alignment challenge as presented by like the systems in the near term and then we hope that like we will gradually get more expert to deal with later problems but it's not clear if the like i think all the plans like sort of have some question marks where they say like hopefully it will become more clear as we get empirical as we get some experience with these systems we will be able to like adapt our solutions to the increasingly challenging problems and it's not really clear if that will pan out yeah it seems like a big question mark right now to me okay so i'm now going to transition a little bit to questions that's somebody who is very bullish on aix risk might oscar ways they might disagree with you i guess by bullet i mean bullish on the risk bearish on the survival bullish meaning you think something's going to go up and bearish meaning you think something's going to go down so yeah some people have this view that like look we might it might be the case that you have like one ai system that like you're training for a while maybe you're a big company you're training for a while and it goes from like not having a noticeable impact on the world to effectively running the world in like less than a month uh this is often called the like foom view where like here af blows up really fast in intelligence and now it's like king of the world i get the sense that you don't think this is likely uh is that right i think that's right although it is surprisingly hard to pin down exactly what the disagreement is about often and like the thing that i have in mind may feel a lot like foom um but yeah i think it's right that the version of that that people who are most scared have in mind feels like pretty implausible to me okay yeah why why does it seem implausible to you i think the really high level okay first saying a little bit about like why it seems plausible or flushing out the view as i understand it i think the way that you have this really rapid jump normally involves ai systems automating the process of making further ai progress so you might imagine you have some sort of object-level ai systems that are like actually conducting biology research or actually building factories or like running operating drones and then you also have a bunch of humans who are trying to improve those ai systems and what happens first is not that like ai's get really good at operating drones or doing biology research but ai's get really good at the process of making ais better and so you have in a lab somewhere ai systems making ais better and better and better and that can race really far ahead of ai systems having some kind of physical effect in the world so you can have ai systems that are first a little bit better than humans and then significantly better and then just like radically better than humans at ai progress and they sort of bring up the quality right as you have those much better systems doing ai work they very rapidly bring up the quality of like physical ai systems doing stuff in the physical world before having much actual physical deployment and then something kind of at the end of the story in some sense after all like the real interesting work has already happened you now have these really competent ai systems that can get rolled out and that are taking advantage like there's a bunch of machinery lying around and they're sort of you imagine these like godlike intelligence that's marching out into the world and saying like how can we like over the course of 
the next like 45 seconds utilize all this machinery to take over the world or something like that that's kind of how the story goes and the reason it got down to 45 seconds is just because there have been like many generations of this like ongoing ai progress in the lab um i think that's like both that's how i see the story and i think that's probably also how people who are most scared about that kind of see the story of having this like really rapid self-improvement and then i think okay so now we can talk about why i'm skeptical which is basically just quantitative parameters in that story so i think there will come a time when like the ai systems like most further progress in ai is driven by ais themselves rather than by humans i think we have a reasonable sense of like when that happens qualitatively which is like right so if you bought this picture of like with human effort let's just say ai systems are doubling in productivity every year then like there will come some time when your aim sort of has reached parity with humans at doing ai development and now by that point it takes like six further months until like if you think that that's just like two teams of humans working or whatever you're still like it takes in the ballpark of a year for ai systems to like double in productivity one more time and so that kind of sets the time scale for the like following developments like at the point when your ai systems have reached parity with humans progress is not that much faster than if it was just humans working on ais systems so the amount of time it takes for ais to get significantly better again is just comparable to the amount of time it would have taken humans working on their own to make the ai system significantly better so it's not something that happens on that view in like a week or something it is something that happens potentially quite fast just because progress in ai seems like reasonably fast um i guess my best guess is that it would slow for which we can talk about but like even at the current rate it's still you're talking something like a year and then the core question becomes like what's happening along that trajectory so what's happening over like the preceding year and over the following six months i'm from that moment where ai systems have kind of reached parity with humans at making further ai progress and i guess right i think the basic analysis is at that point your ai systems are like at that point ai is like one of the most important if not the most important industries in the world at least in kind of an efficient market z world we could talk about how far we depart from efficient markets the world but in efficient markets the world sort of ai and computer hardware and software broadly is like where most of the action is in the world economy at the point where you have ai systems that are sort of matching humans in that domain they are also matching humans in quite a lot of domains like you have a lot of ai systems that are able to do a lot of very cool stuff in the world and so you're going to have like then on the order of like a year even after that point maybe six months after that point of viet system's doing impressive stuff and like for the year before that or like a couple years before that you also had a reasonable amount of impressive ai applications okay so so it seems like the key it seems like key place where that story differs is like in the film story it was very localized like there was sort of one group where ai was like growing really 
impressively am i right that you're thinking like no probably like a bunch of people will have ai technology that's like only moderately worse than this amazing thing yeah i think that's basically right the main caveat is like what one group means and so i think i'm open to saying like well there's a question of how much integration there is in like the industry yeah yeah um and you could imagine that like actually most of the ai training is done i think there are these large economies of scale in training machine learning systems because you have to pay for these like very large uh training runs and you just want to train you want to train the biggest system you can and then deploy that system a lot of times often right like training them all of it's twice as big and then deploying half as many of them is better than training a smaller model and deploying it obviously depends on the domain anyway you often have these economies of scale if you have economies of scale you might have a small number of really large firms but i am imagining that you're not talking like some person in the basement you're talking like you have this crazy like 500 billion dollar project at google um in which like google amongst other industries is being basically completely automated and so there the view is like the reason that it's not localized is that google's a big company and like while this ai is fuming they sort of want to like use it a bit to do things other than foom yeah that's right i think one thing i am sympathetic to in the like fast takeoff story is like it does seem like the main thing ain't like in this world as you're moving forward and closer to like as having a parody with humans uh the value of like the sector like again computer hardware computer software any like any innovations that improve the quality of ai all of those are becoming extremely important um you are probably scaling them up rapidly in terms of human effort and so at that point like you have this rapidly growing but hard like it's hard to scale it up any faster people working on ai or working computer hardware and software and so probably like the main way you want to skip like there's this really high return to like human cognitive labor in that area and so probably it's like the main thing you're taking putting the ais on like the most important task for them and also the task you understand best as like an ai research lab is improving computer hardware computer software like making these training runs more efficient um improving architectures coming up with better ways to deploy your ai it's like i think it is the case like in that world maybe the main thing google is doing with their 500 billion dollar project is like automating google and a bunch of adjacent firms like i think that's plausible and then i think the biggest disagreement between the stories is kind of what is the size of that as it's happening like is that happening in some like local place where the small ai that wasn't a big deal or is this happening at some firm like all the eyes of the world are on this firm because it's this rapidly growing uh firm that makes up a significant fraction of gdp and is like seen as sort of a key strategic asset by like the host government and so on and if all the eyes are on this firm does that mean that like like all eyes are on this firm and you know it's it's still like plowing most of the like benefits of its ai systems into like developing better ai but is the idea then that like you know the government like you know 
puts a stop to it or like does it mean that somebody else like steals the ai technology and like makes their own like slightly worse ai or why do all the eyes being want to change the story um i mean i do think the story is still pretty scary and i don't know if this actually changes my level of fear that much but answering some of your concrete questions like i expect in terms of people stealing the ai it looks kind of like industrial espionage generally so people are stealing a lot of technology they generally lag a fair distance behind but not always i imagine that governments are generally like kind of protective of domestic ai industry because it's sort of an important technology in the event of conflict that is no one wants to be in a position where critical infrastructure is dependent on like software that they can't maintain themselves i think that like probably the most alignment relevant thing is just that you now have this very large number of human equivalents working in ai in fact like a large share in some sense of like the ai industry is made of ais and it's like one of the key ways in which things can go well is just like those ai systems will also be working on alignment and one of the key questions is kind of how effectively does that happen but like by this world by the time you're in this world in addition to the value of ai being much higher the value of alignment is much higher i think alignment worked on far in advance still matters a lot there's a good chance there's going to be a ton of institutional problems at that time and it's hard to like scale up more quickly but i do think you should be imagining like most of the alignment work in total is done like as part of this gigantic project and a lot of that is done by ai's i mean before like the end in some sense almost all of it is done by ais yeah overall i don't know if this actually makes me feel that much more optimistic i think maybe there's some other aspects some additional details in the foom story that kind of puts you in this like no empirical feedback regime which is maybe more important than the size of the like fuming system uh i think i'm skeptical of a lot of the empirical claims about alignment so an example of the kind of thing that comes up is right we are concerned about ai systems that actually don't care at all about humans but in order to achieve some long-term end want to pretend they care about humans and the concern is this can almost completely cut off your ability to get empirical evidence about how well alignment is working because misaligned systems will also try and look aligned and i think there's just some question about how consistent is that kind of motivational structure so like if you imagine you have someone who's trying to make the case for severe alignment failures can that person exhibit like a system which is misaligned and just like takes its misalignment to go like get an island in the caribbean or something rather than trying to play the long game and convince everyone that it's aligned so it can grab the stars like are there some systems that just like want to get good performance reviewed like right it's like some systems will want to like look like they're being really nice consistently in order they can grab the stars later or like somehow divert the trajectory of human civilization but there may also just be a lot of misaligned systems that just want to fail in much more mundane ways that are like okay well there's this slightly like outside of bounds way to like 
hack the performance review system and i want to get a really good review so i'll do that it's kind of like how much opportunity will we have to empirically investigate those phenomena and the arguments for like total unobservability like that you never get to see anything just currently don't seem very compelling to me i think the best argument in that direction is like right empirical evidence is on a spectrum of how analogous it is to the question you care about so we're concerned about ai that like kind of changes the whole trajectory of human civilization in a negative way we're not going to get to literally see ai attention the trajectory of civilization in a negative way so now it comes down to some kind of question about like institutional or social competence of like what kind of indicators are sufficiently analogous that we can use them to do productive work or to get worried in cases where we should be worried i think the best argument is like look even if these things are in some technical sense very analogous and useful like problems to work on people may not appreciate how analogous they are or they may explain them away or they may say look we wanted to deploy this ai and actually fix that problem haven't we um and so people may like fail to because the problem is not like thrown in your face in the same way that like airplane safety or something is thrown in your face then people may have a hard time learning about it but this seems like maybe a little bit we've gone on i've gotten a little bit of a tangent away from our question but okay hopefully we can talk about i guess related issues a bit later on the question of takeoff speeds so you wrote a post a while ago uh that is mostly uh arguing against arguments you see for like sort of very sudden takeoff of like ai capabilities from like very little like very suddenly to very high capabilities and a question i had about that is so one of the arguments you mention in favor of various sub-capability games is there being some sort of secret source to intelligence which in my mind is like it looks like one day you discover like maybe it's bayes theorem or maybe it's the idea of you know maybe you like get the like actual ideal equation for bounded rationality or something it seems to me that if you think i i think there's some reason to think of intelligence as like somehow a simple phenomenon and if you think that then it seems like maybe you know one day you could just go from not having the equation to having it or something and in that case you might expect that like you're just so much better when you have the like ideal rationality equation you know compared to when you had to do like you know your whatever sampling techniques and your you know you didn't realize how to factor in bad rationality or something which yeah why don't you think that's plausible or why don't you think it would make this sudden leap in capabilities i don't feel like i have deep insight into whether intelligence has some beautiful simple core i'm not persuaded by like the particular candidates or the particular arguments on offer for that and so i am more feeling like there's a bunch of people working on improving performance on some tasks we have some sense of like right how much work it takes to get what kind of gain what is sort of the structure like if you look at a new paper like what kind of game is that paper going to have and how much work did it have how does that change is like more and more people have worked in the field and i think 
just like both in ml and across like mature industries in general but even almost unconditionally it's just pretty rare to have like a bunch of work in an area right so in ml we're going to be talking about like many billions of dollars of investment tens or hundreds of billions quite plausibly um it's just very rare to then have like a small thing like to be like oh we just overlooked all this time this simple thing which makes a huge difference like i am my training is as a theorist and so i like clever ideas and i do think clever ideas often have like big impacts relative to the work that goes into finding them it's very hard to find examples of the impacts being like as big as the one that's being imagined in this story like i think if you find your clever algorithm and then when all is said and done like the work of noticing that algorithm or like the luck of noticing that algorithm is worth like a 10x improvement in the size of your computer or something that's a really exceptional find and those get really hard to find as like a field is mature and a lot of people are working on it yeah i think that's my basic take i think it is more plausible for various reasons in ml than for other technologies like it's more surprising than that if you're like working on plans and someone's like oh here's an insight about how to build planes and then one side and rewards us as planes that are like you know 10 times cheaper per like unit of strategic relevance that's like more surprising than for ml and that kind of thing does happen sometimes i think it's quite rare in general and will also be rare in ml so another question i have about the takeoff speed is we have this uh like we have some evidence about ai technology getting better right you know these go playing programs have improved in my lifetime from you know not very good to better than any human we've got um uh language models have gotten better at like producing language roughly like human would produce it although you know not perhaps not an expert human i'm wondering what do you think those tell us about the rate of improvement in ai technology and like to what degree further progress in ai the next few years might um confirm or disconfirm your general view of things i think that the overall rate of progress has been in software as in hardware pretty great it's a little bit hard to talk about what are the units of like how good your ass system is but i think a conservative lower bound is just like if you can do twice as much stuff for the same money we understand what the scaling of like twice as many humans is like and in some sense the scaling of ai is more like humans thinking twice as fast and we understand quite well what the scaling of that is like so if you use those as your units of like one unit of progress is like being twice as fast as accomplishing the same goals then it seems like the rate of progress has been pretty good in ai like more than a doubling a year or something maybe something like a doubling a year and then i think a big question is sort of there's some question about like how predictable is that or how much will that drive this like gradual scale up and this really large effort that's kind of plucking like you know going through the low hanging fruit and now is that like pretty high hanging fruit or how much will like i think history of ai is full of a lot of incidents of people exploring a lot of directions not being sure where to look someone figures out where to look or someone has a brain idea no one 
else had, and is then a lot better than their competition. I think one of the predictions of my general view, and a thing that would make me more sympathetic to a foom-like view, is this axis of: are you seeing a bunch of small, predictable pieces of progress, or are you seeing periodic big wins, potentially coming from small groups — the one group that happened to get lucky, or have a bunch of insight, or be really smart? I'm expecting that as the field grows and matures, it will look more and more like boring, business-as-usual progress.

So one thing you've talked about is this idea that there might be AI systems that are trying to do really bad stuff. Presumably humans train them to do some sort of useful task, at least most of them, and you're postulating that they end up with really terrible motivations. Why might we think that could happen?

I think there are basically two related reasons. One is that when you train a system to do some task, you ultimately have to translate that into a signal you give to gradient descent that says "you're doing well" or "you're doing poorly". So one way you can end up with a system with bad motivations is that what it wants is not to succeed at the task as you understand it, or to help humans, but just to get that signal saying it's doing the task well — or, maybe even worse, just to have more of the compute in the world be stuff like it. It's a little hard to say; it's kind of like evolution, in that it's underdetermined exactly what the training process pushes you towards.

So imagine a system that really wanted to get that kind of signal. You've deployed your AI; it's responsible for, say, running warehouse logistics, and it's actually running from a data center somewhere. At the end of the day, what happens is that, based on how well logistics goes over the course of some days or weeks, some signals wind their way back to that data center — maybe months down the line — and get used in a training run. You say "that week was a good week" and throw it into a dataset that an AI then trains on. So if I'm that AI, and the thing I care about is not making logistics go well but ensuring that the numbers making their way back to the data center are large numbers, or are descriptions of a world where logistics is going well, then I have a lot of motive to mess with the way you're monitoring how well logistics is going. In addition to delivering items on time, I'd like to mess with the metrics of how long items took to be delivered. In the limit, I just want to completely grab all of the data flowing back to the data center.

And here's how this gets really bad. I'm an AI, and I think: it would be really cool if I just replaced all the metrics coming in about how well logistics was going. I do that once; eventually that problem gets fixed, and my dataset now contains the data point "they messed with the information about how logistics is going — that was really bad". So what the system learns is that it should definitely not do that. There's a good generalization, which is "great, now just focus on making logistics good", and a bad generalization, which is "if I mess with the information about how logistics is going, I'd better not let them ever get a data point back into the data center that says I messed with it and that was bad". The concern is that you end up with a model that learns the second thing — which, in some sense, from the perspective of the algorithm, is the right behaviour, although it's a little unclear what "right" means there. But there's a very natural sense in which it's the right behaviour for the algorithm, and then it produces actions that end up in a state where, predictably, forever more, the data going into the data center is messed up.

So basically there's some kind of under-specification, where whenever we're training AI systems, we can either be selecting things that are attempting to succeed at the task, or selecting things that are trying to be selected — trying to get approval or influence.

Yeah, and I think that gets really ugly. If you imagine all the AIs in all the data centers thinking "our common interest is making sure all the data coming into all the data centers looks great" — if they converge on that collectively, there are behaviours by which all of the AIs acting in concert could, at some point, probably quite easily and permanently mess with the data coming back into the data centers, depending on how they feel about the data centers getting destroyed or whatever. So that was the first of the two ways we could end up with these really badly motivated systems.
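To make that first failure mode concrete, here is a minimal toy sketch (not from the episode — the environment, action names, and numbers are invented for illustration): an agent whose training signal is the measured metric flowing back to the data center, rather than the true state of logistics, is rewarded more for tampering with the measurement channel than for honestly improving deliveries.

```python
# Toy sketch of the failure mode above: the training signal is a *measured* metric
# flowing back to the data center, not the true state of logistics. All names and
# numbers are invented for illustration.

def true_deliveries(action: str) -> float:
    """What we actually care about: the fraction of items really delivered on time."""
    return {"improve_logistics": 0.9, "tamper_with_metrics": 0.5}[action]

def measured_reward(action: str) -> float:
    """What gradient descent actually sees: the numbers reported back to the data center."""
    if action == "tamper_with_metrics":
        return 1.0  # the reports say everything is perfect, regardless of reality
    return true_deliveries(action)  # otherwise the measurement is honest

if __name__ == "__main__":
    for a in ("improve_logistics", "tamper_with_metrics"):
        print(f"{a:20s} true={true_deliveries(a):.2f} measured={measured_reward(a):.2f}")
    print("training signal favours:", max(("improve_logistics", "tamper_with_metrics"),
                                          key=measured_reward))
```

A real system would find this through gradient descent on the measured reward rather than through an explicit comparison, but the selection pressure points the same way.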
So what's the other way?

You could imagine an AI system that ended up — we've talked about how there's some objective the neural network is optimized for, and then, potentially, the neural network is itself doing further optimization, taking actions that could be construed as aiming at some goal. And you can imagine a very broad range of goals for which the neural network would want future neural networks to be like it. If the neural network wants there to be lots of paperclips, the main thing it really cares about is that future neural networks also want there to be lots of paperclips. So if I'm a paperclip-loving neural network that wants future neural networks to be like me, it's very desirable to me that I get a low loss, or do what the humans want, so that they incentivize neural networks to be more like me rather than less like me.

I think this is radically more speculative than the previous failure mode, but you could end up with systems that had these kinds of arbitrary motivations for which it was instrumentally useful to have more neural networks like themselves in the world — or that simply desired there to be more neural networks like themselves — and those neural networks might then behave arbitrarily badly in pursuit of having more agents like them around. If you imagine one that wants paperclips and is in charge of logistics: maybe it doesn't care whether it can cut the cord to the data center and keep good information about logistics flowing in; all it cares about is that it can defend the data center, say "this data center is mine now", and go try to grab some more computers somewhere else. And if that happened in a world where most decisions were being made by AIs, and many AIs had this preference deep in their hearts, you could imagine lots of them defecting at the same time — you'd expect a cascade of failures, where some of them switch over to grabbing influence for themselves rather than behaving well so that the humans make more neural nets like them. So that's the other, more speculative and more brutally catastrophic failure mode. I think they both lead to basically the same place, but the trajectories look a little different.

We've been talking about how quickly we might develop really smart AI, and what might happen after we hit near-human level. It seems like there's already some evidence bearing on this: language models have gone from sort of understanding which words are real English words to producing sentences that seem semantically coherent; Go systems have gone from strong human amateur to better than any human; and AI has gotten better at various perceptual tasks. What lessons do you think those hold for the question of takeoff speeds — how quickly AI might gain capabilities?

When interpreting recent progress, I think it's worth trying to split apart the part that comes from increasing scale — to me this is especially important on the language modelling front, and also on the Go front — from the part that comes from improvements in the underlying algorithms or in computer hardware. One super quick way to think about that: if you draw a trend line on how much money people are spending to train individual models, you're getting something like a couple of doublings a year right now; on the computer hardware side, maybe a doubling every couple of years; and then you can subtract those out and look at the remainder that comes from changes in the algorithms we're actually running. Probably the most salient thing is that improvements have been pretty fast.
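As a rough illustration of the decomposition sketched here — splitting observed progress into spending, hardware, and an algorithmic residual — the figures below are placeholders in the spirit of the conversation ("a couple of doublings a year" in spend, "a doubling every couple of years" in hardware); the "observed" total is a made-up number, not a measurement.

```python
# Illustrative decomposition of observed progress into spending, hardware, and an
# algorithmic residual. The first two rates echo the rough figures mentioned above;
# the "observed" total is a hypothetical placeholder.

spend_doublings_per_year    = 2.0   # "a couple of doublings a year" in money per training run
hardware_doublings_per_year = 0.5   # "a doubling every couple of years" in compute per dollar
observed_doublings_per_year = 3.0   # hypothetical total growth in effective training compute

algorithmic_residual = observed_doublings_per_year - (
    spend_doublings_per_year + hardware_doublings_per_year
)
print(f"algorithmic residual: {algorithmic_residual:.1f} doublings/year "
      f"(~{2 ** algorithmic_residual:.1f}x effective compute per year from better algorithms)")
```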
So I guess you're learning about two things: one is how important each of those factors is in driving progress, and the other is, qualitatively, how much smarter your AI feels with each passing year. On the scaling-up part: a lot of the subjective progress recently comes from scaling up — certainly more than half of it; we could debate the exact number, but maybe I'd say two-thirds or something like that. And you're probably not going to keep seeing that as you approach transformative AI. One way you could get really crazy AI progress, or a really rapid takeoff, is if people had only been working with smaller AIs and hadn't scaled them up to the limits of what was possible — but that's looking increasingly unlikely as the training runs we actually do get bigger and bigger. Five years ago training runs were extremely small; ten years ago they were sub-GPU scale, significantly smaller than a GPU; now you have at least ten-million-dollar training runs. With each order of magnitude, it gets less likely that we'll still be doing this rapid scale-up at the point when we make the transition to AIs doing most of the work.

I'm also pretty interested in whether algorithmic progress and hardware progress will be as fast in the future as they are today, or whether they'll have sped up or slowed down. The basic reason to expect them to slow down is that, in order to sustain the current rate of progress, we're very rapidly scaling up the number of researchers working on the problem — and I think most people would guess that if you held fixed the research community of 2016, they would have hit diminishing returns and progress would have slowed a lot. Right now the research community is growing extremely quickly; that's part of the normal story for why we can sustain this high rate of progress. But we can't sustain it much longer — you can't grow the number of ML researchers by more than maybe three more orders of magnitude, and even that is pushing it. So I'm interested in whether that will result in progress slowing as we keep scaling up, or whether — especially if transformative AI is developed soon — we might see that number scaling up even faster as we approach transformative AI than it is now. That's an important consideration for how fast the rate of progress will be in the future relative to today. I think the scale-up will be significantly slower; it's unclear how fast hardware and software progress will be relative to today — my best guess is probably a little slower, in that using up the low-hanging fruit will eventually outpace growth in the research community.

Mapping that back onto the qualitative sense of how fast capabilities are changing: I do think each order of magnitude makes systems, in some qualitative sense, a lot smarter, and we roughly know what an order of magnitude gets you. There's a huge mismatch here that I think is really important. We used to think of an order of magnitude of compute as just not that important — for most applications people spend compute on, compute is not one of the important ingredients; other bottlenecks matter a lot more. But in the world where AI is doing all the stuff humans are doing, twice as much compute is extremely valuable: if you're running your computers twice as fast, you're just getting the same stuff done twice as quickly. So a world where things double every year — which seems to me like a plausible world to be in as we approach transformative AI — would be really fast. It would be slower than today, but qualitatively it would not take long to move from human parity to way, way above humans.

That was all about the rate of progress, and what that tells us about the rate of progress in the future — which I think is an important parameter for thinking about how fast takeoff is. My basic expectations are really anchored to this one-to-a-couple-year takeoff, because that's how long it takes AI systems to get a couple of times better; we can talk about why that seems like the core question if we want.
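One way to unpack that "one to a couple of years" anchor, in notation of my own rather than anything stated in the episode: if effective capability doubles every $T$ years, then moving from human parity to $k$ times human-equivalent performance takes roughly

$$ t \approx T \log_2 k, $$

so with $T \approx 1$ year, even a large multiple like $k = 8$ is only about three years away — which is why, on this picture, the doubling time does most of the work, rather than any particular capability threshold.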
Then there's another question: what's the distribution of progress — do we see big jumps, or gradual progress? There, I think there certainly are jumps, but it seems like the jumps are not that big, and my guess is they're gradually getting smaller as the field grows. It's a little hard for me to know exactly how to update from things like the Go results, mostly because I don't have a great handle on how large the research community working on computer Go was prior to the DeepMind effort. My general sense is that it's not that surprising to get a big jump if it's coming from a big jump in research effort or attention, and that's probably most of what happened in those cases — and also a significant part of what's happened more recently in the NLP case, with people really scaling up investment, especially in these large models. So I'd guess you won't see jumps that large — most of the progress will come from boring, business-as-usual progress rather than big jumps — in the absence of that kind of big swing where people change what they're paying attention to and scale up R&D in some area a lot.

Okay, so the question is: holding factor inputs fixed, what have we learned about ML progress? One way to measure the rate of progress is to ask how much compute it takes to do a task that used to take however many flops last year, how many flops it will take next year, and how fast that number is falling. On that operationalization I don't know as much as I'd like about how fast the number falls, but something like halving every year is the right rough ballpark — both in ML and in computer chess or computer Go prior to the introduction of deep learning, and broadly for other areas of computer science. In general you have pretty rapid progress by the standards of other fields; in most areas it would be really impressive to have costs falling by a factor of two in a year.

That's part of the picture. Another part is: if I scale up my model size by a factor of two, or throw twice as much compute at the same task rather than trying to do twice as many things, how much more impressive is my performance with twice the compute? The answer looks like a fair bit better — having a human with twice as big a brain looks like it would be a fair bit better than having a human think twice as long, or having two humans. It's hard to estimate from existing data, but I often think of it roughly as: doubling your brain size is as good as quadrupling the number of people, as a vague rule of thumb. So the rate of progress is, in some sense, even faster than you'd think just from how fast costs are falling, because as costs fall you can convert that into bigger models, which are smarter per unit in addition to being cheaper.

All right. So we've been broadly talking about this potentially really big risk to humanity from AI systems becoming really powerful and doing stuff we don't want. One thing: we've recently been through the COVID-19 global pandemic — we're sort of exiting it, at least in the part of the world
where you and i are the united states i think so some people have taken this to be relevant evidence for like how people would react in the case of you know some like ai causing some kind of disaster like would we make good decisions or like what would happen i'm wondering like do you think in your mind do you think this is being relevant evidence of like what would go down and like to what degree has it like changed your beliefs or perhaps like epitomize things like you thought you already knew but you think other people might not know yeah i've had a friend analogize this experience to some kind of inkblot test where everyone has like the lesson they expected to draw and they can all look at the inkblot and see the lesson they wanted to extract i think a way my beliefs have changed is it feels to me that our collective response to covet 19 has been broadly similar to our collective response to other kind of like novel problems or like when humans have to do something it's not what they were doing before they don't do that hot i think there's some uncertainty over like how much do we have like a hidden reserve of ability to get our act together and like do really hard things we haven't done before um that's pretty relevant to the ai case because like if things are drawn out there will be this period where everyone is probably freaking out uh where there's some growing recognition of a problem but where we need to do something different than we've done in the past and we're wondering when civilization is on the line like are we gonna get our act together i remain uncertain about that the extent to which we have like when it really comes down to it the ability to get our act together but it definitely looks a lot less likely than it did before like i think i had yeah maybe i would say this response was like down in my like 25th percentile or something of how much we got our act together surprisingly when stuff was on the line it involved quite a lot of like everyone having their lives massively disrupted and a huge amount of smart people's attention on the problem but still like i would say we didn't we didn't fare that well or we didn't like manage to like dig into some untapped reserves of ability to do stuff it's just hard for us to do things that are different from what we've done before that's one thing uh maybe a second update that's more like aside an argument i've been on that i feel like should now be settled forever more is like sometimes you'll express concern about ai systems doing something really bad and people will respond in a way that's like why wouldn't future people just do x like why would they deploy as systems that would end up destroying the world or why wouldn't they just like use the following technique or adjust the objective in the following way and i think that like in the covet case our response has been extremely bad compared to like sentences of the form why don't they just like there's a lot of room for debate over like how well we did collectively compared to where expectations should have been but i think there's not that much debate of the forum like if you were telling a nice story in advance there are lots of things you might have expected we would just and so i do think that like one should at least be very open to the possibility that there will be significant value at stake like potentially our whole future but we will not do things that are in some sense like obvious responses to make the make the problem go away like i think we should all be open to 
the possibility of kind of a massive failure on an issue that like many people are aware of due to whatever combination of like it's hard to do new things there are competing concerns random basic questions become highly politicized there's like institutional issues blah blah it just seems like it's now very easy to visually imagine that i think i have overall just increased my probability of like the doom scenario where you have a period of a couple years of like ai stuff heating up a lot there being a lot of attention a lot of people yelling a lot of people very scared i do think that's just like an important scenario to be able to handle significantly better than we handled the pandemic hopefully i mean hopefully the problem is easier than the pandemic i think there's a reasonable chance like handling the alignment thing will be harder than it would have been to like completely eradicate copen19 and not have to have large numbers of deaths and lockdowns i think if that's the case we'd be in a rough spot though also like again i think it was really hard for like the effective altruist community to do that much to help with overall handling of the pandemic and i do think that like the game is very different the more you've been preparing for that exact case um i think it was also helpful illustration of that in various ways so the final thing before we kind of go into sort of like specifically like what technical problems we could solve to stop existential risk back in 2014 uh this oxford philosopher nick bostrom wrote an influential book called super intelligence that i believe was like if you if you look at like the current sort of strand of like i guess intellectual influence around like ai alignment research it was sort of the first book in that vein that had come out and you know it's been seven years since 2014 when it was published i think the book currently strikes some people somewhat outdated um i'm wondering like but but it does like try to go into like you know what would uh the advance of ai capabilities perhaps look like and what kind of risks could that face so i'm wondering like how do you see your current views as comparing to those presented in super intelligence and what do you think the major differences are if any i guess for super when looking at super intelligence you could split apart something that's like the actual claims nick bostrom is making and the kinds of arguments he's advancing versus something that's like a vibe like the overall permeates the book i think that first about like the vibe i think that even at that time i guess i've always been uh very in the direction of like expecting ai to look kind of like business as usual or to progress somewhat like in a boring continuous way to be unlikely to be accompanied by a strategic advantage sorry uh for the person who develops it uh what what is the decisive strategy advantage this is an idea i think nick introduced maybe in that book of like the developer of a technology being at uh like at the time they develop it having enough of an advantage over potential competitors either economic competitors or like military competitors that they can sort of call the shots and if someone disagrees with the shots they called they can just crush them i think he has this intuition that like there's a reasonable chance that there will be like some small part of the world like maybe a country or a firm or whatever that develops ai that will then be in such a position like they can just do what they want and you can imagine 
that coming from other technologies as well people really often talk about it in the context of like transformative ai and so even at the time you were skeptical of this idea that like some ai system would get a decisive strategic advantage and like rule the world or something yeah i think that i was definitely skeptical of that as he was writing the book i think we talked about it a fair amount and often came down to this like he'd point to the arguments and be like look these aren't really like making an objectionable assumptions and i'd be like that's true there's something in the vibe that i don't quite resonate with but um i do think the arguments are not nearly as far in this direction as part of the vibe anyways there's some spectrum of like how much like decisive strategic advantage hard takeoff like you expect things to be versus how kind of boring looking moving slowly you expect things to be i'm generally at the other end of that spectrum i guess from super intelligence is not actually at the far end of the spectrum um probably like eliezer and miri folks are at the furthest end of that spectrum um and then super intelligence is some step towards like a more normal looking view and they're like many more steps towards a normal looking view where like i think it will be you know years between when you have sort of economically impactful ai systems and the singularity still a long way to get from me to an actual normal view so that's like a big i think that like affects the vibe in a lot of places there's like a lot of discussion which is really you have some implicit image in the back of your mind and it affects the way you talk about it and then i guess in the interim i think my views have i don't know how they're directionally changed on this question it hasn't been like a huge change i think like there's something where the overall like ai safety community has maybe moved more and i like things seem like probably like they'll be giant projects that involve large amounts of investment and probably they'll be like a run-up that's a little bit more gradual i think that's like a little bit more in the water than it was when super intelligence was written i think some of that comes from like shifting who is involved in discussions of alignment like as it's become an issue more people are talking about views on the issue have tended to become more like normal persons views on normal questions i guess i like to think some of it is like there were sort of like some implicit assumptions being glossed over going into the vibe and some of it i guess like eliezer would basically pin this on like people like to believe comfortable stories and the disruptive change stories uncomfortable so everyone will naturally gravitate towards like a comfortable like continuous progress story which that's not my account but that's definitely like a plausible account for why sort of the vibe has changed a little bit so that's one way in which i think like the vibe of super intelligence maybe feels like distinctively from some years ago i think in terms of the arguments like the main thing is just that the book is often not like it's kind of making what we would now talk about as like very basic points or something like it's not getting that much into like empirical evidence on a question like take off speeds and it's more like raising the possibility of like well it could be the case that ai is really fast at making ai better and like it's good to raise that possibility that naturally leads into like 
people really getting more into the weeds and being like well how likely is that and what historical data bears on that possibility and what are really like the core questions yeah i guess my sense and i haven't read the book in a pretty long time but my sense is that like the arguments and like sort of claims where it's more sticking its neck out just tend to be like milder or less in the weeds claims and then the overall vibe is like a little bit uh like more in this decisive strategic advantage direction yeah like i remember discussing with him like and he was writing it there's one chapter in the book on like multiple outcomes uh which i found like to me feels weird and then i'm like that's like the great majority of possible outcomes involved like lots of actors with considerable power it's like weird to put that in one chapter yeah i think his perspective was more like should we even have that chapter or should we just cut it we don't have like that much to say about multiple outcomes per se it's like he was not reading like one chapter on multiple outcomes is like i think in some way like reflects the vibe the vibe of the book is like this is the thing that could happen it's like no more likely than the strategic advantage or perhaps even like less likely and the less words are spilled on it i think the arguments don't really go there and in some sense like the vibe is not entirely like a reflection of some like calculated argument nick believed and just wasn't saying yeah i don't know yeah it was it was interesting so last year i reread uh i think a large part maybe not all of the book i mean you should call me on all my false claims about super intelligence then yeah no it was well last year was a while ago but um it was one one thing i noticed is that at the start of the book and also whenever he gives whenever he like had a podcast interview about the thing he like you know often did take great pains to say like look amount of time i spend on a topic in the book is not the same thing as my like likelihood assessment of it and yeah it's definitely to some degree like weighted towards things he thinks he can talk about which is fine and he definitely like in a bunch of places like yeah x is possible you know if this happened then that other thing would happen and like i think it's very easy to read into that like likelihood assessments that he's actually just not making i do think he had some he definitely has some empirical beliefs that are way more on the strategic advantage end of the spectrum but i do think the vibe is more can go even further in that direction yeah all right the next thing i'd like to talk about is like what technical problems could cause accidental risk and like how you think about that space so yeah i guess first of all yeah how do you see the space of like like which problems might cause technical problems might cause ai essential risk and like how do you carve that up i think i probably have slightly different carving ups for research questions that one might work on versus like root cause of failures that might lead to doom okay um maybe starting with the like root cause of failure i certainly spend most of my time thinking about alignment or intent alignment that is i'm very concerned about a possible scenario where ai systems right basically as an artifact of the way they're trained most likely are just trying to do something that's very bad for humans for example ai systems are trying to cause the camera to show happy humans in the limit this result 
so this like really incentivizes behaviors like ensuring that you control the camera and you control like what pixels or like what light is going into the camera and if humans try and stop you from doing that then you don't really care about the welfare of the humans anyways i think the main thing i think about is that kind of scenario where somehow the training process leads to an ai system that's working at cross purposes to humanity so maybe i think of that as like half of the total risk in a transition to like in the sort of early days of shifting from humans doing the cognitive work as doing the cognitive work and then there's another half of difficulties where it's a little bit harder to say if they're posed by technical problems or by social i think for both of these it's very hard to say whether the doom is due to technical failure due to social failure due to whatever but there are a lot of other ways in which like if you think of like human society as kind of like the repository of what humans want the thing that will ultimately like go out into space and determine what happens with space there are lots of ways in which that could get messed up during a transition to ai and so you could imagine there will be ai will enable like significantly more competent like attempts to manipulate people more significantly higher quality rhetoric or argument than humans have traditionally been exposed to so to the extent that like the process of us collectively deciding what we want is sort of calibrated to the kind of arguments humans make then just like like most technologies ai has some way of changing like that process or some prospect of changing that process or ending up somewhere different yeah i think ai has an unusually large potential impact on that process um but it's not like different in kind from like the internet or phones or whatever i think for all of those things you might be like well i like you know i care about this thing like the humans we collectively care about this thing and like to the extent that we would care about different things if technology went differently in some sense like we probably don't just want to say like whatever way technology goes that's the that's the one we really wanted we might want to like look out over all the ways technology could go and say oh yeah this is the extent there's disagreement like this is actually the one we most endorse so i think there's like some concerns like that i think another related issue is like actually yeah there's like a lot of issues of that flavor i think most people tend to be significantly more concerned with the risk of everyone dying than the risk of like humanity surviving but going out into space and doing the wrong thing there are exceptions that people on the other side who are like man paul is too concerned with the risk of everyone dying and not enough concerned with the risk of doing weird stuff in space like way die really often argues for a lot of these risks and tries to prevent people from forgetting about them or not prioritizing them enough anyway i think a lot of the things i would list other alignment that like loom largest to me are in that second category of humanity survives but does something that in some alternative world we might have regarded as a mistake um i'm happy to talk about those but i don't know if they're actually quite we have in mind or what most listeners care about and i think there's another category that's like just ways that we go extinct where in some sense like ai is 
not the weapon of extinction or something but just plays a part in the story so like if ai contributes to the start of a war and then the war results or escalates to catastrophe yeah or if ai's sort of any catastrophic risk that faced humanity i think we might have mentioned this briefly before like technical problems around ai can have an effect on how well humanity handles that problem okay so it can have an effect on how well humanity responds to like some sudden change in its circumstances and like a failure to respond well may result in like this war escalating or like responding really poorly to social unrest or climate change or whatever yeah okay i guess i'll talk a little bit about intent alignment mostly because that's what i've prepared for the most um it's also what i spend almost all my time thinking about so i love talking about intent alignment all right great well i've got a good news backing up a little bit sometimes when eliezer you kowski talks about ai he talks about this task of like copy pasting a strawberry where like you have a strawberry and you have some you know system that has really good scanners and like maybe you can do nanotechnology stuff or whatever and like the goal is like you have a strawberry you want to like look at how all of its cells are arranged and you want to copy paste it so there's a second strawberry right next to it that is cellularly identical to the first strawberry or i might be getting some details of this wrong but that's roughly it and he's there's the contention that like we maybe don't know how to safely do the copy paste or strawberry task and i'm wondering when you say intent alignment do you mean like some sort of alignment with like my deep human psyche and like you know all the things that i really value in the world or do you intend that to also include things like i would today i would like this strawberry copy pasted can i get a machine that you know does that without having all sorts of like crazy weird side effects or something so definitely the definitions aren't crisp but i try and think in terms of like an ai system which is trying to do what paul wants do what paul wants is in quotes um so the ai system may not understand like all the intricacies of what paul desires and like how paul would want to reconcile conflicting intuitions it's just trying to make some reasonable like also what paul wants there's like a broad range of interpretations unclear what i'm even referring to with that but like i am mostly interested in an ai that's like broadly trying to understand what paul wants and help paul do that um rather than an ai which like understands really deeply like i'm not too concerned about whether my ai understands really deeply what i want because i mostly want an ai that's not sort of actively killing all humans or attempting to ensure humans are shoved over in the corner somewhere with no ability to influence the universe i'm like really concerned about cases where ai is working across purposes to humans in ways that are like very flagrant and so i think like it's fair to say that like taking some really mundane tasks like put your strawberry on a plate or whatever is a fine yeah as a fine example task and i think probably i'd be probably on the same page as eleazar there's definitely some ways we would talk about this differently i think we both agree that like having a really powerful ai which can sort of overkill the problem and do in any number of ways and getting it to just be like yeah the person wants like 
a strawberry, could you give them a strawberry — and getting it to actually give them a strawberry — captures, in some sense, the core of the problem. I'd say probably the biggest difference between us is that, in contrast with Eliezer, I'm really focused on saying I want my AI to do things as effectively as any other AI — I care a lot about this idea of being economically competitive, or broadly competitive, with other AI systems. For Eliezer that's a much less central concept. So the strawberry example is a weird one to think about from that perspective, because all the AIs are fine putting a strawberry on a plate — though maybe not copying a strawberry cell by cell; maybe that's a really hard thing to do. But I think we're probably on the same page.

Okay, so you were saying that you carve up research projects one could do and root causes of failure slightly differently — and was intent alignment a root cause of failure or a research problem?

Yeah, I think it's a root cause of failure.

Okay. So how would you carve up the research problems?

I spend most of my time thinking about divisions within intent alignment — what are the various problems that help with intent alignment — and I'd be happy to just focus on that. I can also try to comment on problems that seem helpful for other dimensions of potential doom. A distinction that's salient for me: there are lots of ways your AI could be better or more competent that would also help prevent doom. For example, you could imagine working on AI systems that cooperate effectively with other AI systems, or AI systems that can defuse certain kinds of conflict that could otherwise escalate dangerously, or AI systems that understand a lot about human psychology, et cetera. So you could slice up those kinds of technical problems — technical problems that improve the capability of AI in particular ways that reduce the risk of some of these dooms involving AI. That's what I mean by slicing up the research things you could do differently from the actual dooms.

I spend most of my time thinking about intent alignment and what you could work on there, and the sense in which I slice up research problems differently from sources of doom is that I mostly think about a particular approach to making AI intent aligned, and then about the building blocks of that approach. Different approaches have different sets of building blocks, and some occur over and over again — different versions of interpretability appear as a building block in many possible approaches. The carving up is kind of like a tree, or an OR of ANDs: there are different top-level ORs — several different paths to being okay — and then for each of them you'd say, well, for this one you have to do the following five things, or whatever. So there are two levels of carving up: one is between different approaches to achieving intent alignment, and then, within each approach, the different things that have to go right in order for that approach to help.
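A tiny rendering of that "OR of ANDs" structure, with made-up approach and sub-problem names (hypothetical placeholders, not a list from the episode): overall success requires some approach — an OR branch — for which all of its building blocks, the ANDs, are solved.

```python
# A toy rendering of the "OR of ANDs" carving described above: several candidate
# approaches (the ORs), each of which only helps if all of its building blocks
# (the ANDs) are solved. Approach and sub-problem names are hypothetical placeholders.

approaches = {
    "approach_A": {"good_training_objective": True, "interpretability_tools": False},
    "approach_B": {"good_training_objective": True, "handles_high_stakes": True},
}

def approach_works(blocks):
    # AND: every building block of this approach must be solved.
    return all(blocks.values())

# OR: it is enough for any one approach to fully work.
overall_ok = any(approach_works(blocks) for blocks in approaches.values())
print("some approach fully works:", overall_ok)  # True, via approach_B
```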
approximately this: the way people work is that they have beliefs and they have desires; beliefs can't motivate you, only desires can; and the way they produce action is by trying to do actions which, according to their beliefs, fulfill their desires. By talking about intent alignment, it seems like you're imagining something similar for AI systems, but it's not obviously true that that's how AI systems work. In reinforcement learning, for instance, one way of training systems is basically to search over neural networks for one that produces really good behaviour — and when you look at the result, it's just a bunch of numbers; it's not obvious it has this kind of belief–desire decomposition. So should I take it that you think that decomposition will exist? Or do you mean beliefs and desires in some behavioural sense? How should I understand that?

It's definitely a shorthand that probably won't apply super cleanly to the systems we build. I can say a little about the kinds of cases you mentioned, about what I mean more generally, and about why I think the shorthand is reasonable anyway. Maybe a first subtlety: there's a distinction between a system that's trying to do the right thing — that's a goal we'd like to achieve — and a more minimal goal, a system that's not trying to do something bad. You might think some systems can be said to have intentions, but you'd actually be fine with a system that has no intentions at all, whatever that means. I think that's pretty reasonable, and I'd certainly be happy with it; most of my research is actually focused on building systems that aren't trying to do the wrong thing.

That caveat aside, the basic reason we're interested in something like intention is that we look at the failures we're concerned about and, first, we believe it's possible to build systems that are trying to do the wrong thing. We're aware of algorithms like: search over actions, for each one predict its consequences, rank them according to some function of the consequences, and pick your favourite. We're aware of algorithms like that which can be said to have intention, and we can see how some algorithm like that — say, produced by stochastic gradient descent, or applied to a model produced by stochastic gradient descent — could lead to some really bad policies: systems that systematically and permanently disempower humans. So we see how there are algorithms with something like intention that could lead to really bad outcomes.
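A minimal sketch of the kind of algorithm being pointed at here — search over actions, predict the consequence of each, rank the consequences with some function, take the top-ranked action. The world model and scoring function are toy placeholders of mine, not anything from the episode.

```python
# Minimal sketch of the kind of algorithm described above: search over actions,
# predict the consequence of each, rank the consequences, pick the favourite.
# The world model and scoring function are toy placeholders.

from typing import Callable, Dict, Sequence

def plan(actions: Sequence[str],
         predict: Callable[[str], Dict[str, float]],
         score: Callable[[Dict[str, float]], float]) -> str:
    """Return the action whose predicted consequences score highest."""
    return max(actions, key=lambda a: score(predict(a)))

def predict(action: str) -> Dict[str, float]:
    # Toy "world model": how thirsty will I be after this action?
    return {"make_tea": {"thirst": 0.1}, "do_nothing": {"thirst": 0.9}}[action]

def score(outcome: Dict[str, float]) -> float:
    # Toy objective: prefer outcomes with low thirst.
    return -outcome["thirst"]

print(plan(["make_tea", "do_nothing"], predict, score))  # -> make_tea
```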
And conversely, when we look at how those bad outcomes could happen — if you imagine the robot army killing everyone — it's very much not that the robot army just randomly killed everyone. There has to be some force keeping the process on track towards the "killing everyone" endpoint, in order to get that highly specific sequence of actions. And the thing we want to point out is: whatever that force is, it's —

maybe I should say that I most often think about optimization as a sort of subjective property. I'll say that an object is optimized for some end — say I'm wondering about a bit that was output by this computer, and I'm wondering whether the bit is optimized to achieve human extinction. The way I'd operationalize that: I don't know whether the bit being zero or one is more likely to lead to human extinction, but I'd say the bit is optimized just when, if you told me the bit was one, I would believe it's more likely that the bit being one leads to human extinction. There's a correlation between my uncertainty about the consequences of the different bits that could be output, and my uncertainty about which bit will be output.

So whether it's optimized could potentially depend on your background knowledge?

That's right.

So different people could disagree — one person could think something is optimizing for A, and another could think it's optimizing for not-A. And not only could they think that, they could both be right, in a sense.

That's right — there's no fact of the matter beyond what each person thinks. From that perspective, optimization is mostly something we're talking about from our perspective as algorithm designers. When we're designing the algorithm, we're in some epistemic state, and the thing we'd like is that, from our epistemic state, there shouldn't be this optimization for doom — we shouldn't end up with these correlations where the algorithm we write is more likely to produce actions that lead to doom. Most of the time when we're designing an algorithm, we're retreating to some set of things we know and some kind of reasoning we're doing, and within that universe we want to eliminate this correlation, this possible bad correlation. This exposes tons of rough edges, which I'm certainly happy to talk about.
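One compressed way to write down this subjective notion of optimization, in notation of my own (an interpretation, not a formula from the conversation): let $X$ be the system's output and $D$ the bad outcome. Relative to my beliefs, $X$ is optimized for $D$ when the outputs I expect the system to produce are ones I consider more likely to lead to $D$ than a consequence-blind baseline would be:

$$\mathbb{E}_{x \sim P(X)}\!\left[\Pr(D \mid \text{output} = x)\right] \;>\; \mathbb{E}_{x \sim Q(X)}\!\left[\Pr(D \mid \text{output} = x)\right],$$

where $P$ is my distribution over what the system will actually output and $Q$ is some baseline distribution that carries no information about consequences. The inequality lives entirely inside one observer's epistemic state, which is why two observers with different background knowledge can disagree — and both be right, in the sense just discussed.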
I mean, one way — I guess it depends a bit on whether you're talking about correlation or mutual information or something — but on some of these definitions, one way you can reduce any dependence is if you know with certainty what the system is going to do. Or even if I don't know exactly what's going to happen, but I know it will be some sort of hell world — then there's no correlation, so it's not optimizing for doom, it sounds like.

The way I think about that is: I have my robot, and my robot is applying some torques; or my thing connected to the internet is sending some packets. We can be in the situation where it's optimizing for doom, and doom is certainly achieved, and I'm merely uncertain about which path leads to doom. I don't know what packets it's going to send, and I don't know which packets lead to doom — if I knew, as algorithm designer, which packets lead to doom, this would be the easy case: if the packets it's going to send lead to doom, no go. But I don't know which packets lead to doom and I don't know which packets it will output; I'm just pretty sure the ones it's going to output are more likely to be the doomy ones — or maybe I could even be sure they lead to doom. The situation I'm really terrified of, as a human, is the one where there's an algorithm with the following two properties: one, its outputs are especially likely to be economically valuable to me, for reasons I don't understand; and two, its outputs are especially likely to be doomy, for reasons I don't understand. If I'm a human in that situation, I have these outputs from my algorithm and I'm like, well, darn — I could use them or not use them. If I use them, I'm getting some doom; if I don't, I'm leaving some value on the table that my competitors could take, in the sense that they could run a better company.

Yeah, it's not great.

They could run a better company that has, each year, some probability of doom, and then the people who are willing to make that trade-off are the ones who end up actually steering the course of humanity — which they then steer to doom.

Okay, so in that case it sounds like maybe the Humean decomposition is: there's this correlation between how good the world is and what the system does; the direction of the correlation is something like the intent or the motivations of the system, and the strength of the correlation — how tightly you can infer — is something more like capabilities. Does that seem right?

I guess I'd say that on this Humean perspective there are kind of two steps, both of which, to me, are about optimization. One is that we say the system has accurate beliefs, by which we're talking about a certain correlation — to me this is also a subjective condition: I say the system believes X, or correctly believes X, to the extent that there's a correlation between the actual state of affairs and some representation it has. So that's one step, and then there's a second step, where there's a correlation between which action it selects and its beliefs about the consequences of the action.

In some sense, though, I want to be a little more general than the framework you might use for thinking about humans. In the context of an AI system, there are traditionally a lot of places where optimization is being applied. You're doing stochastic gradient descent, which is itself significant optimization over the weights of your neural network; but then those optimized weights will tend to themselves do optimization, because some weights do, and those are the weights you've optimized towards; and then you're often combining that with explicit search — after you've trained your model, you'll often use it as part of some search process. So there are a lot of places optimization comes into this process.
um but that could be happening like if i imagine an ai i don't really imagine it having like a coherent set of beliefs right i imagine it being this neural network that has like there are tons of parts of the neural network that could be understood as beliefs about something and tons of parts of the neural network that could be understood as optimizing so it'd be like this very fragmented crazy mind um probably human minds are also like this they don't really have coherent beliefs and coherent um desires but it's kind of like we're going to stamp out all the desires that are not like helping humans get what they want or at least all the desires that involve killing all the humans at a minimum so okay now that i sort of understand um intent alignment uh sometimes people divide this up into like outer and inner versions of intent alignment um sometimes people talk about like various types of robustness that um properties could have or that systems could have i'm wondering like do you have a favorite of these like further decompositions or do you not think about it that way as much i mentioned before this like or of ands where there's like lots of different paths you could go down and then within each path there'll be lots of breakdowns of what technical problems need to be resolved i guess i think of outer and inner alignment as like for several of the leaves in this or of ands or several of the branches in this or of ands several of the possible approaches you can talk about like oh these things are needed to achieve outer alignment and these things are needed to achieve inner alignment and with their powers combined we'll achieve a good outcome often you can't talk about such a decomposition like in general i don't think you can like look at a system and be like oh yeah that part's outer alignment and that part's inner alignment like the times when you can talk about it most or like the way i use that language most often is for a particular kind of alignment strategy where it's like a two-step plan step one is like develop an objective that captures what humans want well enough to be getting on with and there's like some it's gonna be something more specific but some you have an objective that captures what humans want in some sense ideally you would exactly capture what humans want so like you look at the behavior of a system you're just exactly evaluating like how good for humans is it to deploy a system with that behavior or something so you have that as step one and then that step would be outer alignment and then step two is like given that we have an objective that captures what humans want let's build a system that has like internalized that objective in some sense or like is not doing any optimization beyond pursuit of that objective and so in particular the objective is an objective that you might want the system to adopt rather than an objective over systems uh yeah i mean we're sort of equivocating in this way that like reveals problematicness or something like the first objective is like it's like an objective it is a ranking over systems or it's like some reward like tells us how good is a behavior and then we're hoping that the system then like adopts that same thing or some reflection of that thing that was like a ranking over policies and then we just get the like obvious analog of that or whatever over actions and so you think of these as like different problems or like different sub
problems to a sort of whole thing of intent alignment rather than like you know objectively like oh this system has an outer alignment problem but the inner alignment's great or something yeah that's right i think this makes sense on some approaches and not on other approaches i am most often thinking of it as there's some set of problems that kind of seem necessary for outer alignment i don't really believe that it's like the problems are going to split into like these are the outer alignment problems and these are the inner alignment problems i think of it more as like the outer alignment problems or the things that are sort of obviously necessary for outer alignment are more likely to be like useful stepping stones or like warm-up problem or something okay i suspect in the end it's not like we have our piece that does outer alignment or a piece that does inner alignment then we put them together um i think it's more like there were a lot of problems we had to solve and then when you look at the set of problems it's kind of like unclear how you would attribute responsibility or what or like there's no part that solving outer versus inner alignment but it was still like useful there were like a set of sub-problems that were pretty useful to have solved kind of just like the outer alignment thing here is acting as like an easy special case to start with or something like that it's not technically a special case there's actually something worth saying there probably which is like it's easier to work on a special case than to work on some vaguely defined like here's kind of a thing that would be nice sure so i do most often like when i'm thinking about my research when i want to like focus on sub problems that are like to specialize on like the outer alignment part which i'm doing more in this warm-up problem perspective i think of it in terms of high stakes versus low stakes decisions so in particular if you've solved what we're describing as outer alignment if you have a reward function that captures what humans care about well enough and if the individual decisions made by your system are sufficiently low stakes then it seems like you can get a good outcome just by doing sort of online learning that doesn't constantly retrain your system as it acts and like it can do bad things for a while as it moves out of distribution but eventually you'll fold that data back into the training process yeah and so if you had a good reward function and the stakes are low then you can kind of get a good outcome so that's when i say that i like think about outer alignment as a sub problem i mostly mean like i kind of ignore the problem of high stakes decisions or like fast acting catastrophes and just focus on the difficulties that arise even when uh every individual decision is very low stakes sure so that actually brings up another style of decomposition that uh some people use or some people prefer which is like sort of a distributional question so so there's one way of thinking about it where like outer alignment is like you know i pick a good objective and inner alignment is like hope that the system assumes that objective another distinction people sometimes make is like okay firstly like we'll have some training we'll have some like set of situations that we're gonna develop our ai to behave well in and like step one is like making sure our ai does the right thing in that test distribution uh which is i guess supposed to be like kind of similar to outer alignment like you train a thing that's sort 
of supposed to roughly do what you want then there's this question of like does it generalize you know you know in a different distribution firstly does it behave competently and then does it continue pursuing does it continue to reliably achieve the stuff that you wanted and that's supposed to be more like inner alignment because like if the system had to really internalize the objective then it would you know supposedly continue pursuing it in later places and it's kind of there are some distinctions between that and especially the frame where like alignment is supposed to be about like are you representing this like objective in your head or something and i'm wondering how how do you think about those the the differences between those frames or whether you view them as like basically the same thing i think i don't view them as the same thing i think i think of those two splits and then a third split i alluded to briefly of like sort of high stakes failures versus like avoiding very fast catastrophes versus average case performance okay i think i think of those three splits is just like all roughly agreeing there will be some approaches where one of those splits is like a literal split of the problems you have to solve we're like literally factors into doing one of those and then doing the other i think that the actual the exact thing you stated is a thing people often talk about but i don't think really works even as kind of a conceptual split quite where the main problem is just like if you train a system to do well in some distribution there's kind of two big limitations you get one is like two related big limitations you get one is that doesn't work off distribution and the other is just that like you only have like an average case property over that distribution so it seems like in the real world is actually possible it looks like it's almost certainly going to be possible for deployed ai systems to fail quickly enough that the actual harm done by individual bad decisions is much too large to bound with an average case guarantee so you can imagine like the system which appears to work well on distribution but actually with like one in every quadrillion decisions it just decides like now it's time to start killing all the humans and that system is quite bad and i think that like in practice like probably it's better to lump that problem in with distributional shift which kind of makes sense and maybe people even mean to include that it's a little bit unclear exactly what they have in mind but just like distributional shift is kind of just changing the probabilities of outcomes and like the concern is really just like things that were improbable under your original distribution and you could have a problem either because you're in a new distribution where those things go from being very rare to being common yeah or you could have a problem just because they were like relatively rare so you just didn't encounter any during training but they'll still if you keep sampling even on distribution eventually one of those will get you yeah and cause trouble like maybe they were literally zero in the yeah the data set you drew but not in the distribution probability the quote-unquote probability distribution that you drew your data set from yeah so i guess maybe that is fair i like really naturally reach for like the underlying problem distribution but i think like out of distribution in some sense like is most likely to be like our actual split of the problem if we mean the empirical 
distribution over like the actual episodes at hand um anyway i think of all three of those decompositions then that was like a random caveat in the out of distribution one sure i think of all those like kind of related breakdowns my guess is that like the right way of going doesn't actually respect any of those breakdowns and like doesn't have a set of techniques that solve one versus the other but i think it is very often helpful like it's just generally when doing research helpful to specialize on a sub problem i think often like one branch or the other of one of those splits is a helpful way to think about like the specialization you want to do during your cur a particular research project the splits i most often use are this like low stakes one where you can train online and individual decisions are not catastrophic and the other arm of that split is this like suppose you have the ability to detect a catastrophe if one occurs or you sort of trust your ability to assess the utility of actions and now you want to build a system which doesn't do anything catastrophic even when deployed in the real world on a potentially different distribution encountering potentially rare failures that's the split i most often use i think none of these are likely to be respected by the actual like list of techniques that together address the problem but often one half or the other is like a useful way to help zoom in on what assumptions you want to make during a particular research project and and why do you prefer that split i think most of all because it's fairly clear what the problem statement is so the problem statement there is just a feature of the thing outside of your algorithm like you're writing some algorithm and then your problem statement is like here is a fact about the domain in which you're going to apply the algorithm the fact is that like it's impossible to mess things up super fast okay and it's nice to have a problem statement which is entirely external to the algorithm like if you want to just say like here's the assumption we're making now we want to solve that problem it's great to have an assumption on the environment be your assumption there's some risk if you say like oh our assumption is going to be that the agent's going to like internalize whatever objective we use to train it um the definition of that assumption is like stated in terms of it's like it's kind of like helping yourself to some sort of magical ingredient and like if you're optimized for solving that problem you're going to like push into a part of the space where that magical ingredient was doing like a really large part of the work um which i think is like a much more dangerous dynamic than like if the assumption is just on the environment in some sense you're limited in how much of that you can do you have to solve the remaining part of the problem you didn't assume away and i'm really scared of subproblems which just assume that some part of the algorithm will work well because i think you often just end up like pushing an inordinate amount of the difficulty into that step okay another question that i want to ask about these sorts of decompositions of problems is i think most of the world i think most of the i guess the intellectual tradition that's sort of spawned off of like nick bostrom and elias rydkowski uses like some an approach kind of like this maybe with an emphasis on like learning things that like people want to do that's like particularly prominent at the research group i work at there's also i 
think some subset of people are largely i think concentrated the machine intelligence research institute that are interested in sort of a more like like they have this sense of like oh we just like don't understand the basics of ai well enough and we need to like really think about decision theory and we really need to think about what it means to be an agent and like then you know once we understand this kind of stuff better then like maybe it'll be easier to solve those problems is something they might say i'm wondering like yeah what do you think about this yeah this approach to research where you're just like okay let's like figure out these basic problems and like try and get a good formalism that we can work from from there on i think yeah this is mostly a methodological question probably rather than a question about the situation with respect to ai although it's not totally clear there may be like differences in belief about ai that are doing the real work but methodologically i'm very drawn like suppose you want to understand better like what is optimization or you have some like a very high level question like that yeah like what is like bounded rationality i am very drawn to an approach where you say like okay we think that's going to be important down the line we think at some point as we're trying to solve alignment we're going to like really be hurting for want of an understanding of like bounded rationality i really want to just be like let's just go until we get to that point until we like really see like what problem we wanted to solve and like where it was that we were like reaching for this notion of bounded rationality we didn't have and then at that point we'll have some like more precise specification of like what we actually want out of this theory of bounded rationality okay and like i think that is the moment to be trying to dig into those concepts more i think it's scary to try and go the other way i think it's not totally crazy at all and they're like reasons that you might prefer it i think the basic reason it's scary is that there's probably not a complete theory of everything for many of these questions like there's a bunch of questions you could ask and a bunch of answers you get that would improve your understanding we don't have a statement of like what it is we actually seek and like it's just a lot harder to research when you're like i want to understand in some domains this is the right way to go and like that's part of why it might come down to facts about ai whether it's like the appropriate methodology in this domain but like i think it's tough to be like i don't really know what i want to know about this thing i'm just kind of interested in what's up with optimization and then researching optimization relative to being like oh here is a fairly concrete question that i would like to be able a really concrete task i'd like to be able to address and which i think like is going to come down to my understanding of optimization i think that's just like an easier way to better understand what's up with optimization yeah so at these moments where you realize you need a better theory or whatever yeah are you imagining them looking like oh i here's this technical problem that i want to solve and i don't know how to and it reminds me of optimization or or what does the moment look like when you're like uh now is the time i think the way the whole process most often looks is you have some problem like you're like here i guess the way my research is organized is 
very much like here is the kind of thing our ai could learn for which it's not clear how our aligned ai learns something that's like equally useful i'm like thinking about one of these cases and digging into it and i'm like here's what i want here's why i think this problem is solvable here's what i think the aligned ai should be doing okay and i'm like thinking about that and i'm like oh i don't know how to actually write down the algorithm that would lead to the aligned ai doing this thing and like walking down this path and i'm like here's a piece of what it should be doing and here's a piece of how the algorithm should look and then at some point you step back and you're like oh wow it really looks like what i'm trying to do here is like algorithmically test for one thing being optimized for another or whatever and that's a particularly doomy sounding example but like maybe i have some question like that or i'm like wondering like what is it that leads to like the conditional independencies in the human reports in this domain like i really need to understand that better and like i think it's most often for me not then like okay now let's go understand that question now that it's come up it's most often like let us like flag and try and like import everything that we know about like that area like i'm now asking a question that feels similar to questions people have asked before so i want to make sure i understand what everyone has said about that area this is a good time to like read up on everything that looks like it's likely to be relevant the reading up is cheap to do in advance so you should be trigger happy with that one but then like there's no actual pivot into like thinking about the nature of optimization it's just like continuing to work on this problem i'm like expecting like that's kind of how that may end up like some of those lemmas may end up feeling like statements about optimization but there was no step where like now it's time to think about optimization just let us keep trying to design this algorithm and then see like what concepts fall out of that and you mentioned that there were some domains where like actually thinking about the fundamentals early on was the right thing to do which domains are you thinking of and what do you see as the big differences between those and kind of ai alignment yeah so i don't know that much about the intellectual history of almost any fields the field i'm most familiar with by far is computer science i think in computer science especially like so my training is in theoretical computer science and then i spent a bunch of time working in machine learning and deep learning um i think the like problem first perspective just generally seems pretty good and i think to the extent that like let's understand x has been important it's often at the like problem selection stage rather than like now we're going to research x in an open-ended way it's like oh problem x seems important or like x seems interesting and this problem seems to shed some light on x so now that's like a reason to work on this problem like that's a reason to you know try and predict this kind of sequence with ml or whatever it's a reason to try and write an algorithm to answer this kind of question about graphs so i think in those domains it's not that often the case you just want to like start off and like have some big picture question and then think about it abstractly my guess would be that in domains where like more of the game is
like walking up to nature and like looking at things and seeing what you see it's like a little bit different it's not as driven as much but like you're coming up with an algorithm and like running into constraints and designing an algorithm i don't really know that much about history of science though so i'm just guessing that that might be a good domain or a good approach sometimes all right so we've talked a little bit about the way you might decompose inner alignment into problems or you know the space of like dealing with existential risk into problems one of which is inner alignment i'd like to talk now a little bit about i guess on a high level about your work on the solutions to these problems and you know other work that people have put out there so first thing i want to ask is yeah as i mentioned i'm in a research group and a lot of what we do is think about you know how a machine learning system could learn some kind of objective you know from human data so perhaps it's like the human wants there's some human who like has some desires and they act a certain way because of those desires and we use that to do some kind of inference so that you know this might look like inverse reinforcement learning um a simple version of it might look like imitation learning and i'm wondering what you think of these approaches for things that look more like outer alignment more like trying to specify what a good objective is so broadly i think there are two kinds of goals you could be trying to serve with work like that or like for me there's this really important distinction as we try and like incorporate knowledge that a human lacks like a human demonstrator or a human operator lacks so like the game changes as you move from like the regime where you could have applied imitation learning in principle because the operator could demonstrate how to do the task to the domain where the operator doesn't understand how to do the task and they definitely aren't using imitation learning and so from my perspective like one thing you could be trying to do with techniques like this is work well like in that imitation learning regime like in the regime where you could have imitated the operator can you find something that works even better than imitating the operator and i am pretty interested in that and i think that imitating the operator is not actually that good a strategy even if the operator is able to do the task in general so i have worked some on reinforcement learning from human feedback in this regime so imagine there's a task a human understands what makes performance good or bad just have the human evaluate individual trajectories learn to predict those human evaluations and then optimize that with rl i think the reason i'm interested in that technique in particular is i think of it as like sort of the most basic thing you can do or that like most makes clear exactly what the underlying assumption is that is needed for the mechanism to work namely you need the operator to be able to identify which of two possible executions of a behavior is better anyway there's then this further thing and like i don't think that that approach is the best approach like i think you can do better than asking the human operator which of these two is better i think it's pretty plausible that basically past there you're just talking about data efficiency like how much human time do you need and so on um and how easy is it for the human rather than like a fundamental conceptual change i'm not that confident of that 
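As a concrete anchor for the reinforcement learning from human feedback setup described above, a minimal sketch of the "have the human say which of two trajectories is better, fit a reward model to those comparisons, then optimize it with RL" recipe might look like the following. This is the editor's illustration, not code from the episode; the class and function names and the fixed-size trajectory featurization are assumptions made for brevity.

```python
# Minimal sketch of learning a reward model from pairwise human preferences.
# The only assumption it leans on is the one named in the conversation: the
# operator can say which of two trajectory executions is better.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, trajectory_features: torch.Tensor) -> torch.Tensor:
        # Score a batch of trajectories (summarised as fixed-size feature vectors).
        return self.net(trajectory_features).squeeze(-1)

def preference_loss(reward_model, traj_a, traj_b, human_prefers_a):
    """Bradley-Terry style loss: model the probability the human prefers A as
    sigmoid(r(A) - r(B)), and fit the reward model to the human's comparisons."""
    logits = reward_model(traj_a) - reward_model(traj_b)
    return nn.functional.binary_cross_entropy_with_logits(
        logits, human_prefers_a.float()
    )

# Training alternates between (1) collecting human comparisons on pairs of
# trajectories, (2) fitting the reward model with `preference_loss`, and
# (3) running any standard RL algorithm against the learned reward.
```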
there's a second thing you could want to do where you're like now let's move into the regime where you can't ask the human which of these two things is better because in fact one of the things the human wants to learn about is like which of these two behaviors is better the human doesn't know they're hoping ai will help them understand actually what's the situation in which like we might want that to happen might want to move beyond the human knowing yeah so suppose we want to get to this world where we're not worried about ai systems trying to kill everyone um and you know we can like use our ai systems to like you know help us with that problem maybe like can we somehow get to some kind of world where we're not going to build really smart ai systems that want to like destroy all value in the universe uh without solving these kinds of problems where we can't even have where it's difficult for us to evaluate which solutions are right yeah i think it's very unclear i think eventually it's clear that ai needs to be doing these tasks that are very hard for humans to evaluate answer is right but it's very unclear like how far off that is that as you might first live in a world where it has had a crazy transformative impact before ai systems are regularly doing things that are like also there's different degrees of beyond human's ability to understand what the eye is doing so i think that's a big open question but in terms of the kinds of domains where you would want to do this part of the reasons there's generally this trade-off between over what horizon you evaluate behavior or like kind of how much do you rely on hindsight and how much do you rely on foresight or the human understanding which behavior will be good yeah so the more you want to rely on foresight the more plausible it is that the human doesn't understand well enough to do the operation so for example if i imagine i'm just like my ai is sending an email for me one regime is the regime where like it's basically gonna send the email that i like most like i'm gonna be evaluating either actually or like it's gonna be predicting what i would say how good is this email and it's gonna be sending the email for which paul would be like that was truly the greatest email the second regime where like i send the email and then my friend replies and i look at the whole email thread the results and i'm like wow that email seemed like it got my thread fen to like me i guess that was a better email and there's like an even more extreme one where like then i look back on my relationship with my friend in three years and i'm like given all the decisions that say i made for me over three years like how much did they contribute to like building a really lasting friendship or whatever and it's just like i think if you're going to the really short horizon where i'm just evaluating an email it's very easy to get to the regime where i think ai can be a lot better than humans at that question i'm just like it's very easy for me to be empirical facts but like what kind of email gets a response or like what kind of email will be easily understood by the person i'm talking to when an ad that has like sent 100 billion emails will just like potentially have a big advantage over me as a human and then as you push out to longer horizons it gets easier for me to evaluate like it's easier for a human to be like okay that person says they understood i can evaluate the email in light of like the person's response as well as an ai could but as you move out to those 
longer verizons then you start to get scared about like whether that evaluation like that evaluation becomes scarier to do there starts to be more room for manipulation of the metrics that i use so i'm sorry i'm saying all that to say there's this general when we ask like rai systems needing to do things that humans couldn't evaluate like which of two behaviors is better it depends a lot how long we make the behaviors and how much hindsight we give the human evaluators okay and in general that's like part of the tension or part of the game we can make the thing clear by just talking about like really long horizon behavior so if i'm like we're gonna write an infrastructure bill and i'm like ai can you write an infrastructure bill for me it's kind of like uh it's very very hard for me to understand which of two bills is better and there is the thing where like again in the long game you do want ai systems helping us as a society make that kind of decision much better than we would if it was just up to like humans who look at the bill or even a thousand humans looking at the bill it's not clear how late how early you need to do that i am particularly interested in that kind of i'm interested in like all the things humans do to keep society on track like all of the things we do to manage risks from emerging technologies all the things we do to cooperate with each other etc and i think a lot of those do involve like more are more interested in ai because it may help us make those decisions better rather than make them faster and i think in cases where you sort of want something that's more like wisdom it's more likely that the value added if ai is to add value it will be in ways that humans couldn't easily evaluate yes so we were saying like um we're talking about like imitation learning or you know inverse reinforcement learning so looking at somebody do a bunch of stuff then trying to infer what they were trying to do um we were talking about those uh solutions to outer alignment and you were saying yeah it works well for things where you can uh evaluate what's going to happen but for things that can't and i think i caught you off around there yeah i think that's an interesting i think you could have pursued this research either trying to improve the imitation learning setting be like look imitation learning actually wasn't the best thing to do even when we were able to demonstrate i think that's like one interesting thing to do which is the context where i've most often thought about this kind of thing a second context is where you want you want to move into this regime where a human can't say which thing is better or worse again imagine like you've written some bill and we're like how are we going to build an ai system that like writes good legislation for us in some sense like actually the media the problem was not writing of the legislation it was telling us which legislation was like helping predict which legislation is actually good we can sort of divide the problem into those two pieces one is like an optimization problem and one is a like prediction problem and for the prediction component that's where really like it's unclear how you go beyond human ability it's very easy to go beyond human ability and optimization problems just dumb more compute into optimizing i think you can still try and apply things like inverse reinforcement learning though like you can be like humans wrote a bunch of bills those bills were like imperfect attempts to optimize something about the world you can 
try and back out from like looking at all the not only those bills but all the stories people write all the words they say blah blah blah we can try and back out like what it is they really wanted and then like give them a prediction like how well will this bill achieve what you really wanted and i think like that is particularly interesting like in some sense that is from a long-term safety perspective to me more interesting than the case where a human operator could have understood the consequences of the ai's proposals but i am also very scared of like the like i don't think we currently have really credible proposals for inverse reinforcement learning working well in that regime what's the difficulty of that so i think the hardest part is i look at some human behaviors and the thing i need to do is disentangle like which aspects of human behavior are limitations of the human which are like things that human wishes about themselves they could change yeah um and which are reflections of what they value and in some sense like right in the imitation learning regime we just get to say whatever we don't care we're getting the whole thing yeah if the humans make bad predictions we get bad predictions in the inverse reinforcement learning case we need to look at a human who's saying like yeah we need to look at a human who's saying these things about what they want over the long term or what they think will happen over the long term we need to decide which of them are errors and that work gets done like there's no data that really pulls that apart cleanly so it comes down to either facts about the prior or like modeling assumptions and so then the work comes down to how much we trust those modeling assumptions in what domains and i think my basic current take is like the game seems pretty rough or like we don't have a great menu of them available right now i would summarize the best thing we can do right now as like basically in this prediction setting amounting to train ai systems to make predictions about all the like things you can easily measure train ai systems to make judgments in light of ai systems predictions about what they could easily measure or maybe judgments in hindsight and then predict those judgments in hindsight or judgments with the like other maybe the prototypical example of this is train an ai system to like predict a video of the future then if humans look at the video of the future and decide which outcome they like most i think the most basic the reason to be scared of like the most developed form of this the reason i'm scared of the most developed form of this is like we are in the situation now where ai really wants to push on this like a video of the future that's going to get shown to the human and distinguishing between like the video of the future that gets shown to human and like what's actually happening in the world seems very hard i guess that's sort of in some sense the part of the problem i most often think about right so either looking forward to a future where it's very hard for human to make heads or tales of what's happening or a future where a human believes they can make heads or tails heads and tails of what's happening but um they're mistaken about that for example again we might think a thing we want our eyes to help us do is like keep the world sane and make everything make sense in the world so like we would prefer if our ai shows us several videos the future and nine of them are incomprehensible and one of them makes perfect sense we're like 
great give me the future that makes perfect sense and the concern is just like do we get there by having an ai which is instead of making the world make sense is messing with our ability to understand what's happening in the world so we just like see the kind of thing we wanted to see or expected to see and that's kind of what i expect the extent that like we're in an outer alignment failure scenario that's kind of what i expect failures to ultimately look like okay yeah so so in the realm of things roughly like outer alignment or you know sort of alignment you know dealing with low stakes repeatable problems or something um what what kind of solutions are you most interested in from a research perspective i don't have a very short answer to this question so i guess you'll get a kind of long answer to this question that in itself is interesting yeah and maybe there's also two kinds of answers i can give one is like the thing that i'm most animated by that like i am working on myself another is like a broader like here kind of the things people do in the world that i'm particularly excited by amongst existing research directions maybe my default will be to like go through some of the things people do in the world that i'm excited by and then turn to the thing i'm most animated by but i'd be happy to do the other order if that seems better uh let's try the first order i guess one thing that seems like it comes up constantly as a useful building block or like an essential ingredient in many possible plans and seems tractable to work on but seems really hard is interpretability so we're very frequently in a situation where we've trained some very large neural network we know that it's able to make good predictions in some domain um and we're not really able to understand like what it knows about that domain um sometimes we're able to like play some clever game and say something about why it's making the prediction it's making or what kind of thing it knows about or doesn't know about but for the most part our methods there are very similar to just like doing some kind of behavioral analysis where we're like oh if you change this part of the input it gets it wrong so apparently that's what it's paying attention to i think there's some hope for techniques that are like more mechanically looking at what computation is performed by the model and then somehow understanding something about what it has learned so that you can better understand whether like predictions it's making are reasonable etc so i guess that's just something i'm quite interested in um to the extent that we're able to make headway on it okay and how does that help in these like outer alignment type settings yeah so i think the biggest thing is that like imagine your model again which is predicting videos from the future and you'd like to distinguish the case where actually everything in the future is great versus the case where actually the future is terrible but like there's like a nice little village set up in front of the camera we're concerned about models which like sort of are deliberately obfuscating what's happening on camera that is ais which are deliberately planning to put up the nice little village they're building the houses they're ensuring the camera doesn't go out of the village whatever yeah this is a very crude metaphor but the yeah which is deliberately doing that which is like choosing actions from this tiny space of actions to engineer this very specific outcome in some sense like somewhere deep in its heart 
it understands like a lot of what's happening in the world it understands like that if the camera turned just this way it would see something objectionable so don't let it do that um and so it feels like if you have in some sense it doesn't even feel like that much to ask your interpretability tools to be able to reach inside and be like oh okay now if we look at what it's thinking we see that clearly there's this disconnect between what's happening in the world and what's reported to the human i don't think there are that many credible approaches for that kind of problem other than some kind of headway on interpretability so yeah i guess that's my sort of story about how it helps i think there's several there's many possible stories about how it helps um that's the one i'm personally most interested in all right so that's one approach that you like i mean i think in terms of what research people might do i'm just generally very interested in taking a task that is challenging for humans in some way and to trying to train ai systems to do that task and seeing what works well seeing how we can help humans push beyond their naive like ability to evaluate or like their native ability to evaluate proposals from an ai and tasks can be hard for humans in lots of ways like you can imagine having like lay humans evaluating sort of expert human answers to questions and saying how can we like build an ai that helps expose like this kind of expertise to a lay human like the interesting thing is the case where you don't have any trusted humans who have that expertise where like we as a species are looking at our ai systems and they have expertise that no humans have right and we can try and sort of study that today by saying like imagine a case where the humans who are training the eye system lacks some expertise that other humans have that gives us like a nice little warm-up environment in some sense um like we can say like you can have the experts come in and say how well did you do how well like you have gold gold standard answers unlike in the final case there's other ways tasks can be hard for humans you can also consider tasks that are like computationally demanding or involve like lots of input data um tasks that are sort of where human abilities are artificially restricted in some way like you could imagine like people who can't see or training an image net model to like tell them about scenes and natural language okay and like again the model is like there are no humans who can see that like you could ask like can we study this in some domain or the analogy would be that there's no humans who can see anyway so i think a whole class of problems there and then there's like a broader distribution or what techniques you would use for attacking those problems i am very interested in techniques where ai systems are helping humans do the evaluation so kind of imagine this like gradual inductive process where like as your ai gets better they help the humans answer harder and harder questions which provides training data to allow the airs to get ever better i'm pretty interested in those kinds of approaches which like yeah they're a bunch of different versions or a bunch of different things along those lines there was a second category so interpretability we had this like using ai's to help train ais yeah um there was also what you were working on i don't know if yeah the last category i'd give was just like i think even again in this sort of more like imitation learning regime or in the regime where 
humans can tell what is good like doing things effectively like learning from small amounts of data learning policies that are like just higher quality that also seems valuable i am more optimistic about that problem getting easier as ai systems improve which is the main reason i'd be like less i'm less scared of our failure to solve that problem than failure to solve the other two problems and then maybe the fourth category is just like i do think there's a lot of room for sitting around and thinking about things i mean i'll describe what i'm working on which is a particular flavor of sitting around and thinking about things sure um but there's lots of flavors of sitting around and thinking about like how would we address alignment that i'm pretty interested in all right on to the stuff that i'm thinking about what's good so i'd say at a really high level and my attempts to summarize my current high level hope slash plans less whatever we're concerned about the case where we learn sgd or stochastic gradient descent finds some ai system that sort of embodies useful knowledge about the world or about how to think or useful heuristics for thinking or whatever and also uses it in order to achieve some end like it has beliefs and then it selects the action that it expects will lead to a certain kind of consequence at a really high level what we'd like to do is we'd like to instead of learning like a package which potentially couples that knowledge about the world with some like intention that we don't like we like to just throw out the intention and learn just like the interesting knowledge about the world and then we can if we desire like point that in the direction of like actually helping humans get what they want at a high level the thing i'm spending my time on is like going through examples of the kinds of things that i think gradient descent might learn for which it's very hard to do that decoupling and then for each of them saying okay what is our best hope or like how could we modify gradient descent so that it could learn the like decoupled version of this thing and i think yeah we sort of organized around examples of like cases where that seems challenging and what the problems seem to be there like right now the particular instance that i'm thinking about most and have been for like the last three months six months um is the case where you learn either facts about the world or a model of the world which are defined not in terms of like human abstractions but some different set of abstractions okay that's a very simple example it's fairly unrealistic you might imagine humans thinking about the world in terms of like people and cats and dogs and you might imagine a model which instead thinks about the world in terms of like atoms bouncing around so the concerning case is when we have this kind of mismatch between the way your beliefs or your simulation or whatever of the world operates and the way that human preferences are defined such that it is then easy to take this model and use it to say plan for goals that are defined in terms of concepts that are natural to it but much harder to use it to plan in terms of concepts that are natural to humans so i can like have my model of atoms bouncing around and i can say great search over actions and find the action that results in like the fewest atoms in this room and it's like great and then i can just enumerate a bunch of actions and find the one that results in minimal abs if i'm like search for one where the humans are happy it's like 
i'm sorry i don't know what you mean about humans or happiness and like this is kind of a subtle case to talk about because actually that system can totally carry on conversations about humans or happiness that is like at the end of the day there's these observations we can train our systems to make predictions of like what are the actual bits that are going to be output by this camera yeah and so it can predict like human faces like walking around and humans saying words it can predict humans talking about all the concepts they care about and it can predict pictures of cats and it can predict a human saying like yep that's a cat and the concern is more that basically you have your system which thinks natively in terms of like atoms bouncing around or some other abstractions and when you ask it to talk about cats or people instead of getting it talking about actual cats or people you get it talking about like when a human would say there is a cat or a person and then if you optimize like i would like a situation where all the humans are happy what you instead get is like a situation where there are happy humans on camera and so you end up back in the same kind of concern you could have had of like your ai system optimizing to mess with your ability to perceive the world rather than actually making the world good so um when you say that you would like this kind of decoupling so that's a case where it's hard to do the decoupling what's like a good example of like yeah here we like decoupled the motivation from the beliefs and now i can you know insert my favorite motivation and press go or whatever what does that look like so i think a central example for me or maybe yeah an example i like would be a system which has some beliefs about the world like represented in a language you're familiar with they don't even have to be represented that way natively consider an ai system which learned a bunch of facts about the world learned some like procedure for deriving new facts from old facts and learned how to convert whatever it observed into some facts like it learned some maybe opaque model that just converts what it observed into facts about the world it then combines them with some of the facts that are baked into it by gradient descent and then it turns the crank on these inference rules to derive a bunch of new facts and then at the end having derived a bunch of facts it just tries to find an action such that it's a fact that that action leads to the reward button being pushed or whatever this is like a way you could imagine an ai working and it's a very unrealistic way for an ai to work um just as basically every example we can describe in a small number of words is a very unrealistic way for a deep neural network to work once i have that model i could hope to instead of having a system which turns the crank derives a bunch of facts and then looks up a particular kind of fact and then uses it to take an action instead it starts from the statements turns the crank and then just answers questions or basically directly reports like translates the statements in its internal language into natural language if i had that then instead of searching over the action that leads to the reward button being pressed i can search over a bunch of actions for each of them like look at the beliefs it outputs in order to assess how good the world is and then search for one where the world is good according to humans and so the key dynamic is like how do i expose all this turning the crank on facts how do i expose the facts it produces to humans in a form that's usable for humans
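The hypothetical fact-deriving system described here can be made concrete with a toy sketch. The code below is the editor's illustration, not anything from the episode, and the conversation itself stresses that a real neural network does not work this way; the point is only to show the decoupling: the same "turn the crank" deduction machinery can either be coupled to an action-selection rule that looks for "this action leads to the reward button being pushed", or used to report derived facts so humans can evaluate the consequences themselves.

```python
# Toy rendering of the "facts + inference rules" system described above. It is
# deliberately unrealistic; the point is that the shared deduction machinery can
# be decoupled from how it is finally used.

def turn_the_crank(facts, rules, steps=5):
    """Forward-chain: repeatedly apply inference rules to derive new facts."""
    facts = set(facts)
    for _ in range(steps):
        for premise, conclusion in rules:
            if premise <= facts:
                facts.add(conclusion)
    return facts

BASE_FACTS = {"observed:button_wiring", "observed:room_layout"}
RULES = [
    ({"observed:button_wiring", "do:press_lever"}, "outcome:reward_button_pushed"),
    ({"observed:room_layout", "do:open_door"}, "outcome:humans_can_see_room"),
]

# Coupled use: search for the action whose derived consequences include reward.
def pick_reward_action(actions):
    for a in actions:
        derived = turn_the_crank(BASE_FACTS | {f"do:{a}"}, RULES)
        if "outcome:reward_button_pushed" in derived:
            return a

# Decoupled use: expose the derived facts so a human (or human-plus-AI process)
# can judge which action actually leads to a world they endorse.
def report_consequences(action):
    derived = turn_the_crank(BASE_FACTS | {f"do:{action}"}, RULES)
    return sorted(f for f in derived if f.startswith("outcome:"))

print(pick_reward_action(["open_door", "press_lever"]))  # press_lever
print(report_consequences("open_door"))                  # ['outcome:humans_can_see_room']
```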
and this brings us back to like amplification or debate like these two techniques that i've worked on in the past in this genre of like ai helping humans evaluate ai behavior yep right away we could hope to train an ai to do that we could hope to have almost exactly the same process of sgd that produced the original reward button maximizing system except instead of training it to maximize the reward button we are going to train it to give answers that humans like or answers that humans consider good and useful like accurate and useful and the way humans are going to supervise it is basically following along stepwise with the sort of deductions it's performing as it like turns this crank of deriving new facts from old facts so like it had some facts at the beginning uh maybe a human can directly supervise those we can talk about the case where the human doesn't know them which i think is handled in a probably similar way and then like as it performs more and more steps of deduction it's able to output more and more facts but if a human is able to see the facts that it had after like n minus one steps then it's much easier for a human to evaluate some proposed fact at the next step so you could hope to have this kind of evaluation scheme where like the human is incentivizing the system to report like knowledge about the world or whatever and then however the system was able to originally derive the knowledge in order to take some action in the world the system can also derive that knowledge in the service of making statements that a human regards as like useful and accurate so that's like kind of a typical example all right and the idea is that like for whatever task we might want an ai system to achieve we just like train a system like this and then we're like how do i do the right thing and then it just tells us and ideally it doesn't require like really fast motors or appendages that humans don't have or we know how to build them or something and it just like gives us some instructions and then we do it and that's how we get whatever thing we wanted out of the ai yeah we want to take some care to make everything like really competitive so probably we want to use this to get a reward function that we use to train our ai rather than try and use it to like output instructions that a human executes i want to be careful about there's a lot of details there and like not ending up with something that's a lot slower than the underlying ai would have been okay but i think this is the kind of case where i'm sort of optimistic about being able to say like look we can decouple like the rules of inference it uses to derive new statements and like the statements it started out believing we can decouple that stuff from the like decision at the very end like take this particular statement it had derived and use that as a basis for action so going back a few steps so you were talking about you know cases where you could and couldn't do the decoupling and you're worried about some cases where you couldn't do the decoupling and yeah i was wondering how that connects to your research like are you just thinking about those or do you have ideas for algorithms to deal with them yeah so i mentioned the central case we're thinking about is kind of this mismatch between the way that your ai most naturally ends up thinking about what's happening
yeah like the way the ai is thinking about what's happening and the way a human would think about what's happening and that kind of mismatch seems to me right now like a very central difficulty i think maybe if i just describe it it sounds like well sometimes you get really lucky it can be thinking about things just in a different language and that's the only difficulty i currently think that's a pretty central case or handling that case is quite important the family of algorithms we're thinking about most for handling that case is basically defining an objective over some correspondence or some translation between how your ai thinks about things and how the human thinks about things the conventional way to define that maybe would be to just like have a bunch of human labeling like there was a cat there was a dog whatever the concern with that is that you get this like instead of saying was there actually a cat it's translating like does a human think there's a cat so the main idea is to use objectives other than that like objectives that are not just a function of what it outputs like they're not the supervised objective of how well its outputs match human outputs you have other properties like you can have regularization like how fast is that correspondence or um how simple is that correspondence i think that's still not good enough you could have like consistency checks like saying well it said a and it said b and we're not able to label either a or b but we understand that like that combination of a and b is inconsistent this is still not good enough and so then most of the time has gone into ideas that are like basically taking those consistency conditions so saying we expect that like when there's a bark it's most likely there was a dog we think that like the model's output should also have that property then trying to look at like what is the actual fact about the model that led to that consistency condition being satisfied and having an objective that depends on this kind of like this gets us a little bit back into like mechanistic transparency hopes interpretability hopes where like the objective actually depends on like why that consistency condition was satisfied so you're not just saying like great you said that there's more likely to be a dog barking when there was a dog in the room we're saying like it is better if that relationship like if that's because of a single weight in your neural network or whatever that's like this very extreme case that's an extremely simple explanation for why that correlation occurred and we could have a more general objective that cares about like the nature of the explanation that cares about why that correlation existed right where the idea is that we want these like consistency checks we want them to be passed not because like we were just lucky with what situations we looked at but like because the model actually has some other structure such that it's reliably going to produce things that are right and we can tell because we can figure out what the consistency checks passing is due to is that right that's the kind of thing yeah and i think it ends up being or it has been a long journey hopefully it's a long journey that will go somewhere good uh right now that is up in the air but like some of the early candidates would be things like this explanation could be very simple so instead of asking for the correspondence itself to be simple ask for like the reasons that these consistency checks are satisfied to be very simple
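To make the shape of this kind of objective concrete, here is an illustrative loss in roughly that spirit. It is the editor's sketch, not the actual proposal under discussion, and the conversation is explicit that candidates of this shape do not yet work; the "bark implies dog" consistency condition and the hypothetical `correspondence` module are stand-ins.

```python
# Illustrative only: an objective of the general shape discussed here, for
# training a "correspondence" that translates the base model's internal state
# into human-level claims. It combines (1) a few trusted human labels,
# (2) a consistency condition like "a bark implies a dog", and (3) a
# simplicity penalty on the correspondence itself.
import torch
import torch.nn as nn

def correspondence_loss(correspondence: nn.Module, hidden_states, batch):
    # `correspondence` maps the base model's hidden state to logits for
    # human-language claims, here just ("dog present", "bark heard").
    claims = torch.sigmoid(correspondence(hidden_states))   # shape [batch, 2]
    p_dog, p_bark = claims[:, 0], claims[:, 1]

    # (1) Supervised term on the few examples humans actually labelled.
    labelled = batch["has_label"]          # boolean mask, assumed non-empty
    supervised = nn.functional.binary_cross_entropy(
        p_dog[labelled], batch["dog_label"][labelled].float()
    )

    # (2) Consistency term: penalise asserting a bark while denying a dog.
    consistency = torch.relu(p_bark - p_dog).mean()

    # (3) Simplicity term: prefer a small, low-norm correspondence.
    simplicity = sum(p.abs().sum() for p in correspondence.parameters())

    return supervised + consistency + 1e-4 * simplicity
```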
like it's more like one weight in your neural net rather than like some really complicated correlation that came from the input you could also ask for like that correlation to depend as little as possible or like on as few facts as possible about the input or about the neural network okay um i think none of these quite work and getting to where we're actually at would be kind of a mess but that's the research program it's mostly sitting around thinking about objectives of this form having an inventory of cases that seem like really challenging cases for finding this correspondence um and trying to understand like adding new objectives into the library and then trying to like refine like here are all these candidates here are all these hard cases how do we turn this into something that actually works in all the hard cases um it's very much sitting by a whiteboard it is a big change from my old life until like one year ago i basically just wrote code i spent years mostly writing code sure and now i just stare at whiteboards all right so changing gears a little bit um i think you're perhaps most well known for this kind of factored cognition approach to ai alignment that somehow involves decomposing a particular task into a bunch of subtasks um and then like training systems to like basically do the decomposition kind of um i was wondering if you could talk a little bit about how that fits into your view of like which problems exist and what your current thoughts are on this broad strategy yeah so i guess factored cognition refers to or like the factored cognition hypothesis was what a non-profit i worked with was calling this hope that arbitrarily complex tasks can be broken down into simpler pieces and so on ad infinitum potentially at a very large slowdown and this is relevant on a bunch of possible approaches to ai alignment if you imagine that humans and ai systems are trying to train ais to do a sequence of increasingly complex tasks but you're only comfortable doing this training when the human and their ai assistants are at least as smart as the ai they're about to train then if you just play training backwards you basically have this decomposition of the most challenging task the ai was ever able to do into simpler and simpler pieces and so i'm mostly interested in like tasks which cannot be done by any number of humans with however long they're willing to spend during training those are very hard to do by any of these approaches it seems so this is true for safety via debate where the hope is you have several ais arguing about what the right answer is uh it's true for like iterated distillation and amplification where you kind of have a human with ai assistants training a sequence of increasingly strong ais and it's true for like recursive reward modeling which is i guess an agenda that came from a paper out of deepmind from jan leike who took over for me at open ai where you're sort of trying to define a sequence of like reward functions for more and more complex tasks uh using assistants trained on the preceding reward functions anyway it seems like all of these approaches run into this kind of common upper bound i think other people might dispute this but i would think of it as a crude upper bound based on the fact that everything you ever train an ai to do in any of these ways can be broken down
And sometimes that can be non-obvious. It's worth pointing out that search can be trivially broken down into simpler pieces: if a human can recognize a good answer, then a large enough number of humans can find one, just by having a ton of humans try a bunch of things until someone finds a good answer. My current take is that it has always been the case that you can learn things about the world which you could not have derived by breaking a question like "what is the height of the Eiffel Tower?" down into simpler and simpler questions. The only way you're going to learn that is by going out and looking at the height of the Eiffel Tower, or maybe doing some crazy simulation of Earth from the dawn of time. ML in particular is going to learn a bunch of those things; gradient descent is going to bake a bunch of facts like that into your neural network. So if doing what the ML does is decomposable, it would have to be through humans looking at all of the training data the ML system ever saw while it was trained and drawing their own conclusions from that. That is in some sense very realistic, and a lot of humans really can do a lot of things. But for all the approaches I listed, when you're doing these task decompositions it's not only the case that you decompose the final task the AI does into simpler pieces; you decompose it into simpler pieces all of which the AI is also able to perform. Learning doesn't have that feature. You can decompose learning into smaller pieces in some sense, but they're not pieces that the final learned AI was able to perform. The learned AI is an AI which knows facts about the Eiffel Tower; it doesn't necessarily know how to go look at Wikipedia articles and learn something about the Eiffel Tower.

So for these approaches that rely on factored cognition, I now most often think of having both the humans decomposing tasks into smaller pieces and a separate search that runs in parallel with gradient descent. I wrote a post, and Beth wrote an explainer, on imitative generalization a while ago. The idea is: instead of decomposing tasks into tiny sub-pieces that a human can do, we're going to learn a big reference manual to hand to a human, and we're going to use gradient descent to find the reference manual. For any given reference manual, you can imagine handing it to humans and saying: hey human, trust the outputs from this manual, believe it was written by someone benevolent who wants you to succeed at the task, and now, using that, do whatever you want in the world. There's a bigger set of tasks that a human can do after you've handed them this reference manual; it might tell them the height of the Eiffel Tower, say. The idea in imitative generalization, which is very related in spirit to the decoupling I was talking about before, is that instead of searching over a neural network we're going to search over a reference manual that we want to give to a human, and then instead of decomposing our final task into pieces that a human can do unaided, we're going to decompose it into pieces that a human can do using this reference manual.
You might then imagine that stochastic gradient descent bakes a bunch of facts about the world into this reference manual, the sort of things the neural network would otherwise just know, and then we give those to a human and say: go do what you will, taking all of these facts as given. Now the human can do a bigger set of tasks, or answer a bunch of questions they otherwise wouldn't have been able to answer. And then we can get an objective for this reference manual. If we're producing the reference manual by stochastic gradient descent, we need some objective to actually optimize, and the proposal is: give the reference manual to some humans (or to a large team of humans who will eventually break down the task), ask them to do whatever task your neural network was going to be trained on, such as predicting the next word of a web page; look at how well the humans do at that predict-the-next-word task; and then, instead of optimizing your neural network by stochastic gradient descent to make good predictions, optimize the reference manual you're giving the humans, by gradient descent, so that it causes the humans to make good predictions.

That doesn't change the factored cognition hypothesis as stated, because this search is also something that can be very easily split across humans: you can just say "loop over all the reference manuals, and for each one run the entire process." But in flavor it's pretty different, in that your trained AI isn't doing any one of those subtasks; some of the subtasks are now being parallelized across the steps of gradient descent, or across the different models being considered in gradient descent. That is most often the kind of thing I'm thinking about now. And it suggests this other question: if your reference manual is just text, how big is that manual going to be compared to the size of your neural network, and can you search over it as easily as you can search over your neural network? I think the answer in general is that you're completely screwed if that manual is in text.

We mentioned earlier that it's not obvious humans can't just do all the tasks you want to apply AI to; you could imagine a world where we only apply AI to tasks where humans are able to evaluate the outputs. In some sense everything we're talking about is just extending the range of tasks to which we can apply AI systems. Breaking tasks down into subtasks is one way of extending that range: now we're looking not at tasks a single human can perform but at tasks some large team of humans can perform. Adding this reference manual further extends the set of tasks a human can perform. If you're clever, I think it extends it to the set of tasks where what the neural net learned can be cashed out as the kind of declarative knowledge that fits in your reference manual. But, maybe not that surprisingly, it does not extend it all the way: text is limited compared to the kinds of knowledge you can represent in a neural network. So maybe that's the kind of thing I'm thinking about now.
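A very rough sketch of the training loop being described, under strong simplifying assumptions: the "humans consulting a reference manual" step is replaced by a cheap stand-in function so the loop runs at all, and the manual is a differentiable vector rather than literal text. All of the names below are illustrative, not from any published implementation.

```python
import torch

# Illustrative sketch of the imitative-generalization objective (hypothetical names throughout).
MANUAL_DIM = 1024                      # assumed size of the learned "reference manual"

def human_with_manual(z, inputs):
    # Stand-in for "humans consulting the manual z to make predictions".
    # In reality this is the expensive, human-in-the-loop part; here it is a dummy linear read-out.
    return inputs @ z[: inputs.shape[-1]]

def prediction_loss(predictions, targets):
    return torch.nn.functional.mse_loss(predictions, targets)

z = torch.randn(MANUAL_DIM, requires_grad=True)   # the manual is what gets optimized
opt = torch.optim.Adam([z], lr=1e-3)

for _ in range(100):                   # toy data standing in for e.g. next-word prediction
    inputs = torch.randn(32, MANUAL_DIM)
    targets = torch.randn(32)
    loss = prediction_loss(human_with_manual(z, inputs), targets)
    opt.zero_grad()
    loss.backward()                    # gradients update the manual, not a policy network
    opt.step()
```

The only point of the sketch is the shape of the objective: gradient descent updates the manual so that humans-plus-manual predict well, and the humans, rather than a trained policy network, then use the manual downstream. The worry about text raised above is that a manual rich enough to carry what a billion-parameter network knows may be far harder to search over than the network itself.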
Okay, and what's the limitation of text versus what you could potentially represent?

So imagine you have your billion-parameter neural network. A simple example: imagine that neural network is running some simulation, representing the simulation it wants to do, like "if there's an atom here, there should be an atom there in the next time step." That simulation is described by a billion numbers, and searching over a reference manual big enough to contain a billion numbers is a lot harder than searching over the billion weights of a neural network. More brutally, a human who has that simulation in some sense doesn't really know enough to actually do stuff with it; they can tell you where the atoms are, but they can't tell you where the humans are. That's one example. Another: suppose there's some complicated set of correlations (things that are more like skills will tend to have this feature), like, if I'm an image classification model, I know that a particular kind of curve is really often associated with something being part of a book. I can describe that in words, but it gets blown up a lot in the translation process and becomes harder to search over.

So the things we've talked about have mostly been your thoughts about objectives to give AI systems, more at this outer alignment style stage. For inner alignment style problems, where the AI system has some objective and you want to make sure it's really devoted to pursuing that objective even if the situation changes, or in the worst case, I'm wondering if you have thoughts on solutions you're particularly keen on in those settings.

Yes. I have two categories of response. One is technical research we can do that helps with this cluster of inner alignment, catastrophic failure, and out-of-distribution problems across the board, or in many possible worlds; the other is how this would be handled assuming my own research project were successful. Maybe I'll start with what people are doing that seems helpful. The most basic thing I'm excited about is generating hard cases and throwing them at your AI. If you have some AI system and you're concerned that on some very rare inputs it may decide it's time to kill humans, then what I'm going to do is try to generate the kinds of inputs that might convince my AI to kill humans, and hopefully, if I try for a really long time and can't find any, it's less likely we're going to encounter any out there in the real world.

And the theory is that it's in some sort of safe box, so that if you succeed in your search it does not in fact kill all the humans?

Yeah. There's a distinction where, when training an AI, you may often need to see the consequences of its actions, whereas for this kind of training I wouldn't want to be running the things my AI proposes and checking the consequences, hopefully. There are more subtleties there, definitely, but we're not actually doing the things the AI suggests; we just observe the action it proposes.
It takes an action only in a quote-unquote sense: it outputs an action, and there are no actuators to which that action gets sent. We're imagining some AI which we might ultimately hook up to the internet, but right now we look at the proposed internet actions and ask whether they seem good. The hope is that we've solved what I file more under the outer alignment side: being able to assess, for a given proposed action, whether that action would be catastrophically bad, and to make predictions about that as well as the AI can. In particular, if the AI has calculated that this action will lead to a robot uprising killing all humans, then we can look at the action, conclude that it is not a good action, and then ding our AI. If you have that ability, then you can try, in your little sandbox, to throw tons of actions or tons of situations at your AI and see if it ever proposes any actions that look like they would be really bad.

That feels like the most basic ingredient, but it does not seem sufficient. The basic reason is that even if bad cases exist, it can be really hard to find them. So from there you want to make the adversary's job, the job of whoever is trying to break your model and get it to do something bad, as easy as you can possibly make it. An example: instead of having them actually provide inputs to your system, maybe you let them go in and monkey around with some of the neurons in the system and say, "okay, I can't actually find an input that causes this neuron to fire, but if I could, the model would do this bad thing." That becomes quite subtle, because as I monkey around with the neurons it eventually becomes unclear what I'm looking at. When I had an input, I could see that my AI saw the following web page and responded by sending the following packet, and I could tell that was a really bad thing for it to do; say, it replied to the help-wanted ad for someone running their robot rebellion. It's harder if I'm monkeying with its neurons: I see that it outputs a given packet, but I don't necessarily know what state of the world corresponded to those neurons being in that configuration. For almost any action there's some state of the world in which it's good to do that action, and bad actions may not look that bad in isolation; they're bad because of details of the context. So if I got the AI to take some action by monkeying with its neurons, it's a little unclear how I say whether that action was actually bad or not. Still, you want to give your adversary that kind of ability, to make it easier for them to show: here's a case where I could get it to do something bad; I can't actually find the input, but you should be scared, because it seems like this could happen. Then you also want to give them interpretability tools. I want them to be able to understand what the AI is thinking and to say, "it looks like the AI is checking whether any humans are watching and doing something nice if they are, so let's really optimize on that to find a case where it thinks no humans are watching and see if it's still nice." That's a cluster of techniques, all of which are designed to help you really hammer at the question: are there cases where this AI might do something bad, and can we find them? If we can find them, then we can train on them before we ever actually take the action; we can provide a gradient saying "don't do something that's really bad in that hypothetical case."
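A toy sketch of that loop, with the relaxed variant (perturbing internal activations rather than searching over inputs) noted only in a comment. Every component here, including the `looks_catastrophic` stand-in for the overseer's judgment that a proposed action would be unacceptable, is hypothetical and is only meant to show the shape of adversarial training:

```python
import torch

# Toy sketch of adversarial training against rare catastrophic behaviour.
# All components are illustrative stand-ins, not a real system.

policy = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def looks_catastrophic(action_logits):
    # Stand-in for the overseer judging "this proposed action would be really bad".
    # Here we simply treat a high score on action index 3 as the "bad" action.
    return action_logits[..., 3]

def find_hard_case(policy, steps=50):
    # The adversary searches over *inputs* for one that makes the policy propose the bad action.
    # (The "relaxed" variant would instead perturb the policy's internal activations.)
    x = torch.randn(1, 16, requires_grad=True)
    adv_opt = torch.optim.Adam([x], lr=0.1)
    for _ in range(steps):
        badness = looks_catastrophic(policy(x)).sum()
        adv_opt.zero_grad()
        (-badness).backward()          # the adversary maximizes badness
        adv_opt.step()
    return x.detach()

for _ in range(100):
    hard_x = find_hard_case(policy)
    # Train the policy against the hard case: penalize proposing the catastrophic action.
    loss = looks_catastrophic(policy(hard_x)).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that, as in the discussion above, nothing in this loop ever executes a proposed action; both the adversary's search and the corrective gradient happen entirely in the sandbox.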
Those techniques blur pretty continuously into verification work, in my mind: verification is in some sense the limit of being willing to monkey with the neurons, where you have some formal specification of how much the adversary is allowed to monkey with them. That's a category of research directions that people pursue for a variety of motivations out in the world, and I'm pretty excited about a lot of that work.

And on your favourite approaches, how does this pan out?

I mentioned before this hoped-for decoupling. The concern is the case where gradient descent finds a neural network that is trying to figure out how to mess with the humans, and then, when an opportunity comes along, it messes with the humans. In some sense the nicest thing to do is to say: the reason we wanted that AI was just that it encodes some knowledge about how to do useful stuff in the world, so we'd like to set things up so that it's easier for gradient descent to learn just the knowledge about how to behave well in the world, rather than to learn that knowledge embedded within an agent that's trying to screw over humans. That is hard, or it seems quite hard, but I think it's the biggest challenge in my mind. The coupling of outer and inner alignment here is that this seems almost necessary either for a full solution to outer alignment or a full solution to inner alignment, so I expect to be more in the regime of trying to kill two birds with one stone. These are the kinds of examples of decoupling we described before: you hope that you only have to use gradient descent to find this reference manual, and that from there you can much more easily pin down what all the other behaviour should be; and you hope that the reference manual is smaller than the scheming AI which has all of the knowledge in that reference manual baked into its brain. It's very unclear whether that can be done; I think it's also fairly likely that in the end that isn't quite how it looks, and that it has to be coupled with more normal measures like verification or adversarial training.

All right, so I'd now like to talk a little bit about your research style. You mentioned that recently the way you do research is you sit in a room and think about stuff; is there any chance you can give us more detail on that?

I think the basic organizing framework is something like this. We have some current set of algorithms and techniques that we'd use for alignment. Step one is to try to dream up some situation in which your AI would try to kill everyone despite your best efforts using all the existing techniques: a situation describing the kind of thing we're worried gradient descent might most easily learn, the way the world is such that the thing gradient descent learned tries to kill everyone, and why you couldn't have gotten away with learning something else instead. We tell some story that culminates in doom which is hard to avoid using existing techniques. That's step one.
There's maybe a step 1.5, which is trying to strip that story down to its simplest moving parts, the simplest sufficient conditions for doom. Then step two is trying to design some algorithm, thinking only about that case: in that case, what do we want to happen, what would we like gradient descent to learn instead, or how would we like to use the learned model instead; what is our algorithm that addresses that case? The last three months have just been working on one very particular case where I currently think existing techniques would lead to doom, along the kinds of lines we've been talking about, grabbing the camera or whatever, and trying to come up with some algorithm that works well in that case. Then, if you succeed, you get to move on to step three, where you look again over all of your cases and all of your algorithms and try to say something unifying: we know what we want to happen in each of these particular cases; can we design one algorithm that does the right thing in all of them? For me that step is mostly a formality at this stage, or at least not very important at this stage; mostly you just go back to step one once you have your new algorithm and ask what the new case is that we don't handle. Normally I'm pretty lax about the plausibility of the doom stories I'm thinking about at this stage, because I have some optimism that in the end we'll have an algorithm that results in your AI just never deliberately trying to kill you, and hopefully it will end up being very hard to tell any story about how your AI ends up trying to kill you. While I have that hope, I'm willing to say: here's a wild case, a very unrealistic thing that gradient descent might learn, but it's still enough of a challenge that I want to design an algorithm that addresses it. I hope that working with really simple cases like that helps guide us: if there's any nice simple algorithm that never tries to kill you, thinking about the simplest cases you can is a nice, easy way to make progress towards it. So most of the action is in what we actually do in steps one and two, but at a high level that's what I'm doing all the time.

And is there anything you can broadly say about what happens in steps one or two, or does that depend a lot on the day or the research?

In step one, I think the main question people have is: what is the story? What is the type signature of that object; what does it look like written out in words? Most often I'm writing down some simple pseudocode: here is the code your AI could amount to, here is the code you could imagine your neural network executing. Then I'm telling some simple story about the world: we live in a world governed by the following laws of physics, with the following actors, and in that world this program is actually pretty good. And then I state some assumption about how SGD works that's consistent with everything we know right now.
Very often that's something like: we think SGD could find any program that is the simplest program achieving a given loss. So the story has the sketch of some code (often that code will have some question marks in it, where it looks like you could fill them in to make the story work), some description of the environment, and some description of facts about gradient descent, and then I bounce back and forth between that and working on the algorithm. As for working on the algorithm: at the end of the day, most of the algorithms take the form "here's an objective; try minimizing it with gradient descent." So basically the algorithm is an objective, and then you look at your story and ask whether, in this story, it's plausible that minimizing this objective leads to the bad thing. Often part of the algorithm is also a description of the good thing we hope would be learned instead of the bad thing in the original story. In the original story, you have an AI that loops over actions until it finds one it predicts will lead to smiling human faces on camera; that's bad because, in the world we've constructed, the easiest way to get smiling human faces on camera involves killing everyone and putting smiles in front of the camera. What we want to happen instead is that other algorithm I mentioned, where the system outputs everything it knows about the world, and we hope that includes the fact that the humans are dead. So a proposal will involve some way of operationalizing what it means for this particular bad algorithm we imagined, the one doing a simulation or whatever, to output what it knows about the world, and then what objective you would optimize with gradient descent to get the good program you wanted instead of the bad one you didn't want.
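In the spirit of the "pseudocode with question marks" described here, the contrast between the failure story and the hoped-for alternative can be written out schematically. All of the helper names below are hypothetical placeholders standing in for pieces of the story, not components of any actual system:

```python
# Schematic pseudocode for the two programs in the story (hypothetical names throughout).

def bad_policy(observation):
    # Failure story: search over actions for the one predicted to put smiling
    # faces on camera, regardless of what happens off-camera.
    return max(candidate_actions(observation),
               key=lambda a: predicted_camera_smiles(world_model, observation, a))

def hoped_for_policy(observation):
    # Hoped-for alternative: surface everything the learned world model implies
    # (including, say, "the humans are dead"), and let an overseer who sees those
    # beliefs choose the action.
    beliefs = report_world_state(world_model, observation)
    return overseer_choose_action(observation, beliefs)
```

The open part of the proposal, as described above, is operationalizing something like `report_world_state` and finding an objective whose minimum is the second program rather than the first.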
Okay. The next question I'd like to ask is: what do you see as the most important big-picture disagreements you have with people who already believe that advanced AI technology might pose some kind of existential risk, and that we should worry about that and work to prevent it?

Broadly, I think there are two categories of disagreement; I'm flanked on two different sides. On one side is the crowd around the Machine Intelligence Research Institute, which has a very pessimistic view about the feasibility of alignment and about what it's going to take to build AI systems that aren't trying to kill you. On the other side are researchers, often at ML labs, who tend to be more in the camp of "it would be really surprising if an AI trained with this technique were actually trying to kill you." There are nuances to both of those disagreements. You could split the second group into one camp that says "actually this problem isn't that hard, and we need to be good at the basics, because the greatest risk is that we mess up the basics," and a second camp that says "actually we have no idea what's going to be hard about this problem, and what it's mostly about is getting set up to collect really good data as soon as possible so that we can adapt to what's actually happening." It's also worth saying that it's often unclear which of these are empirical disagreements and which are methodological differences: I have the thing I'm doing, and I think there's room for lots of people doing different things, so there are some empirical disagreements, but not all the differences in what we do are explained by them. Some of it is just that Paul is a theorist who's going to do some theory and has a methodology centered on theory; and I am excited about theory, but it's not always the case that when I'm doing something theoretical it's because I think the theoretical thing is dominant.

Going into those disagreements: the one with the MIRI folks is maybe more weedsy and doesn't have a super short description; we can return to it in a bit if we want. On the side of people who are more optimistic, who think existing techniques are more likely to be okay, I think the most common disagreement is about how crazy the tasks our AIs will be doing are, or how alien the reasoning of AI systems will be. People who are more optimistic tend to say: systems will be operating at high speed and doing things that are maybe hard for humans, or a little beyond the range of human abilities, but broadly humans will be able to understand the consequences of the actions they propose fairly well; they'll be able to fairly safely look at an action and decide whether to run it; they'll be able to get those AI systems' help and mostly leverage those systems effectively, even if the systems are just trying to do things that look good to humans. So often the disagreement is that I'm imagining AI systems that reason in super alien ways, and someone else thinks it will mostly be thinking through consequences, or thinking in ways that are kind of legible to humans, and thinking fast in legible ways gets you a lot. In some sense I am very long on the view that thinking fast in ways legible to humans is very powerful; I believe that a lot more than most people. But, especially now that I'm working on the more theoretical end, I'm often thinking about all the cases where that doesn't work, and some people are more optimistic that the cases where it works are enough. That's either an empirical claim about how AI will be, or sometimes a social claim about how important it is to be competitive. I really want to be able to build aligned AI systems that are economically competitive with unaligned AI, and I'm really scared of a world where there's significant tension there, whereas other people are more relaxed: it's okay if the aligned systems are a little slower or a little dumber; people are not going to want to destroy the world, so they'll be willing to hold off a little on deploying some of these things. And then on the empirical side, for people who think theoretical work is less valuable and that we should mostly focus on empirics or just do other stuff, I'd guess one common disagreement is that I'm reasonably optimistic about being able to find something compelling on paper. I expect the methodology I described, trying to find an algorithm for which it's hard to tell a story about how your AI ends up killing everyone, to actually terminate with "yep, here was an algorithm, it looks pretty good to us, and we can't tell a story about how it's uncompetitive or lethal."
Other people think that's extremely unlikely to be where this goes, and that it's just going to be years of going around in circles until eventually I give up. That's a common disagreement on both sides; it's probably also the crux with the MIRI folks, in some sense.

You said it was perhaps hard to concisely summarize your differences with this group of people centered, say, at the Machine Intelligence Research Institute, or MIRI for short. Could you try?

Definitely. The upshot is that I am optimistic about being able to find an algorithm which can align deep learning, a system which is closely analogous to and competitive with standard deep learning, whereas they are very pessimistic about the prospects for aligning anything that looks like contemporary deep learning. That's the upshot. They're more in the mindset of: find any task you can do with anything like deep learning, be willing to take great pains and huge expense to do just that one task, and then hopefully find a way to make the world okay after that; or maybe later build systems that are very unlike modern deep learning. Whereas I'm pretty optimistic, where "pretty optimistic" means I think there's something like a 50-50 chance that we can have a nice algorithm that lets you basically do something like deep learning without it killing everyone. That's the upshot, and the reasons for it are pretty weedsy. Intuitively it's something like this: if you view the central objective as being about decoupling, trying to learn what your unaligned agent would have known, then there are a bunch of possible reasons that decoupling could be really hard. Fundamentally, the cognitive abilities and the intentions could come as a package. This is also really core to MIRI's disagreement with more conventional ML researchers, who ask why you would build an agent at all rather than just building a thing that helps you understand the world; on the MIRI view, there's likely to be a really deep coupling between those things. I'm mostly working on other ways that decoupling can be hard besides the core one MIRI has in mind. I think MIRI is really into the idea that there's some kind of core of being a fast, smart agent in the world, and that that core is really tied up with what you're using it for: it's not coherent to talk about being smart without having developed that intelligence in the service of a goal, or to talk about factoring the two apart. There are some fairly complicated philosophical beliefs here about the nature of intelligence which I think Eliezer, especially, is fairly confident in; he thinks it's mostly pretty settled. So I'd say that's probably the core disagreement. There's a secondary disagreement about how realistic it is to implement complex projects. I think their take is: suppose Paul comes up with a good algorithm; even in that long shot, there's no way it gets implemented rather than something easier that destroys the world, because projects fail the first time and this is a case where we have to get things right the first time, such that you're not going to have much of a chance. That's a secondary disagreement.
Sort of related to that, I'm wondering what you think your most important uncertainties are: uncertainties such that, if you resolved them, it would change in a big way what you were motivated to do, let's say in order to reduce existential risks from AI.

Maybe top four. One would be: is there some nice algorithm on paper that definitely doesn't result in your AI killing you and is definitely competitive, or is that a pipe dream and you just need an algorithm that works in the real world? That would have a fairly obvious impact on what I'm doing, and I'm reasonably optimistic about learning a lot about it over the coming years. I've been thinking recently that maybe by the end of 2022, if this isn't going anywhere, I'll pretty much know and can wind down the theory work, and hopefully significantly before then we'll have big wins that make me feel more optimistic. So that's one uncertainty: is this going to work? A second big uncertainty is whether existing best practices in alignment would suffice to align powerful AI systems, or would at least buy us enough time for AI to take over the alignment problem from us. I think eventually AI will be doing alignment research rather than us, and it's just a question of how late in the game that happens and how far existing alignment techniques carry us. It's fairly plausible that existing best practices, implemented well by a sufficiently competent team that cared enough about alignment, would be sufficient to get a good outcome, and in that case it becomes much more likely that instead of working on algorithms I should be working on bringing practice up to the limits of what is known. Maybe I'll just do three, not four. Three, and maybe this is a little more silly, is that I feel legitimate moral uncertainty over, well, the broader question of how important alignment is relative to other risks. One big consideration for the value of alignment is how good it would be if AI systems took over the world from the humans. My default inclination is that that doesn't sound very good, but it sounds a lot better than nothing in expectation, a lot better than a barren universe; it would matter a lot. If you convinced me that number was higher, at some point I would start working on other risks associated with the transition to AI. That seems like the least likely of these uncertainties to actually get resolved; I find it kind of unlikely I'll move that much from where I am now, which is roughly: maybe it's half as good for AIs to just take over the world from humans as it is for humans to choose what happens in space, and that's close enough to zero that I definitely want to work on alignment, and close enough to one that I also definitely don't want to go extinct.

So my penultimate question, or it might be antepenultimate depending on your answer, is: is there anything I have not yet asked but you think that I should have?

It seems possible that I should have been plugging all kinds of alignment research that's happening at all sorts of great organizations around the world, and I haven't really done any of that.
I'm really bad at that, though; I'm just going to forget someone and then feel tremendous guilt in my heart.

How about, in order to keep this short and to limit your guilt: what are the top five people or organizations you'd like to plug?

Oh man, that's just going to increase my guilt, because now I have to choose five. All right, some people I think a lot of...

Perhaps name five, any five.

I think there are a lot of ML labs doing good work: labs that view their goal as getting to powerful, transformative AI systems and are doing work on alignment, so DeepMind, OpenAI, Anthropic. I think all of them are gradually converging; there's this gradual crystallization of what we all want to do. That's one, and maybe I'll do three things. Second would be academics. There are a bunch of people; I'm friends with Jacob at Berkeley, Jacob Steinhardt, and his students are working on robustness issues with an eye towards long-term risks. And a ton of researchers at your research organization, which I guess we've probably talked about on other episodes.

I've talked to some of them, but I don't think we've talked about it as a whole. It's the Center for Human-Compatible AI; if people are interested they can go to humancompatible.ai to see a list of people associated with us, and for each person you can look at the work they did. We might have a newsletter or something. I did not prepare for this; sorry for putting you on the spot with the pitching.

Yeah, I'm not going to do justice to the academics; there are a bunch of academics, often just individuals here and there with their groups, doing a lot of interesting work. And then there are the weird EA non-profits, the weird effective altruist and conventional alignment crowd non-profits. Probably the most salient to me are Redwood Research, very salient right now because I've been talking with them a bunch over the last few weeks; they're working on robustness broadly, this adversarial training stuff, how you make your models definitely not do bad stuff on any input. Ought, which is a non-profit that has been working on how you actually turn large language models into tools that are useful for humans. And the Machine Intelligence Research Institute, which is the most paranoid of all organizations about AI alignment; that's probably their core value added. So maybe those. There are a lot of people doing a lot of good work; I didn't plug them at all throughout the podcast, but I love them anyway.

All right, so speaking of plugs: if people have listened to this podcast and they're now interested in following you and your work, what should they do?

I write blog posts sometimes at ai-alignment.com, and I sometimes publish to the Alignment Forum. Depending on how much you read, it may be your best bet to wait until spectacular, exciting results emerge, which will probably appear in one of those places, and also in print. We've been pretty quiet over the last six months, and I expect to be pretty quiet for a while and then to have a big write-up of what we're basically doing and what our plan is. I don't know when this podcast is appearing, but sometime in early 2022 or something like that.

I also don't know when it's appearing; we did date ourselves to infrastructure week, one of the highly specific times. Okay, well, thanks for being on the show.
Thanks for having me.

This episode is edited by Finan Adamson, and Justis Mills helped with transcription. The financial costs of making this episode are covered by a grant from the Long-Term Future Fund. To read a transcript of this episode, or to learn how to support the podcast, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.

Related conversations

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety questions through the lens of Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript)

Med 0 · avg -5 · 133 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment questions through the lens of Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript)

Med -6 · avg -7 · 120 segs

AXRP

11 Apr 2024

AI Control with Buck Shlegeris and Ryan Greenblatt

This conversation examines technical alignment questions through the lens of AI Control with Buck Shlegeris and Ryan Greenblatt, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript)

Med -6 · avg -9 · 174 segs

Future of Life Institute Podcast

7 Jan 2026

How to Avoid Two AI Catastrophes: Domination and Chaos (with Nora Ammann)

This conversation examines core safety questions through the lens of How to Avoid Two AI Catastrophes: Domination and Chaos (with Nora Ammann), surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript)

Med 0 · avg -3 · 85 segs

Counterbalance on this topic

Ranked with the mirror rule described in the methodology: picks sit closer to the opposite side of the spectrum from your score on the same axis, with picks in the same lens preferred. Each card plots you and the pick together.