First Principles of AGI Safety with Richard Ngo
Why this matters
Auto-discovered candidate. Editorial positioning to be finalized.
Summary
Auto-discovered from AXRP. Editorial summary pending review.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forwardMixedOpportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 79 full-transcript segments: median -6 · mean -8 · spread -30–8 (p10–p90 -21–0) · 19% risk-forward, 81% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: medium.
- - Emphasizes safety
- - Emphasizes ai safety
- - Full transcript scored in 79 sequential slices (median slice -6).
Editor note
Auto-ingested from daily feed check. Review for editorial curation.
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video DxwXLCQY1ns · stored Apr 2, 2026 · 2,763 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/first-principles-of-agi-safety-with-richard-ngo.json when you have a listen-based summary.
Show full transcript
[Music] hello everybody today i'll be speaking with richard mull richard is a researcher at openai where he works on a governance and forecasting he also was a research engineer at deepmind and designed a course ai safety fundamentals we'll be discussing his report agi safety from first principles as well as his debate with eliezer yukowski about the difficulty of ai alignment for links to what we're discussing you can check the description of this episode and you can read the transcripts at axrp.net well richard welcome to the show thanks so much for having me you wrote this report agi safety from first principles what is agi what do you mean by that term yeah so fundamentally it's ai that can perform as well as humans or comparably to humans on a wide range of tasks because it's got this thing called general intelligence that humans also have and i think this is not a very precise definition i think because we just don't understand what general intelligence really is like how to you know specifically pin down this concept but it does seem like there's something special about humans that allows us to do all the things we do and that's the thing that we expect agi to have and so i shouldn't be imagining that there's like some really technical definition of general or intelligence that you're using here no i think it's much more like a pointer to a let's say pre-paradigmatic concept that we you know we kind of know the shape of it but we don't understand in mechanistic detail how general intelligence works it's got these components like you know of course you need memory of course you need some kind of planning but like exactly how to characterize the ways in which you know which combinations of these count as general intelligence or something we're not after so when when you're talking about agi often there's this idea of agi that's smarter than people um which implies that intelligence is some kind of scalar concept where you can have like more or less of it and like it's really significant whether you have more or less of it do you think that's basically a correct way to think about intelligence or should we just not go with that yeah i think that seems like a good first approximation uh you know there's this great uh parody paper i think it's called on the impossibility of super sized machines which says you know size is not a clear concept like you know there are different dimensions of size like height and weight and width and so on um so you know there's no uh single criterion of when you count something as bigger than something else but nevertheless it does make sense to talk about machines that are bigger than humans and so in roughly the same way i think it makes sense to talk about machines that are more intelligent than humans and this is kind of i'd characterize it as something which draws from a sort of long intellectual history of realizing that humans were like a little bit less special than we used to think that you know we're not in the center of the universe we're not uh you know that different from other animals and now uh that we aren't at the pinnacle of intelligence okay um why should we think of intelligence as this kind of scalar thing one way you can do it is you can uh hope that we'll come up with a sort of more precise theory of intelligence you know in the same way i like the some metaphors from the past so when we think about the concept of energy or when we think about the concept of information or even the concept of computing these were all things that were 
like you know very important abstractions for making sense of the past like you can think about the industrial revolution as you know this massive increase in our ability to harness energy even though at the time the industrial revolution started we didn't really have a precise concept of energy so one one hope is to say well look uh we're going to get a more precise concept as time goes by just as we did for energy and computation and information and uh you know whatever this precise concept is going to be uh will end up thinking that you can have more or less of it um probably the more robust intuition here is just like it sure seems that if you make things faster if you uh make brains bigger um they just get better at doing a bunch of stuff and uh you know it seems like that broad direction is the dimension that i want to talk about okay um so that's intelligence i guess we're also talking about general intelligence um which i think you can trust with narrow intelligence where if i recall correctly narrow intelligence is something that can kind of do one specific task but a general intelligence is something that can you know do a wide variety of tasks is that approximately right yeah that seems right um i think there are a couple of different uh angles on uh general intelligence so uh one one angle that is often taken is just thinking about uh you know like being able to deal with a wide range of environments and that's kind of uh an easy thing to focus on because it's very uh you can very clearly imagine having a wide range of different tasks take on generality you might have is that generality requires the ability to be very efficient in learning so only you know being able to pick up a new task with only a couple of different demonstrations and another one you might have is that generality requires um the ability to act coherently over long periods of time so you can you know you don't have to um you can carry out tasks on time frames not just of hours but on days or weeks or months or so on i think these all kind of like tie together because fundamentally when when you're doing something in a narrow domain you can memorize a bunch of heuristics so you can encode all the information within uh the system that you're using uh like the weights of a neural network whereas when you've got either like long time horizons or a wide range of environments or not very much data like what that means is you need something which leverages uh which isn't just memorized which is uh in some sense um able to get more out of less okay cool and i i guess we're going to take it as sort of given for this discussion that we should expect to get agi at some point um for people who are wondering who are perhaps doubtful about that um can you give people a sense of why you think that might be possible yeah so i think there are a couple of core intuitions which seem valuable so the first one is that um there are a lot of dis analogies between neural networks and human brains but there are also a lot of analogies and so to me it seems overconfident to think that scaling up neural networks with presumably a bunch of algorithmic improvements and architectural improvements and so on um but once you're using as much compute either as you know an individual human brain or maybe as all human brains combined or maybe as like you know the entire history of humanity which are all milestones that are like feasible over a period of decades um at that point it seems like you really need some sort of specific reason to 
think that neural networks are going to be much worse or much less efficient than human brains in order to not place a significant credence on those systems being general uh or with a capability of generality in the same way as humans another intuition here is that um you know a lot of people think about machine learning as a series of paradigm shifts so we started off with symbolic ai and then we moved into sort of more like statistical style machine learning and now we're in the area of deep learning i think this is a little misleading i think uh the way i prefer to think of it is that neural networks were just plugging along doing their thing the whole time like you know the perceptron and the first um models of neurons were around before you know people think of the field of ai as having officially started and so uh if you just like look at this trend of your networks getting more and more powerful as you throw more and more compute with them and as you make relatively few uh big algorithmic changes then it starts to seem more plausible that actually you don't need a big paradigm shift in order to reach agi you can just keep scaling things up because that actually has worked for the last you know 70 years or so okay so it seems like your case is really based on this idea of like we're going to have modern machine learning or deep learning and we're going to scale it up you know we're going to scale it up in some ways maybe we're going to make discrete changes in like algorithms or architectures and other ways in this process of scaling what similarities do you see as going to be retained from now until agi and what important differences do you think there are going to be yeah so i think the core similarity um you know neural networks seem like they're here to stay reinforcement learning seems like it's here to stay um self-supervised learning seems pretty fundamental as well so these all feel very solid um then i think you know the the fundamental question is something like how much structure do you need to build in to your algorithms versus how much structure can you kind of like metal learn or learn via architecture search or things like that um and you know that feels like a very open question to me uh it seems like in in some cases we have these like very um useful algorithms like um monte carlo research and if you could if you can build that into your system then why wouldn't you do that uh and it seems really hard to you know learn some something like um mcds and mcts is just sort of randomly kind of looking at okay if i take various sequences of actions you know how well is that likely to be and i sort of randomly you know maybe i randomly pick actions or pick actions that i think are more likely to be optimal that kind of thing right exactly but then you might think uh look it's just hard to um so so yeah you've got these kind of different impulses on one hand we've got a few uh algorithms or a few like ways of designing networks that seem very clever and like maybe we can come up with more clever things on the other hand uh the sort of idea of the bitter lesson from rich saturn which is just that uh most clever innovations don't really last um and yeah so i i feel very uncertain about how much algorithmic progress you need in order to continue scaling up ai okay so a question i have sort of along this front current ai systems at least seem like a lot of what they're doing is recognizing patterns in a sort of shallow way and maybe um when you have a neural network and you want 
to get it into a new domain you need to expose it to the new patterns until it can like pick up the new patterns um and maybe it draws something on what the old patterns were like but it's it seems at least to a lot of people to be very pattern-based um and some people have this intuition that um something more like reasoning uh is going to be needed where within the neural network it's just like you know understanding its environment and doing kind of these things like mcts playouts you know inside its head and that once you have this kind of thing you won't need to be exposed to that many patterns until you just like understand the nature of what's going on so i'm wondering like to what degree does this difference between pattern recognition versus like real reasoning playing your thinking and do you think that real reasoning is necessary and slash or do you think that we'll get it in things roughly like current neural networks um i think that yeah uh what you call real reasoning is definitely necessary for agi um having said that i think pattern recognition is maybe more fundamental than a lot of people give it credit for where when humans are thinking about very high level concepts we do just use a bunch of pattern recognition like when you think about i guess uh great scientists thinking in terms of metaphors or like intuitions from a range of different fields um you know einstein imagining thought experiments and so on mathematicians uh you know visualizing numbers in terms of like objects and you know the interactions between those objects like these all feel like types of um pattern recognition i'm kind of like gesturing towards the idea of this book by uh lake off called metaphors to live by which is you know basically arguing that just a huge amount of human cognition happens via these types of drawing connections between different domains now of course you need uh some type of explicit reasoning on top of that but uh it doesn't seem like a fundamentally it doesn't seem like the type of thing which necessarily requires new architectures or you know us to build it in explicitly there's another book which is uh great on this topic called the enigma of reason which is um kind of like fits these ideas together like explicit reasoning and pattern matching and basically argues that you know explicit reasoning is just a certain type of pattern matching it's the ability to uh match uh patterns that have the right type of justification um uh i'm not doing it credit here so uh you should you should check that out if you're interested but uh basically i i think that these two things are pretty closely related and there's not like some fundamental difference okay so moving on a bit in the report you talk about different ways agis could be um one of them is sort of you know you have this like single neural network maybe that's an agi which i think a lot of people are more familiar with and then you you also talk about um collective agis that are maybe compo maybe it's an agi civilization where the civilization is intelligent or something like that could you describe those and tell us like how how important are those to think about if people don't normally think about them yeah so i guess the core intuition behind thinking about a single agi is just that our training procedures work really well when you train end to end you take a system you give it a feedback signal and you just adjust the whole system in response to that feedback signal and you know what you get out of it is basically one 
system where all the parts have been shaped to work together by gradient descent so that's a reason for thinking that you know uh in terms of efficiency having one system is going to be much more efficient than you know having many different systems trying to work together um i think the core intuition for thinking about a collection of agis is just that it's very cheap to copy um a model after you've learned it and so if you are thinking about if you're trying to think about the effects that a single like building a single agi will have it seems like the obvious next step is is to say well you know you're going to end up with a bunch of copies of that because it's very cheap and then now you're going to have to reason about the interactions between all of those different systems i think that it probably doesn't make a huge difference in terms of thinking about the alignment problem because you know i don't think you're going to get that much safety from having a large collection of agis compared to just having one agi it feels like if if the first one is aligned then the others are gonna then the collection of many copies of it is gonna be as well and if it's not then they won't but it does seem pretty relevant to thinking about the dynamics of how agi deployment might go in the world okay cool and i guess one question i have related to this like like that sort of points to this idea that um instead of thinking just about each agi or you know each neural network or whatever being intelligent we should we should also think of the intelligence of the whole group um and i think uh so i have this colleague dylan hadfield mannell who's very interested in this idea that like when we're thinking of training human ai systems um well when we're thinking of training training ai's to do things that people want we should sort of think about like having a human ai system be rational in the sense of like conserving resources and like achieving that whole system should achieve the human goal so i'm wondering like um yeah what thoughts do you have about whether like kind of the locus of intelligence we should think of that as being single ai systems multitudes of ai systems or some combination of humans and ais i think that the best outcome is to have the locus of intelligence effectively being some combination of humans and ais to me that feels like more of a governance decision rather than an alignment decision uh in the sense that i expect that increasingly as we build the agis they're going to take on more and more of the uh cognitive load of whatever task we want them to do and so if there's some point where we're going to stop and say like no we're not going to um yeah we're going to try and prevent the agis from assuming too much uh control then i think fundamentally that that's got to be some sort of regulatory or governmental choice uh i don't think it's the type of thing that you'd be able to build in on a technical level is my guess and then when it comes to uh when it comes to many agis you know i think people do underrate the importance of in particular uh cultural evolution when it comes to humans where that's one way in which the locus of um intelligence is uh not in a single person it's kind of like spread around uh many people uh yeah so that that's some reason to suggest that the like emergent dynamics of um of having many agis will play an important role but you know i think most of the problem comes in talking about a single system and like if you can have guarantees about what a 
single system will do then you can work from there to to talk about the the multi-agent system a bit basically i think the claim is that having the multi-agent system makes things harder but uh we shouldn't some people have argued that kind of makes things easier that it means we don't need to worry about um you know the alignment problem for a single agent case and i think that i think that's incorrect i think uh you know start start off by solving it for the single agent case and then uh without relying on any sort of messy multi-agent dynamics to make things uh go well so we sort of talk uh hinted at this idea of an alignment problem or at various risks um something that people might want to know is like some people seem to think that if we had agi this would be a really big deal um do you think that's right and if so why yeah um i think that's right i think you know there's a sort of uh straightforward argument which is humans can do incredible things because of our intelligence um if you have agis then uh they'll be able to do uh incredible things either much faster or in a way that builds up on themselves um and this will lead to the types of scientific and technological advancements that we can't really even imagine right now um maybe as one thought experiment i particularly like uh imagine you know taking uh 10 000 scientists like 10 000 of the best scientists um and technologists and just accelerating them by a factor of like a thousand just like speeding speeding up the way they can think um speeding up the types of progress that they make and uh speeding up i guess their computers that they're working on as well um and like what could they do after um you know the equivalent of 500 or a thousand years uh well you know they would have some disadvantages right they wouldn't be able to um engage with the external world nearly as easily but i think if you just look back uh 500 years ago and you think like like how far have we come in 500 years through this process of scientific and technological advancement uh it feels like it's kind of wild to think about anywhere near the same amount of progress happening um over a period of uh years or decades um so that's yeah one intuition that i find particularly compelling okay seems like a pretty big deal uh when do you think we'll get hei and why do you think it'll take that long yeah so i don't have particularly strong views on this so to a significant extent i defer to um the report put out by a geocochra at open philanthropy which if i recall correctly has a median time of around 2045 or 2050. 
i think that report is a pretty solid baseline uh you know i don't think we should have particularly narrow credences around that point in time um i guess it feels like there are some arguments some like high level arguments that sway me but that ultimately kind of cancel out so one high level argument is just look there are a lot of cases in the past where um where people were wrong about uh were like wildly overconfident about the technology needing to take a long time so you know the classic examples being predictions of uh the wright brothers about how long they'd um take to build an airplane where they thought it was 50 years away i think you know two years before they built it um you know uh predictions by uh i think it was you know about like landing people on the moon uh i think lord kelvin thinking that the nuclear reaction was kind of uh moonshine and nonsense and that was only a few years before uh or maybe even after they'd uh made the most critical breakthroughs so you know all of these they feel like strong intuitions uh that like kind of way in the direction of not ruling out uh significantly earlier timelines on the other hand uh these do feel a little bit cherry-picked and it feels like there's a lot of um you know there are many other examples where people just like see one plausible path to getting to a point and just don't see the obstacles that are going to slow them down a lot because because the obstacles are just not as salient you know so uh probably people 50 years ago were wildly overconfident about how far rocket technology and nuclear power would get and these are you know pretty big deals um you know it would uh it would be very significant if we could do like asteroid mining right now or if we had like very cheap nuclear power but um they were just like a whole bunch of unexpected obstacles even after the sort of initial early breakthrough so you know these i think these situations kind of like weigh against each other and the main effect is just to make me pretty uncertain all right cool and a related question that people have so there's some sense of like well if it's some people think that we might need to make preparations before we get agi and if agi will come very suddenly um then we should probably do it now but if it'll be like very gradual and we'll get a lot of warning maybe we can wait until aja is going to come soonish to prepare to deal with it um so just on the question of how sudden agi might be do you think that it's likely to sort of happen kind of all at once without much warning or do you think that there's going to be this very obvious gradual ramp up i think there's going to be a noticeable and significant ramp up whether that ramp up uh takes more than a couple of years is hard to say so so i guess that the most relevant question here is something like how fast does the ramp up compared with the ability of say the field of machine learning to build consensus and change direction versus how how fast is it compared with the ability of uh you know world governments to um notice it and make appropriate actions and things like that and you know while i'm pretty uncertain about how long the ramp up might take uh it feels like pretty plausible that it's uh shorter than uh the other comparable ramp ups i just mentioned and so that feels like the decision-relevant thing if the world could pivot in a day or two to really taking a big threat seriously or really taking uh technological breakthroughs seriously then um that would be a different matter 
but as it is you know how long does it take for the world to pivot towards taking global warming seriously like a matter of decades um even coronavirus that took you know months or years to really um shift people's thinking about that so that's kind of the way i think about it there's no real expectation that people will react in anywhere near a rapid enough way to deal with this unless we start preparing in advance cool another question i have just in terms of what it might look like to get agi there are two scenarios that people talk about so one scenario sort of related to the idea of collective agis and stuff there's there's one strain of thought which says probably what will happen is around the time that we get agi probably like a bunch of research groups will figure it out at approximately the same time interval and we should imagine a world where there are tons of agi systems um you know maybe they're competing with each other or maybe maybe not but you know there are a bunch of them and we have to think about a bunch of them um there's some people you think look probably it's one of these things where either you have it or you don't to a pretty big extent and therefore there's going to be a significant period of time where there's just going to be one agi that's like way more relevant than any other intelligent system which of these scenarios do you think is more likely i don't think it's particularly discreet so i don't think you've either got it or you've not i think that the scenario that feels most likely to me although again with pretty wide error bars is that you have the gradual deployment of systems that are increasingly competent on some of the axes that i mentioned before so in particular like being able to act competently over longer time horizons and being able to act competently in new demands with fewer and fewer samples um or less and less data and so i expect to see systems rolled out that can you know do a wide range of tasks better than humans over a period of five minutes or a period of an hour or um possibly even over a period of a day um before you have systems that can uh do the sort of strategically uh vital tasks that take you know six months or a year things like you know long-term research or or starting a new company or you know making these um large-scale strategic decisions and then i expect there to be the sort of push towards increasingly autonomous uh increasingly efficient systems but then once uh we've we've got those systems starting to roll out i guess i think of it in terms of orders of magnitude so the difference between you know being able to act competently over a day versus being able to act confidently over a week versus uh six months or so is we should expect that going from a week to a couple of months is not much harder than going from a day to a week you know into each each time you sort of jump up an order of magnitude it feels like that's roughly speaking uh we should expect it to be a similar level of difficulty okay so does that look like a world where um there's this whole field you know one person out of the field managed to figure out how they're to get their ai to plan on the scale of a year and the rest of them figured out how to have their ai's plan on the scale of like a month or two and um maybe maybe the one that um can plan on the scale of a year is just like way more relevant to the world than the rest of them yeah that seems right okay cool so we've mentioned this idea that like maybe well your report is titled agi 
safety from first principles and we've talked a bit about the things like the problem of alignment how good or bad should we expect agi to be in terms of uh its impact on the world and what we care about yeah so it feels like the overall effect is going to be dominated by the possibility of these extreme risk scenarios to me uh it feels like yeah a very it's hard to know what uh uh what's going on in expectation because you have these like you know very positive outcomes where we've um you know solved scarcity and like understand how to build societies in much better ways than we currently do and can kind of like make a brilliant future for humanity and then also the ones where we've ended up with uh catastrophic outcomes because we've built misaligned agis so yeah how's it hard to say on balance what it is but it feels like a a point in time where you can have a big influence by like trying to swing it in a positive direction okay so so why think that it seems like you think there's a good chance that we're gonna build agi that um like like does it kill everyone does it enslave everyone or does it make everyone's lives 10 worse like yeah i think um probably the way i would phrase it is that the agi gains power over everyone and uh power in the sense of just like being able to decide how the world goes and then you know at that point in time it feels pretty hard for me to say exactly what happens after that but it feels like by the time we've reached that point we've screwed up and when i say like having power what i mean is like having power in a way that doesn't you know defer that power back to humans okay and what do you think like uh how afraid of that should we be because i guess there's some world where what that looks like is um you know maybe i should imagine that we we have something like the normal economy except everyone pays like you know 50 of their income to the agi overlords or something and the agi gets to you know colonize a whole bunch of stuff and it buys a whole bunch of our labor like uh is this something should we be like super afraid of that or yeah i think um we should be uh pretty afraid in like a pretty comparable way to how if you describe to gorillas humans taking over the world um they should be pretty afraid now it's true that like there's some chance that you know the gorillas get a happy outcome like maybe we're particularly altruistic and we you know are kind to the gorillas i don't think there was any you know strong reason in advance to expect that humans would be kind to gorillas and in fact you know there have been many cases throughout history of you know humans driving other species to extinction just because we had power over them and we could so broadly speaking a system that chooses or like a set of systems that choose to gain power over humans that's even just from that fact we can probably infer that uh we should be pretty scared okay so the thing about agi is that it's it seems like it's probably going to think that people make or at least people make the things that make it or whatever why there's some sort of argument which says look probably agi won't we like we won't build like really terrible agi systems that take over the world because if it seems like if anyone you know foresees that they won't do it because they don't want the world to be taken over i'm wondering what you think of this argument for you know things will be fine because we just won't do it yeah so i think in some sense the core problem is that we don't understand what 
we're doing when we build these systems to anywhere near the same level as we understand what we're doing when we build an airplane or when we build a rocket or something like that so yeah in some sense if you step out and you're like why is ai different from other technologies one answer is just like we don't have a you know scientific principled understanding of how to build ai uh whereas you know for a lot of these other technologies we do now we could say well let you know let's experiment until we uh until we get that and then i think the answer to that would just be uh if you really expect agi to be such a powerful technology then you might just not have that long to experiment like maybe you know maybe you have months or years uh but like that's not really very long in the grand scheme of things when you're trying to figure out how to make a system safe and in particular a type of technology that's like very new and very powerful uh it feels like yeah if if we if we really knew what we were doing uh in terms of like us designing the systems that i feel much better but as it is it's more like our optimizers our training algorithms are the ones designing the systems and we're just kind of like setting it up and then uh letting it letting it run a lot of the time okay but why why should we expect people to like well like it seems like this story there's still some point where somebody says look i'm going to set up this system and it's going to there's a really good chance it's going to turn into an agi and i don't really understand agis and there are these you know good arguments that i've heard on this podcast that i listen to for how they're going to be powerful and they're going to be dangerous but i'm going to press the button anyway uh why why do they press the button isn't that irrational so one answer is just that uh they don't really believe the arguments right and i think it's easy to not believe the arguments when uh the thing you're postulating is this like qualitatively different behavior from the other systems that have come before uh jacob steinhardt has an excellent series of blog posts recently talking about sort of emergent changes in behavior as you scale up ai systems and i think it's kind of like it does seem a little bit crazy that as you just like make the systems more and more powerful you're not like really changing the fundamental algorithms or so on you do get these uh you know fundamental shifts in what they're able to do so not just you know performing well in a narrow domain then performing well in like a wide range of domains uh and then you know maybe the best example so far is gpt3 being able to do few shot learning just by being given prompts like that's the sort of thing and and that's like you know much smaller than the types of change you expect as you scale systems up from current ais to agis so yeah plausibly people just don't intuitively believe the scale of the change that they'll expect to see uh and then the second argument is you know maybe they do but um there are a bunch of competitive motivations that they have you know maybe uh they're worried about economic competition maybe they're worried about geopolitical competition uh you know it seems pretty hard to talk about these very large scale things decades in advance but that's the sort of fundamental shape of an argument that i think is pretty compelling like you know if people are worried about getting to a powerful technology first then they're going to cut corners okay so now i'd like 
to talk i guess a bit about the technical conversation about making these agi systems safe so uh you've recently had this um conversation with earlier sir yudkowski um this an early proponent of the idea that there might be existential risk from agi and one thing that came up is this idea of um of goal seeking or goal directed agents where i think eliezer had this point of view that was very very very focused on this idea that we're going to get these ai systems that they're going to have very coherent behavior in pursuit well that they will sort of coherently shape the world in a certain way that's kind of how he thinks of um he kind of thinks that there's just one natural way to coherently shape the world for simple goals and it seems like in your point of view you think about these uh abilities of agents like self-awareness uh ability and tendency to plan um picking actions based on their consequences uh how sensitive they are to scale how coherent they are and how flexible they are and in your view it seems like you sort of think about agents as sort of sliders on these fronts maybe we'll have a low value of on one of the sliders i'm wondering what your thoughts are about i guess the relationship between these ways of thinking and where you would rely on one versus the other so i think the first thing to flag here is that eliezer and i agree to a much greater extent than you might expect from just reading the debate that we had so i think he's yeah this core idea that he has that um there's something you know fundamentally dangerous about intelligence uh is something that i buy into um i think the particular things that i want to flag on that front are this idea of instrumental reasoning the idea that you know in order to you know achieve a goal like you know suppose you're an obedient ai you're a human asks you to achieve a goal or you have to reason about how to like what intermediate steps you're going to take uh you need to be able to sort of plan out sub goals and then achieve those sub goals and then go on and achieve this far goal and it feels like that's you know very core to the to the concept of general intelligence uh and in some sense you've just got this core ability and we're like but also please don't apply it to humans like please don't reason about humans instrumentally please don't make sub goals that involve persuading me or manipulating me and i think eliaser is totally right that you know as you scale this ability up then it becomes increasingly unnatural to have the sort of exception for humans and in the instrumental reasoning that the system is doing you've got this pressure towards you know doing instrumental reasoning towards achieving outcomes and which like in some sense what alignment is trying to do is carve out this special zone for human values which says no no no like don't do the intelligence thing at us in in that way i'd also like another sort of core idea here is the idea of um abstraction the idea that you know intelligence is very closely linked to being able to sort of zoom out and see a higher level pattern and again that's uh something that when we want to build bounded agents we want to try and avoid we want to say uh you know we've given you a small scale goal or like we've trained you to achieve a small scale goal please do not abstract from that into wanting to achieve a much larger scale goal and again we're trying to carve out an exception because most of the time we want our agents to do a lot of abstraction we want them to be thinking 
in like very creative flexible ways about how the world works it's just that in the particular case of the goals we give them we don't want them to generalize in that way okay so i think i think that's kind of like my summary of some of these ideas that i've gotten from eliezer at least where he he has this concept that you know in the limit once you push towards the heights of great greater general intelligence it becomes incredibly hard to prevent these capabilities from being directed at us in the ways that we don't want it feels like maybe the core disagreement i had is something like how tight is this abstraction in the sense of how much can we trust that these things are correlated with each other as not just in the regime of like highly super-intelligent systems but also in the regime of systems that you know are slightly better than humans or like even noticeably significantly better than humans at doing a wide range of tasks like intellectual research like jobs in the world like doing alignment research things like that and i guess my position is is just you know either i don't understand why he thinks that his claims about the limit of intelligence are also going to continue continue to apply for the relevant period as we approach um as we move towards that point or else maybe he's trusting too much in the abstraction and failing to see ways in which reality might sort of be messier than he thinks is going to be the case okay so so am i right to summarize that as you thinking that like indeed in there's some sort of limit where in order to be kind of generally competent in order to have like if you have the things that we call intelligence then you sort of think about humans you're it's very hard for you to not think about humans instrumentally and like abstract your goals to you know to larger scales and whatever but like maybe near human level we can still do it and it'll be fine that's right and we might hope that in particular um one argument that feels pretty crucial to me is this idea that humans are in fact uh let's say our comparative advantage is towards a whole bunch of things that are not particularly let's say aligned from the perspective of other species so we we have a strong advantage at doing things like expanding to new areas like gathering resources uh like hunting and fighting and so on we're not very specialized at things like doing mathematical research reasoning about um alignment uh reasoning about how to uh you know uh reasoning about economics for example in ways that make our societies better and so uh and that that's just because of you know the environment in which we evolved and so it seems very plausible to me that as we train ais to become increasingly general and generally intelligent you know eventually they're gonna surpass humans at all of these things but uh the hope would be that they surpass humans at the types of things that are most useful and least worrying before they surpass humans at the types of things that uh the sort of like power seeking behavior that we're most worried about so i guess yeah a question of like differing comparative advantages even though you know eventually once they get sufficiently intelligent uh they'll outstrip us at all of these things yeah so so one concept that i think is lying in the background here is this idea of a pivotal act so a listener might listen to that and think like well it sounds like uh you're saying that um you know for a while we'll have slightly super intelligent ai and that'll be fine but 
then when we get the really super intelligent ai that will kill us and why should i feel comforted by that yeah um so some people have this idea that like look when we get really smart ai step one is to use it to do something that means that we don't have to worry about uh risk from artificial general intelligence anymore and people tend to describe kind of drastic things to give listeners a sense i think in this uh conversation with yukowski yukowski gave the example of melting every gpu on earth i'm wondering yeah how much do you buy the idea of kind of uh a pivotal act being necessary and do you think that um being intelligent enough in the way that you can do some kind of pivotal act is compatible with like the kinds of intelligence where you're coherent enough to you know achieve things of value but you know you forgot to treat humans as uh you know instrumental things to be manipulated in service of your goals so i think that there are certainly you know i don't think we have particularly strong candidates right now for ways in which you can use an agi to prevent scaling up to dangerous regimes i think there are um plausible things that seem worth exploring that are maybe a little bit less dramatic sounding than elias's example so in the realm of alignment research you might have a system that can make technical progress on you know mathematical questions uh of the sort that um that are related to ai uh alignment research uh you could have systems so that that's kind of yeah or what one option is like automating the the kind of like theoretical alignment research another option which is associated with proposals like amplification and reward modeling and debate is just to use these systems to automate the empirical practical side of alignment research by giving better feedback and then on the governance side i think just having a bunch of highly capable ais in the world is going to prompt governments to take uh risks from agi a lot more seriously and i don't think that the types of action that would be needed to slow down scaling towards dangerous regimes are actually that discontinuous from the types of things we see in the world today so you know for example global monitoring of uranium and uranium enrichment uh to prevent proliferation of nuclear weapons i think uh indeed there's like a lot of cultural pressure against things like building nuclear power and a wide range of uh other technical technological progress that's that's seen as dangerous so i feel uncertain about how like difficult or extreme governance interventions would need to be in order to actually get the world to think hey let's slow down a bit let's be like much more careful but to me it still feels plausible that pivotal action is a little bit of a misnomer it's more just like the world handling the problem as the world becomes more sort of wakes up to this the scale and scope of the problem okay and so it seems like in your way of thinking the thing that uh stops the slide to you know you know incredibly powerful uh unfriendly agi is we we get a agis to help do our ai safety research for us is that about right as well as a bunch of governance actions right um yeah so that feels like you know the sort of default proposal that i'm excited about but you know like in in terms of like the specifics of the proposals it's it's you know i can't point to any particular thing and say you know this one is uh one that i think would like work right now sure sure i i guess if i think about that it seems that um like if i 
imagine effective safety research it seems to involve both like pretty good mean sense reasoning like you have to reason about the systems that we're going to deploy and what they're going to do and maybe they're going to get roped into helping with the safety research they have to think about how they'll research or have some invariants about that that are going to be maintained or something so yeah you have to have pretty good sense reasoning and you also have to be thinking about uh humanity in order to know what counts as safety research versus like you know research to ensure that the agi is blue or some other round of property that nobody cares about and i think there's some worry that like look as long as you have that amount of like goal orientation in order to like get you doing like really good research and like that amount of awareness of humans there's i think there's some way that like that combination of attributes is itself enough to you know think about humans instrumentally in the kind of dangerous way i'm wondering what you think about that so i agree that's an argument as to why we can't just take a system say go off please solve the alignment problem and then have it just come back to us and give us a solution so i think yeah in some sense uh many of the alignment proposals that are on the table today are ways of trying to mitigate the things you're discussing and i don't think that yeah it's hard to say like how much of a disagreement there is here because you know i i do think you know all of the things you said are just reasons to be worried but then it feels like i think i think this partly ties back to the thing i was saying before about like the core problem being that we just don't understand the way that these systems are developed it's so it's less like you have to do a highly specific thing with your system in order to make it go well and more like you just need to have more of a like either you need to like have a deeper understanding which is kind of like a bit more like intellectual work than going out and doing stuff in the world um or maybe you need to just like scale up certain kinds of supervision which again like i don't currently see the reasons why this is necessarily infeasible it feels like uh there's a lot of there's a lot of scope here for promising proposals okay so yeah let's move on to this thing we've mentioned a couple of times now uh called alignment uh what first of all what do you mean by alignment or misaligned ai systems to a first approximation when i say aligned i mean obedient so uh a system that when you give it an instruction it will follow the intention of that instruction and then come back and and like wait for more instructions it's kind of like roughly what i mean and then when i mean misaligned roughly what i mean is power seeking a system that is trying to gain power over the world for itself either in order to achieve some other goal that um you know humanity uh wouldn't want or just for the sake of you know in the same way that some humans are sort of intrinsically power-seeking you might have an ai that's intrinsically power-seeking so yeah that's those are the two sort of uh more specific concepts i usually think about okay how big a role does you know solving the technical problem of making aisle aligned uh do you think that that has like a massive role in making our agi safe a minor role um but like do you think it's basically the whole problem or like 10 of the problem or something um like is the question something 
like what proportion of the difficulty is alignment versus governance work yeah or just any i think people there are some people who think that like oh you'll need to do alignment and also some other stuff uh i don't know if it's governance i don't know if it's like you know governance of thinking about how the ais are going to interact and maybe some people really want them to coordinate you know there might be a variety of things there yeah so i guess i feel like it's most of the problem but nowhere near all of it like if i think about the ways in which i'm concerned about existential risk from advanced ai probably the split is something like 50 worried about alignment 25 worried about sort of governance failures related to alignment and then 25 worried about just straightforward misuse or you know conflicts that are sparked by um major technological change okay so a concern that i think some people have with the idea of ai alignment is this fear that like oh we're gonna we're gonna create this ability for these really powerful people to create ai systems uh that just do what they want that are that are super obedient to them and i think some people have this worry that like oh we've we've just like we've just created this effective machine to turn people into tyrants um or you know totalitarian overlords how worried are you about this i think that yeah it seems pretty worrying that advances in technology lead to dramatic increases in inequality and part like a big part of ai governance should be setting up structures and systems to ensure that these systems if they're aligned are then used well so i think uh there's work on you know like kalinoki for example has a paper on windfall clauses which talks about the ways in which you might try and redistribute benefits from agi uh more broadly i think there's uh various things to do with uh you know like not just corporate governance but also like uh global governance that like uh yeah i think there's a bunch of unsolved questions there ultimately it's not clear to me what the alternative to addressing these questions is i think that uh yeah it would it would be nice if we can kind of like mitigate all of these problems at once but it feels like we're just gonna have to like do the hard work of you know setting up uh as many structures and safeguards as we can sure i so i guess some people might think like look the alternative is try and think hard until you come up with a plan other than uh create an ai that's really aligned to an individual or maybe like you know come up with technical plan to create an ai that's aligned with humanity or objective moral truth or something i guess uh i feel pretty pessimistic about not just having to solve alignment but also having to solve morality for example um it feels like uh i i'm i don't think we necessarily want when i say obedient it doesn't need to be the case that the system is obedient to a given individual right probably uh you want to train it so that it's obedient to ideally speaking to individuals who have the right um you know certifications such as being democratically elected or i think things like that um and with it hopefully within certain bounds of like what actions they're not meant to take so like you know in an ideal world if i could have uh all the alignment decision rather i wanted um i i set that up uh in a much better way but i do think that you know this core idea of obedience uh it feels valuable to me because yeah like uh the problem of politics is hard the problem of ethics 
is hard uh you know i don't think uh it's if you can solve the problem of of making a system obedient then we can kind of like try and leverage all the existing solutions we have to uh governance things like uh you know like all the systems and structures that have been built up over time uh to try and figure out how we're gonna deploy these advanced technologies i i don't want people to try and like have to reason these things through from first principles uh while they're trying to um build aligned agis okay i guess another thing that you mentioned ati safety from first principles is transparency is this important um part of a agi alignment um so could you first say perhaps briefly like what role you see transparency research is playing i can speak i i'll speak differently for like the the wider field and then for myself so i think in the wider field it's playing a pretty crucial role right now in terms of transparency is in some sense one of the core underlying drivers of many proposals for alignment i'd say the other core driver here is just using more human data just like trying to get more feedback from uh humans um so that we can use that to nudge systems towards fulfilling human preferences better so yeah a lot of a lot of research agendas are kind of assuming a certain amount of uh interpretability or transparent transparency i'm using those interchangeably um for my own part you know i defer to other people uh to some extent on how much progress we're gonna make because it does seem like there's been pretty impressive progress so far i feel a little confused about how we could how work on interpretability could possibly scale up as fast as we're scaling up uh the most sophisticated models when i think about trying to understand the human brain to a level that's required to figure out if a thought that a human is having is like good or bad that seems very hard people have been working at it for a very long time and now of course there are a bunch of big advantages that we have when we do interpretability research on neural networks like we have full read-write access to the neurons but we also have a bunch of disadvantages as well like the architecture changes every couple of years so you have to switch to a new system that might be built quite differently and you don't have you know introspective access or necessarily very much cooperation from the system as you're trying to figure out what's going on inside it sure so you know on balance i feel personally kind of pessimistic but at the same time uh it seems like something that's we should try and make as much progress on it as we can it feels like very robustly good to just know what's going on within our systems yeah so if you're pessimistic um do you think we can do without it or do you think we'll need it but it's just very hard and it's unlikely we'll get it i guess it feels like so i have a pretty broad range of credences over how hard the alignment problem might be um you know there's a reasonable range in which interpretability is just necessary or something equivalently powerful is necessary you know there's also ranges in which it isn't and i think i focus a bit more on the ranges of difficulty in which it's not necessary just because those feel like the most tractable sure places uh to pay attention so i don't i don't really have a strong estimate of how um you know the ratio between those let's say okay sure and another thing i guess to talk about is this idea of ai cooperation um so i've uh i guess vincent konitzer 
has actually recently started a seminar series on this you hear people having this sense of like oh look yeah i believe andrew critch wrote a thing basically arguing that look even if you solve ai alignment you're going to have you know a whole bunch of ai systems and solving the problem of making sure that their interaction doesn't generate externalities which you know are dangerous to humans or something is even harder than solving the alignment problem hey a quick note from future daniel to the best of my knowledge there is no written piece where andrew critch makes this argument he has said via personal communication that he prefers to not debate which problem is harder and that he would like people working on both so i'm wondering yeah what do you think about the worry about making sure that ais coordinate well i guess it feels like mainly the type of thing that we can outsource to ais once they're sufficiently capable you know i don't see a particularly strong reason to think that systems that are comparably powerful as humans or more powerful than humans are going to make obvious mistakes in how they coordinate yeah so i think again you have this framing of like ai coordination but we could also just say politics right like we think that geopolitics is gonna be hard in a world where ais exist and when you have that framing you're like you know geopolitics is hard but also we've made a bunch of progress compared with a few hundred years ago when there were many more wars it feels pretty plausible that a bunch of trends that have led to less conflict are just going to continue and so i still haven't seen arguments that make me feel like this particular problem is incredibly difficult as opposed to arguments which i have seen for why the alignment problem is plausibly incredibly difficult all right so i guess i'd like to get back to a thing we talked about a bit earlier which is this question about how much we should think about the optima of various reward functions or like the limit of intelligence or something where i see you as thinking that people focus perhaps too much on that and that we should really be thinking about selection pressures during training i'm wondering what mistakes do you think people make when they are thinking in this framework of optima yeah so i think there are kind of two opposing mistakes that i think different groups of people are making so for eliezer and miri more generally it really does feel like they're thinking about systems that are so idealized that the applicability of insights about them to the first agis we build or the ones that are a little bit better than humans at making intellectual progress is pretty limited and you know there are a bunch of examples of this i think aixi is kind of dubious one of the metaphors i use sometimes is that it's like trying to design a train by thinking about what happens when a train approaches the speed of light you know it's just not that helpful so that's like one class of mistakes that i'm worried about also related to the idea that you can just extrapolate this idea of intelligence to the infinite limit like there is such a thing as perfect rationality for example or that the limit of approaching perfect rationality makes sense so that's one
mistake i'm kind of worried about i think on the other hand it feels like a bunch of more machine learning focused alignment researchers don't take this idea of optimization pressure seriously enough and so it feels like there are a few different strands of research that just reroute the optimization pressure or block it from flowing through the most obvious route but then just leave other ways for it to cause the same problems probably the most obvious example of this to me is the concept of myopia which evan hubinger especially is pretty keen on and a few other researchers as well and to me it seems like you can't have it both ways if you've got a system that is highly competent then there must have been you know some sort of pressure towards it achieving things on long time frames even if it's nominally myopic and i haven't seen any particular insight as to how you can have that pressure applying without making the agent actually care about or pursue goals over long time horizons i think the same thing is true with cooperative inverse reinforcement learning where it feels to me like it just kind of reroutes the pressure towards that sort of instrumental reasoning in a slightly different way but it doesn't actually help in preventing that pressure from pushing towards misaligned goals so yeah i think the key thing that i would like to see from these types of research agendas and to some extent also stuff from arc like imitative generalization and eliciting latent knowledge is just a statement of what the core insight is or what the core reason is why this isn't just reframing the problem or which part is actually doing the work of preventing the optimization pressure towards bad outcomes because i feel pretty uncertain about that for a lot of existing research agendas yeah let's talk about cooperative inverse reinforcement learning actually which is an agenda i'm relatively more familiar with so i think the idea as i understand it is something like look we're going to think about human ai interaction as you know the human has a goal and the human ai system somehow has to optimize for that goal and so the alignment is sort of coming from it being instrumentally valuable for the ai system to figure out what the human goal is and then optimize for it and so you get alignment because the ai is doing what you want in that way i'm wondering where do you think this is missing the idea of optimization pressure so the way that stuart russell often explains his ideas is that the key component is making an ai that's uncertain about human preferences yep but the problem here is that the thing that we really want to do is just point the ai at human preferences at all like make it so that it is in fact optimized in the direction of fulfilling human preferences and whether or not it actually has uncertainty about those things about what humans care about is kind of just this intermediate problem where the fundamental problem is what signal are we giving it that points it towards human preferences or what are we changing compared with a sort of straightforward setup where we just give it rewards for doing well and penalize it for doing badly and one thing you might say is like look the thing we're changing is that we have the model of human rationality like we have some assumptions about
how the human is choosing their actions and that could be the thing that's doing the work but i haven't seen any model that's actually making enough progress on this that it's plausible that that's doing the heavy lifting if you will like most of the models of human rationality i've seen are very simple ones like noisily rational or things like that and so if that model isn't doing the work then what actually changes in the context of cirl that points the ai towards human preferences any better than a straightforward reward learning setup and that's the thing that i don't think exists
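Editorial aside: a minimal sketch of the "noisily rational" human model mentioned above, for readers who want to see where the work gets done in CIRL-style reward learning. Under a Boltzmann-rationality assumption, observed human choices update a posterior over candidate reward functions. The candidate rewards, the rationality coefficient, and the toy choices below are illustrative assumptions, not anything specified in the episode.

```python
# A sketch (not from the episode) of inferring a reward function from a
# "noisily rational" human: the human picks an action with probability
# proportional to exp(beta * reward of that action), and we do a Bayes
# update over candidate reward hypotheses given the observed choices.

import math

# Hypothetical candidate reward functions over three possible actions.
CANDIDATE_REWARDS = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.2, "nothing": 0.0},
    "likes_tea":    {"coffee": 0.2, "tea": 1.0, "nothing": 0.0},
    "indifferent":  {"coffee": 0.5, "tea": 0.5, "nothing": 0.0},
}

BETA = 3.0  # rationality coefficient: higher means closer to perfectly rational

def boltzmann_likelihood(action: str, reward: dict, actions: list) -> float:
    """P(human picks `action`) under the noisily rational model of the human."""
    exps = {a: math.exp(BETA * reward[a]) for a in actions}
    return exps[action] / sum(exps.values())

def posterior_over_rewards(observed_choices: list, actions: list) -> dict:
    """Bayes update over candidate rewards given observed human choices."""
    posterior = {name: 1.0 / len(CANDIDATE_REWARDS) for name in CANDIDATE_REWARDS}
    for choice in observed_choices:
        for name, reward in CANDIDATE_REWARDS.items():
            posterior[name] *= boltzmann_likelihood(choice, reward, actions)
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

if __name__ == "__main__":
    actions = ["coffee", "tea", "nothing"]
    # The human picks coffee twice and tea once; under the noise model this is
    # still consistent with several hypotheses, just with different weights.
    print(posterior_over_rewards(["coffee", "coffee", "tea"], actions))
```

The point being made in the interview is that an inference like this is only as good as the rationality model plugged into the likelihood; with a crude model, uncertainty over preferences does not by itself point the system at human preferences any better than ordinary reward learning.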
okay cool so i guess on the other side of it in terms of people who focus on optimality more than you might in my imagination there's this reasoning that goes something like look when i think about super optimal things that ai systems could do in order to achieve some goal as long as i can think of some property of an optimal system it would be really surprising if whatever ai system we train to do some goal even if it's as competent as i am can't think of that thing and therefore any thoughts i can generate about properties of optimal ai systems there's going to be some selection pressure for that just because in order to get something that's more competent than me at achieving goals it'll have a similar sort of reasoning ability to me i'm wondering do you disagree with that first of all and if you do disagree with that why and if you don't maybe can you generate a stronger statement that you hear that you disagree with so suppose we're thinking about the first system that is better than humans at reasoning about optimality processes and how they relate to ai alignment it seems like our key goal should be making sure that system is using its knowledge in ways that are beneficial to us so you know maybe telling us about it or maybe using that knowledge to design a more aligned successor or something like that but that system itself it seems very unclear the extent to which we can reason about that system via talking about optimality constraints yeah so it really feels like my thinking here is about handing over to ai systems that are aligned enough that they can do the heavy lifting in the future and i don't think it's necessarily a bad strategy to focus on just solving the whole problem in one go but i do think that going about it via thinking in precise technical terms about the sort of limit of intelligence seems a little bit off so i guess there's this thought that's like look maybe i can think of some strategy that something at the limit of intelligence might take or some idea that like oh well if you were really smart you would try to treat humans instrumentally or something and i think there's some concern that like look if i can think of that then sort of definitionally i guess maybe not quite definitionally but almost definitionally something smarter than me can think of it and so it seems like there's some instinct that says look once you have systems that are smarter than me if these sorts of ideas actually help it achieve goals or do whatever it's being selected to do then that should happen do you agree with that argument do you think that it doesn't get you the kind of reasoning you regard as worrisome or well it seems like the key question is whether those systems will in fact want to achieve goals that we're unhappy about okay and it feels like the ways in which their motivations are shaped like whether that system will decide to do things that are bad for humans or not is going to be the type of thing that's shaped in a pretty messy way by a bunch of gradient descent on a bunch of data points in kind of the same way that you know i am a human i'm reasoning about very intelligent future systems and trying to figure out which ones to instantiate but the way in which i'm choosing which ones to instantiate is very dependent on these sort of gut emotional instincts that were honed over a long period of evolution and childhood and so on and those emotions and instincts are things that feel very hard to reason about in a precise technical manner okay so the idea is that sure an ai system can maybe figure out various strategies anything that i can think of yeah but maybe we have reason to think that we can have some sort of training process that maybe in the limit would produce an ai that wanted to do stuff that the thing it's supposed to be aligned to disapproves of but actually it's just very unlikely to come up with an ai that has those sorts of desires and therefore that's right you don't have to worry about this kind of reasoning that's right even though if it did have those desires it would do better on the outer objective we just won't train the overall objective to optimality right and you know that's in some sense the reason that we're not worried about dogs wanting to take over the world even though we've done artificial selection on dogs for a while and in theory at least the signal that we're sending to them as we do artificial selection incentivizes them to take over the world in some sort of stylized abstract way but in practice that's not the direction that their motivations are being pushed towards and we might hope that the same is true of systems even if they're a bit more intelligent or significantly more intelligent than humans the extreme optima of whatever reward function we're using are just not that relevant or salient okay so moving on a bit you had this conversation with eliezer yudkowsky you know trying to really formalize your disagreements it seems to me i'm wondering if any of that conversation changed your mind and if you can say how it did i think yeah the two biggest things that i took away were number one as i tried to explain my views about ai governance to eliezer i realized that they were just missing a whole bunch of detail and nuance and so any optimism that i have about ai governance needs to be grounded in much more specific details and plans for what might happen and so on and that's led to a bunch of recent work that i'm doing on formulating a bit more of a governance research agenda and figuring out what the crucial considerations here are so that was one thing that changed my mind but that was a bit more about just me trying to flesh out my own views sure i think in terms of eliezer's views i think that i had previously
underestimated how much his position relied on a few sort of very deep abstractions that kind of all fit together like i don't think you can really separate his views on intelligence his views on consequentialism or agency his views on recursive self-improvement things like that you can kind of look at different parts of it but it seems like there's this underlying deep-rooted set of intuitions that he keeps trying to explain in ways where people pick up on the particular thing he's trying to explain but without having a good handle on the overall set of intuitions so one particularly salient example of this is that he keeps talking about utility functions or previously he talked a lot about utility functions and then myself and rohin shah and a bunch of other people tried quite hard to figure out what specific argument he was making around utility functions and we basically failed because for him it feels like in some sense i don't think that there's a specific precise technical argument that you can make with our current understanding of utility theory that tells us about agis i think it's much more like you don't have a specific argument about utility functions and their relationship to agis in a precise technical way instead utility functions are like a pointer towards the type of later theory that will give us a much more precise understanding of how to think about intelligence and agency and agis pursuing goals and so on and to eliezer it seems like we've got a bunch of different handles on what the shape of this larger scale theory might look like but he can't really explain it in precise terms in maybe the same way that for any other scientific theory before you've really latched onto it you can only gesture towards a bunch of different intuitions that you have and be like hey there are these links between them that i can't make precise or rigorous or formal at this point okay sort of related to updates if there is one belief about existential risk from ai that you could create greater consensus about among the population of people who are professional ai researchers who have thought carefully about existential risk from ai for maybe more than 10 hours what belief would it be probably the main answer is just the thing i was saying before about how we want to be clear about where the work is being done in a specific alignment proposal and it seems important to think about having something that doesn't just shuffle the optimization pressure around but really gives us some deeper reason to think that the problem is being solved where one example is that when it comes to paul christiano's work on amplification i think one sort of core insight that's doing a lot of the work is that imitation can be very powerful without being equivalently dangerous so yeah this idea that instead of optimizing for a target you can just optimize to be similar to humans and that might still get you a very long way and then another related insight that makes amplification promising is the idea that decomposing tasks can sort of leverage human
abilities in a powerful way now i don't think that those are anywhere near complete ways of addressing the problem but they sort of gesture like this is where the work is being done whereas for some other proposals i don't think there's an equivalent story about what's the deeper idea or principle that's allowing the work to be done to solve this difficult problem
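Editorial aside: a toy sketch of the task-decomposition idea behind amplification that is pointed to here. The decomposition rule, the "human model", and the example question are illustrative stand-ins rather than anything specified in the episode; the only point is that the top-level answer is assembled from imitative steps over subquestions rather than produced by direct optimization for an outcome.

```python
# A toy sketch of amplification-style task decomposition: a hard question is
# split into subquestions, each is answered by a (stand-in) human-imitating
# step, and the answers are recombined at the level above.

from typing import Callable, List

def amplify(question: str,
            decompose: Callable[[str], List[str]],
            human_model: Callable[[str, List[str]], str],
            depth: int) -> str:
    """Answer `question` by recursive decomposition down to `depth` levels."""
    if depth == 0:
        # Base case: answer directly with the imitative human model.
        return human_model(question, [])
    subquestions = decompose(question)
    sub_answers = [amplify(q, decompose, human_model, depth - 1)
                   for q in subquestions]
    # Recombine: the human model answers the top question given sub-answers.
    return human_model(question, sub_answers)

# --- purely illustrative stand-ins ---
def toy_decompose(question: str) -> List[str]:
    return [f"What facts bear on: {question}?",
            f"What could go wrong with a naive answer to: {question}?"]

def toy_human_model(question: str, sub_answers: List[str]) -> str:
    context = " | ".join(sub_answers) if sub_answers else "no sub-answers"
    return f"answer({question}; using {context})"

if __name__ == "__main__":
    print(amplify("Should we deploy this system?", toy_decompose,
                  toy_human_model, depth=2))
```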
maybe a second thing is that in my more recent work about the alignment problem i've personally been moving a little bit away from the term optimizer or from talking about a clean distinction between the outer alignment problem and the inner alignment problem yeah i think it was an incredibly important idea when it first came out because it helped us clarify a bunch of confused intuitions about the relationship between reward functions and goals for intelligent systems for example but i think at this point we might be at a point where we want to be thinking about what's the spectrum between a system that's purely guided by a reward function versus a system like a policy that has been trained on a reward function but now makes no reference to the reward function at all i think these are two extremes and in practice it seems unlikely that we're gonna end up at either extreme because reward functions are just very useful so yeah we should try and be thinking about what's going on in the middle here and sort of shaping our arguments accordingly another way of thinking about that is which is the correct analogy for reinforcement learning agents is it the case that gradient descent is like evolution or is it the case that gradient descent is like learning in the human brain and the answer is kind of a little bit of both it's gonna play a role that's intermediate between these two things or has properties of both of these and we shouldn't treat these as two separate possibilities but rather as two gestures towards what we should expect the future of ai to look like okay i guess a sort of related question i think by now there's some community of people who are like you know i really think that agi poses an existential risk and that it's really dangerous and they're doing research about it and there's some intellectual edifice there what do you think the strongest criticisms of that intellectual edifice around the idea of agi existential risk are that deserve more engagement from the community of researchers thinking about it i think the strongest criticisms used to be insufficient engagement with machine learning and this has mostly been addressed these days i think plausibly another criticism is just that we as a movement probably haven't been as clear as we could be in communicating about risks yeah so probably this is a slightly boring answer but you know i think there aren't that many explanations for example of the ai alignment problem that are short accessible aimed at people who have competence with machine learning and compelling like i think when you point people to superintelligence well it doesn't engage with machine learning when you point people to something like human compatible it actually doesn't spend very much time on the reasons why we expect risks to arise so i think that there's some type of intellectual work here that was kind of nobody's job for a while i think that agi safety from first principles was kind of aimed at filling this gap and then also more recent work by for example joe carlsmith from open philanthropy with a report called is power-seeking ai an existential risk or something like that but i still think that there's a bit of a gap there and i think it's a gap that's profitable not just in terms of reaching out to other communities but also just for having a better understanding of the problem that will make our own research go better i think that a lot of disagreements that initially seem like disagreements about which research is promising are actually more like disagreements about what the problem actually is in the first place yeah so that's kind of my standard answer yeah and i guess sort of the converse of that question you've worked at a few different ai research organizations with a bunch of people who are working on just making ai more capable and better at you know reasoning and pattern matching and such but not particularly working on safety in terms of preventing existential risks why do you think that they don't have that focus it varies a bunch by person i'd say a bunch of people are just less focused on agi and more focused on pushing the field forward in a reasonably straightforward incremental way and i think there's a type of emotional orientation towards taking potentially very impactful arguments very seriously and i think that a lot of people are quite cautious about those arguments and you know to some extent rightfully so because it's very easy to be fooled by arguments that are very high level and abstract so it's almost like there's this habit or this predilection towards thinking man this is really big if true and then wanting to dig into it and my guess is that that's probably the main thing blocking people from taking these ideas more seriously just not getting that instinctive like oh wow this is kind of crazy reaction but crazy enough that i should actually try very hard to figure it out for myself yeah so i think in his most important century series of blog posts holden karnofsky does a really good job at addressing this intuition just being like this does sound crazy and here are a bunch of outside view reasons to think that you should engage with the craziness here are a bunch of specific arguments on the object level and here's kind of an emotional attitude that i take towards the problem and that feels like it's pretty directly aimed at the types of things that are determining a lot of machine learning researchers' beliefs sure so moving on if somebody's listening to this podcast and they're like ah richard ngo seems like a productive researcher somehow and they want to know what's it like to be you doing research what's your production function so to speak right i've been thinking a little bit about this lately because we're also hiring for the team i'm leading at openai
and you know what are the traits that i'm really excited about in a researcher and i think it's the combination of engaging with high-level abstractions in a comfortable way while also being very cautious about ways that those abstractions might break down or miss the mark so to me it kind of feels like a lot of what i'm doing with my research i use the metaphor of bungee jumping recently which is like going as high as you can in terms of abstraction space and then jumping down to try and get in touch with the ground again and then bouncing trying to carry that ground level observation as far back up as you can so that's the sort of feeling that i have when i'm doing a lot of this conceptual research in particular and then yeah my personal style is to start very top down and then be very careful to try and fill in enough details that things feel very grounded and credible was that roughly the thing you were aiming towards yeah it seems roughly like an answer and i guess the second to last question i'd like to ask well it might be the second to last is there anything else that i should have asked you and if so what i feel like there's something important to do with what the world looks like as we approach agi it feels like there's a lot of work being done by assumptions that either the world will look kind of roughly the same as it does now except that these labs will be producing this incredibly powerful technology or else the idea that the world will be radically transformed by new applications floating around everywhere and some of this has already been covered before maybe the thing that i'm really interested in right now is how do people respond like what are the types of warning signs that actually make people sit up and pay attention as opposed to the types of warning signs that people just kind of dismiss with a bit of a shrug and you know i think covid has been a particularly interesting case study where you can imagine some worlds in which covid is just a very big warning sign that makes everyone pay attention to the risks of engineered pandemics or you can imagine a world in which people kind of collectively shrug and go okay that was bad and then don't really think so much about what might happen next and it's not totally clear to me which world we're gonna end up in but it feels like we should be thinking or somebody should be thinking hard about what are the levers that steer us towards this world or the other world and then in the analogous case of ai what are the levers that steer us from a given very impressive large-scale application of ai towards different possible narratives and different possible responses and that's something that i'm thinking about a bit more in my work today it feels like an under-addressed question okay yeah well i guess the final thing i'd like to ask is if somebody's listened to this interview and they want to know more about you or maybe want to follow your research how should they do so so the easiest way to follow my research is on the alignment forum where i post most of the sort of informal stuff that i've been working on another way you could follow my thinking in a sort of higher bandwidth way is just via twitter where i share
you know work that i'm releasing as well as just more miscellaneous thoughts and you can find both of them by just looking up my name richard ngo and then probably the other thing that i want to flag is the course that i've designed which is currently running which is the agi safety fundamentals course and this is my attempt to make the core ideas in ai alignment as accessible as possible and you know have a curriculum people can work through and really grapple with the core issues and it's run as a facilitated discussion group so you sort of go along every week for a couple of months and have discussions with a small group of people yeah and we've been running that this is the third cohort of people that we're running now and we'll probably open up applications for another one in six months or so so that's something to keep an eye on or even if you don't want to do the course itself you can just have a look at the curriculum which is i think a pretty good reading list for learning more about the field we've also got a parallel program we've got two tracks one is technical alignment the other one is ai governance and we've got curricula for both of those so if you want to learn more you can check those out we've pretty carefully curated them to convey the core ideas in the field great well links to all of those will be in the show notes so thanks for joining me today and to the listeners i hope you got something out of this conversation absolutely thanks a bunch daniel this episode is edited by jack garrett and justis mills helped with transcription the opening and closing themes are by jack garrett the financial costs of making this episode are covered by a grant from the long-term future fund to read a transcript of this episode or to learn how to support the podcast you can visit axrp.net finally if you have any feedback about this podcast you can email me at feedback at axrp.net