Library / In focus

80,000 Hours Podcast · Civilisational risk and strategy

Pushmeet Kohli on DeepMind’s plan to make AI systems robust and reliable, why it’s a core issue in AI design, and how to succeed at AI research

Why this matters

This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.

Summary

This conversation examines core safety through Pushmeet Kohli's account of DeepMind's plan to make AI systems robust and reliable: why robustness is a core issue in AI design, how to succeed at AI research, and the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Start → End

Across 62 full-transcript segments: median 0 · mean -2 · spread -250 (p10–p90 -80) · 5% risk-forward, 95% mixed, 0% opportunity-forward slices.

Slice bands
62 slices · p10–p90 -80

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes safety
  • Full transcript scored in 62 sequential slices (median slice 0).

Editor note

A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.

ai-safety · 80000-hours · core-safety · technical

Play on sAIfe Hands

Uses the global player with queue, progress, speed control, and persistent playback.

Episode transcript

YouTube captions (auto or uploaded) · video GYQrNfSmQ0M · stored Apr 8, 2026 · 1,872 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/pushmeet-kohli-on-deepminds-plan-to-make-ai-systems-robust-and-reliable-why-its-a-core-issue-in-ai-design-and-how-to-succeed-at-ai-research.json when you have a listen-based summary.

Show full transcript
Welcome, everyone. My name is Huw Price; I'm the academic director of the Centre for the Study of Existential Risk and the Bertrand Russell Professor of Philosophy here in Cambridge. Welcome to this event, which is, among other things, the latest in a series of research seminars that CSER has been running this year. We have a very distinguished visitor to deliver this seminar for us, and to introduce him I'm going to introduce Professor Zoubin Ghahramani, from the machine learning group in the Faculty of Engineering, who is somebody we've recently become involved with at CSER in our projects to direct more attention to issues of safety and benefit in artificial intelligence, the kind of topics Professor Russell is going to be talking about. So without further ado, I'll hand over to Zoubin.

Great, thank you. I'm delighted to be here introducing Stuart Russell, who is professor of electrical engineering and computer science at the University of California, Berkeley, and formerly chair of that tremendously important computer science department. He has a background in physics from Oxford, went to Stanford to study computer science, and seems to have stayed at Berkeley since then, although he did spend a couple of very enjoyable years, I believe, in Paris, holding the Chaire Blaise Pascal; and he visits here frequently, it seems, which is great. Stuart has won a number of awards for his highly influential work in artificial intelligence: in probabilistic reasoning, limited rationality, machine learning, and many other areas, including important applications to fields like global seismic monitoring. His awards include awards from the National Science Foundation, the Computers and Thought Award, and awards from the International Society for Bayesian Analysis and the American Statistical Association. It's a great pleasure to have him here, in particular for this series of lectures, because he has recently been tremendously influential in bringing the discussion of the future of artificial intelligence into a wider academic community of people who actually work in artificial intelligence, like Stuart. That's important, because that discussion needs to happen not just in the public sphere and in science fiction, but in the academic circles of people who work in this particular area. So it's a great pleasure to have him here, and we welcome him.

Thank you very much, Zoubin, and thank you, Huw, for organizing the meeting. For some of you this will be déjà vu, if you were at the Puerto Rico meeting or at AAAI in January; a few new wrinkles have been added, and for some of you it will be new, I hope. This is something I've been thinking about for a long time. The first edition of my textbook, back in '94, actually has a section titled "What if we succeed?" I thought it was important to have that discussion because the field seemed never to think about the possibility: having laboured for decades and made not very much progress, people had almost stopped thinking about what it would be like if we succeeded. I got the idea for the chapter from David Lodge, who wrote a book called Changing Places, and another called Small World, about a young English academic from Birmingham (called Rummidge in the book, but obviously Birmingham) who ends up going to a place he calls Euphoric State, a thinly disguised version of Berkeley.
David Lodge teaches in Birmingham, and by coincidence, when we used to live in Birmingham, David Lodge bought the house we lived in when we moved out. So a number of coincidences made me want to read his books, and in one of them the young protagonist asks a very distinguished panel of literary theorists: what if you were right? None of them had ever thought about this possibility before. That's how I got the idea for the title of that chapter: what if we were to succeed?

OK, so AI is about making computers intelligent; we all agree with that. And I think there's now a fairly broad consensus that what we mean by that is not so much "does the thought process have to mimic the thought processes of humans?" or "does the thought process have to follow some laws of intelligence?", but really: can we get the computer to do the right thing? The means, the internal processes, follow from that, and there can be many ways of doing the right thing. What we mean by doing the right thing, very crudely (and I'm not going to make any sophisticated points that would impress any economists or philosophers in the audience), is, let's say for the time being, maximizing the expected value, over the entire future, of some measure of how happy we should all be with the outcome. Maximizing expected utility: a very standard formula. That's what we generally mean by saying the computer is doing the right thing, and it can be specialized in all kinds of ways: to speech acts, to machine learning systems that classify objects (they're making decisions about what kind of object something is), to robots, and so on; lots of different ways of using the same basic theoretical framework.
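As a compact gloss on "maximizing expected utility" (standard textbook notation, not a formula shown in the talk): pick the action whose probability-weighted utility over outcomes is highest.

```latex
a^{*} \;=\; \arg\max_{a \in \mathcal{A}} \; \mathbb{E}\!\left[\, U(s) \mid a \,\right]
      \;=\; \arg\max_{a \in \mathcal{A}} \; \sum_{s} P(s \mid a)\, U(s)
```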
So why are we doing this? Well, initially, because it's just incredibly fascinating to think that we might understand intelligence well enough to actually create something intelligent; and obviously, the more intelligent the better, which doesn't seem to need much debate. And we do believe we can succeed. There are those who say there's no need to worry about the future because we'll never reach this future where AI has succeeded. I'm not sure those people are being completely honest: it doesn't seem to make sense to work on a research problem you believe is not solvable. And if you draw out the analogy (the doomsday prophets say AI is driving the car of the human race along a road that leads straight off a cliff), the people who say we'll never succeed are like people who say, "that's fine, we'll just keep heading for the cliff and hope we run out of gas before we get there." That doesn't seem a very prudent strategy either. So let's work on the assumption that we will succeed. It's a very difficult problem, and I'm not going to say we'll succeed in any particular amount of time, because I know I'd then be quoted in the newspapers, and I don't want that. But let's say it can happen: when I talk to physicists, they see no reason why we can't arrange atoms in such a way that they do much more effective computations than humans can do.

And I think it's obvious to everyone now that progress is accelerating. This is not just commercial hype, although there is a lot of that. Within the field, I talk to many of my contemporaries, and we've all had what we might call "holy cow" moments, where we see some piece of progress we simply didn't expect to be possible at this stage, and we're taken aback by the rate at which things are moving. One reason is that the theoretical foundations of the field are now pretty solid, based on rational decision-making and statistics, and on understanding perception and language as inference from evidence; people like Zoubin have made great contributions to this progress. With those foundations it becomes possible to build cumulatively, something we never saw in the early days of the field. In the first twenty years it was very rare to find a paper that built on the previous work of more than two or three predecessors; now there is layer on layer of theory, and that's advancing the practice at the same time. In applications such as speech recognition, vision, and reinforcement learning, the deep learning techniques (gradually refined from ideas that existed twenty years ago, but which with those tweaks have all of a sudden started working very well) have produced really dramatic progress on core problems. We now have ways of describing probability models of arbitrary complexity, using universally expressive languages: essentially anything that can be written down in any language can be written down in one of these languages, concisely. And we're starting to understand one of the most important problems for intelligent decision-making: how to make decisions over long time scales, where your plans involve millions or billions of individual primitive physical actions, and yet you'd like to plan over those kinds of time scales. I sometimes give a talk called "Life: play and win in 20 trillion moves"; 20 trillion is roughly how many things you do in your lifetime. You don't think of it that way, because you don't think at the level of individual muscle activations, but that's roughly how many things you do.

We've seen all these high-profile public successes. Probably the first really big one in our memory was the victory over Garry Kasparov by Deep Blue in '97. More recently, just a couple of months ago, one version of poker was won by a machine beating the best human players. And DeepMind (now part of Google) developed a reinforcement learning system that uses deep learning to represent the value function it learns. The input to the system is the screen of an Atari video game, 640 by 480 pixels, and that's it. It opens its eyes (it's born, it opens its eyes, it sees above its crib, as it were, a video game), and within a few hours it learns to play that video game at a superhuman level. And it does that for over 30 different video games. If that was your baby in the crib, and by the evening of the day of its birth it was playing most of the Atari video games at a superhuman level, you'd be perturbed, I think.
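The learning rule behind that Atari result is reinforcement learning with a learned value function. As an editor's sketch (a toy tabular analogue, not DeepMind's DQN, with all environment details invented), the same update looks like this in Python:

```python
# Toy Q-learning on a 5-state corridor. DeepMind's Atari system replaces the
# table Q with a deep network over pixels; the update rule is the same idea.
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))   # learned action-value estimates
alpha, gamma, eps = 0.1, 0.95, 0.3    # learning rate, discount, exploration
rng = np.random.default_rng(0)

def step(state, action):
    """Hypothetical environment: +1 reward only for reaching the right end."""
    nxt = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == n_states - 1 else 0.0
    return nxt, reward, nxt == n_states - 1

for _ in range(500):
    s, done = 0, False
    while not done:
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # Move the estimate toward reward + discounted value of the best next action.
        Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s2].max()) - Q[s, a])
        s = s2

print(np.argmax(Q, axis=1))  # states 0-3 should learn action 1 ("go right")
```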
So the subject of perturbation appears. The gentleman on the left: this is the final of Jeopardy!, which is a quiz program; not anything like Mastermind or one of the good British quiz programs, but it's something. Watson is a system built by IBM to play this Jeopardy! quiz game, and the game is actually not trivial. It's not just factual retrieval; it's something like a slightly bland version of a cryptic crossword, with lots of puns and tricks in the way the questions are posed. It turns out there are about 110 different categories of Jeopardy! question, in terms of how the question and answer are logically related to each other, and Watson has been trained to understand them. In addition, it has encyclopedic knowledge: not a database created by hand, but extracted automatically from text it finds on the web. It's able to perform substantially better than the best human players, who are very intelligent, well-read, learned people; some have even been Berkeley professors. And so the gentleman on the left is saying, "I, for one, welcome our new computer overlords."

Some of you have seen the progress that's been made in self-driving cars. I'll just show you, assuming it works, a little video: footage from Google's car, which drives itself around. This is the typical image quality the system deals with, and yet it has now driven almost a million miles without any serious accidents.

This is my colleague Pieter Abbeel's robot, and you'll see in front of it a pile of towels. (If you have your laptop handy, you can see the video on the web; just type in "robot towel folding".) It picks up one of the towels and looks around to figure out where to grab it; when it knows it has the corners, it goes over the table and folds it up very nicely. Basically, it can start from a big tub of laundry, put it in the washing machine, run the cycle, take it out, put it in the dryer, run the cycle, take it out, fold it all up, and put it back in the basket. Really quite impressive progress, if only I could show it to you. (This one's not a video.)

Good. This is a fairly new task in machine learning: take an image and produce a caption for it, sometimes called automated captioning. These systems are trained, again, on millions of images that humans have captioned on the web; then you give the system a new image and ask what we should call it, and it comes up with "a group of young people playing frisbee". When you see this kind of thing happening in systems that have essentially been trained with very little manual structuring or programming, that's the kind of thing that causes people to have the holy cow moment.

Let me talk very briefly about some work in my own group that Zoubin mentioned: global seismic monitoring. This is a map of all the places where nuclear explosions have taken place. Two of them were in Japan in 1945, over there; they killed about 200,000 people. All the rest have been tests, or, in the Soviet Union, so-called peaceful uses of nuclear explosions, where they were used to dig canals or large underground caverns and things like that: not a very successful enterprise.
All in all, the tests killed another 100,000 people from radioactive fallout, and they nearly caused several wars. Here's a little picture of the crater from a fairly small nuclear test; the biggest nuclear test was 500 times the size of this one. Nowadays we don't think much about nuclear weapons, but they are still a very serious concern for the countries around the world that have to deal with them. So there's a treaty, the Comprehensive Nuclear-Test-Ban Treaty, or CTBT, which bans testing of nuclear weapons on the Earth. It's a very strong treaty: if convincing evidence is brought to the council of states, they can essentially invade the country in question and occupy a 1,000-square-kilometre area to look for the smoking gun of an underground explosion chamber. Unfortunately, the US proposed this 57 years ago and has not yet ratified the treaty it itself proposed, which is an embarrassing situation, I would say. When President Clinton tried to get it through the Senate, his opponents claimed it wasn't verifiable: that we didn't have good enough detection mechanisms to prevent other nations from cheating with clandestine explosions.

So the UN has built a very large network of stations around the world, most of them seismic stations, for detecting the vibrations produced by explosions. This turns out to be a very difficult problem, because the Earth produces events as big as or much bigger than nuclear explosions hundreds of times a day, and because people can try to cloak a nuclear explosion using various clever techniques so that it appears much smaller than it really is as a seismic event; for instance, setting off two nuclear explosions one right after the other, to produce the same pattern of seismic signals as a natural earthquake. So it's a very, very difficult problem.

I bring this example up to show the value of having the kinds of universal probabilistic languages that have now emerged in AI. Within a couple of hours of learning what the problem was, we could use one of these probabilistic languages; here's the program we wrote in it, and this is essentially the solution to the problem. It describes the physics of the occurrence of seismic events (I don't expect you all to read and understand it, and there's not going to be a test): how seismic or nuclear events occur, how signals propagate, how they're detected, the kinds of errors that are made, the kinds of noise processes that exist in the Earth. What we do is write a large probability model like this, provide as evidence the signals from all the seismic stations around the world, and then simply ask: what happened? What events took place, where, and how large were they? The system answers with its best explanation for the observations provided. This chart shows the failure rate of the current system: depending on the magnitude range, between 30 and 50 percent of the real events that occur are missed by the current automated detection system. Applying the model we just showed, we were able to get that down to about 10 or 11 percent, and we're pushing it down quite a bit further right now.
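The pattern here (write a generative model of the physics, condition on the observed signals, ask for the most probable explanation) can be sketched in miniature. This is an editor's toy, not the real monitoring model: one event on a line, four hypothetical stations, a made-up decay-plus-noise forward model, and grid-search posterior inference.

```python
# "Describe the problem, then ask what happened": a toy generative model of one
# seismic event and Bayesian inference over its location. All numbers invented.
import numpy as np

rng = np.random.default_rng(1)
stations = np.array([0.0, 30.0, 70.0, 100.0])  # station positions (km), assumed
true_loc = 42.0                                # hidden event location
noise_sd = 0.3

def expected_signal(loc):
    """Forward model: signal amplitude decays with distance from the event."""
    return 10.0 / (1.0 + np.abs(stations - loc) / 10.0)

obs = expected_signal(true_loc) + rng.normal(0.0, noise_sd, size=stations.shape)

# Posterior over location on a grid, under the Gaussian noise model.
grid = np.linspace(0.0, 100.0, 1001)
log_post = np.array([-0.5 * np.sum((obs - expected_signal(g)) ** 2) / noise_sd**2
                     for g in grid])
post = np.exp(log_post - log_post.max())
post /= post.sum()
print("posterior mean location:", round(float((grid * post).sum()), 1))  # ~42
```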
This is a picture of the North Korean test that took place in 2013, showing the terrain of the test site. There's a tunnel down here, so we know the North Koreans drilled a tunnel, and the explosion probably took place somewhere in the vicinity of that black cross. This is the estimate from our system, and up there is the estimate from the combined expertise of most of the world's geophysicists plus the UN's automated processing system. The point is that what used to require the work of possibly hundreds of people (the UN system cost over $100 million just for the software, and people have been trying to solve this problem for about a hundred years, since the first paper was written on seismic localization) took us just a few months from learning about the problem to having a system that works substantially better than what was available, simply because the field of AI had produced tools sufficiently expressive to simply describe the problem and solve it in an almost mechanical fashion.

Another thing that's going on, which I'm sure hasn't escaped your notice if you have any students with computer skills, is that it's very hard to keep your students at the university, because they're being stolen at a rapid rate. Industry has invested probably more in the last five years than governments have invested since the beginning of the field; the rate is really astonishing. The reason is simple. In several areas (I've listed some of them), performance in the lab has grown fairly steadily, perhaps recently accelerating somewhat, and to us in the lab it looks like continuous progress. From the outside, in the commercial world, it was completely invisible, because it wasn't at a level where they could sell it. As soon as it crosses that threshold, it begins to exist, and you get a virtuous cycle of feedback: you create a market; once that market exists, there's value in further improvements in performance, because you can charge more, enlarge the market niche, and find new applications; so you invest more money, and the cycle runs at a completely different scale of speed and magnitude of research funding. We've already seen this in automated vehicles, and I think after self-driving cars the two new big areas where AI will have dramatic impact on ordinary people's lives will be domestic robots and intelligent assistants.

OK, so having said all this, we get to the question of what happens if we succeed. I. J. Good wrote a very pithy article that included this very famous sentence: "the first ultraintelligent machine is the last invention that man need ever make." The reason, of course, is that what we have as human beings, we have because of our intelligence; we're not particularly strong or fast or scary, but we are relatively brainy, and that gives us what we have. So if we have more of that, machines we can use to amplify our intelligence, then, to a first approximation, we can solve all the rest of our problems. So very positive outcomes could come from increasing the power of AI.
This is the vision that drives most people in the field. It's not, as Lighthill put it, that we're frustrated men who are unable to have children and so build robots instead; it's that humans with AI could do wonderful things we couldn't do unaided. Good also went on, of course, to make the other famous point: an ultraintelligent machine could design even better machines; there would be an intelligence explosion, and the intelligence of man would be left far behind. That's usually the part that gets quoted. But he goes on to say, "it is curious that this point is made so seldom outside of science fiction." So even then he was complaining that the field of AI, the serious people, were just not paying attention to what he considered a really important problem. To come back to the metaphor of driving off a cliff, he's saying: look, you're on a road; the road goes straight; you can see the road goes off a cliff somewhere up ahead. Why are you continuing? Shouldn't you think about changing direction?

There was a recent article (I think Max Tegmark initiated it, but it went out with several of the CSER Advisory Board members as co-authors and appeared in The Huffington Post) that included these two sentences, and I don't think the first sentence is an exaggeration. The other interesting thing is that this sentence has now almost gone into folklore: when I was in Geneva, I was surprised to hear the ambassador from China quoting it in his opening speech in the UN discussion of a potential treaty to ban autonomous weapons. We're not going to get any more scaremongery than this, right? The point is a very simple one: how do we make sure that that road does not take us off a cliff?

Given that it's this important, I'd say there hasn't been enough serious thought. So here's a little thought experiment. Another candidate for biggest event in history would be a visit from a superior alien civilization; that would be a pretty big deal. Suppose we get an email, to humanity@un.org: "We'll be there in 30 to 50 years." What do we reply? We send the out-of-office auto-reply. That's roughly the current situation: the field of AI has just said "yeah, whatever", which is an odd reply to these concerns.

Part of the problem is that there's a lot of non-serious stuff being written. Here's a quote from the normally very sensible and serious Harvard Business Review, in 2014: "by 2025, these machines will have an IQ greater than 90% of the US population." What on Earth could that possibly mean? Machines don't have an IQ. It's not that they have a small IQ; they don't have an IQ at all. The concept simply doesn't apply to machines. A machine can be world chess champion and totally unable to play checkers; in fact, they are world chess champions and totally unable to play checkers. The entire point of the notion of IQ is that, generally speaking, there's a high degree of correlation between a person's ability on one intellectual task and another, and that's why the notion makes sense.
Some people argue that, no, we actually need seven IQs; but for machines it would be essentially infinitely many IQs: a chess IQ, a checkers IQ, a writing-French IQ, a writing-English IQ. Every single task, as we currently understand how to make those tasks work, is done separately, with no bleed-over into any neighbouring task. This is one reason, actually, why the DeepMind result is a little more impressive than you might think: it really does learn lots of different games from scratch and does them all well, and the games are quite different (some involve shooting, some driving, some batting things backwards and forwards with paddles), yet it handles all of them. So it's an early sign that generality is possible.

Another thing you'll see a lot is predictions based on the awesomely increasing power of computers. The problem is that the word "power" is being used in order to make the sentences true: it's used in a very simple technical sense, something like the number of floating-point operations per second per dollar, and indeed that number is increasing quite rapidly. But when you start putting various kinds of brains on the y-axis and saying "in 2029 we'll exceed the human brain, and seven years later we'll exceed all human brains combined", that starts to be completely ridiculous, because at the moment our programs are pretty stupid at almost everything, and machines a thousand times faster would just produce stupid answers a lot quicker. I exaggerate slightly, but raw speed wouldn't constitute the qualitative breakthroughs we need; it wouldn't get us anywhere towards the completely new structures of reasoning and learning we would have to have for general intelligence. So I propose we just ban these kinds of diagrams altogether.

Another non-serious thought we get a lot from journalists: if you raise this issue, it must be because it's urgent and imminent. I've had people tell me I have to stop my research right now, that whatever I do, if I put my hands on the keyboard, I'm putting the human race at risk. It's not like that. It's not right around the corner, and it doesn't have to be right around the corner for there to be a reason for concern: you don't have to be right at the edge of the cliff before someone is allowed to say, "you are driving towards the cliff; are you sure you want to do that?" The other thing, as I'll illustrate in a minute, is that breakthroughs are very hard to predict. Although I don't believe this is imminent, I can't say it's going to be at least 50 years away either; it's very difficult to figure out when these breakthroughs will occur. So the point, and this is sort of the mantra of CSER in general, is that it doesn't matter that the risks are small if they're non-negligible, and there's a serious argument that they're not vanishingly small. Then it's certainly worth trying to figure out how large the risk is, when it's likely to become serious, and whether anything can be done.
That's true for climate change, for asteroid impact, and for AI, and I think the sooner we do something, the better.

Many articles about this just seem to be an excuse to pull out the old pictures of Terminator robots. It doesn't matter what I say; I could issue a press release consisting of a blank sheet of paper, and people would come up with a lurid headline and put Terminator robots in the picture. But armies of robots are not what people are talking about; that isn't the danger. Think about it: when humans have done terrible things to the world, they've done it by speaking, by convincing lots and lots of people to do things that upset lots and lots of other people in very serious ways. You can have impact on a global scale simply by having an internet connection, and therefore access to three billion screens.

Another thing you see a lot: "we don't need to worry about this, because the only problem would be machines spontaneously becoming malevolent, and why would that happen? We design the machines; we're AI people; they don't become spontaneously malevolent; so we can simply ignore it." Well, yes, spontaneous malevolence is a ridiculous idea, and some of the stories about how the risk could arise seem to involve it, as do many movies. But the risk actually comes from competent decision-making. It's not that some evil guy in his garage (or, as in Ex Machina, in his weird Patagonian retreat) is developing something without thinking about what he's doing. This is the goal of the field: our goal, as a field, is to make better decision-making systems, and that is the problem. That's why I say the road appears to lead off the cliff if we just follow it as we're currently doing.

So let me try to explain why the road appears to lead off a cliff, and then talk about what we could do about it. The problem with AI that gets better and better is precisely that it gets better and better, and what it wants to do is not what we really want done. If it were what we really want done, I suppose we wouldn't complain; or we might complain that we have nothing left to do, that we're all going to sit around watching reruns of Survivor. But obviously, if a system is doing something we don't like, and it's much, much smarter than we are, that seems like a problem. I put asterisks by "we", because there isn't a single "we": there are lots of us, and we all want different things. And I put an asterisk on "really", because it's actually very hard to figure out what we really want, and that's part of the problem. If you're going to build a superintelligent machine, you have to give it something you want it to do, and the danger is that you give it something that isn't actually what you really want, because you're not very good at expressing, or even knowing, what you really want, until it's too late and you see that you don't like it. And all the fields that study better decision-making pretty much take the objective as something that comes from the outside.
We study how to optimize it, but what it is is entirely up to you. That's almost a religious precept of utility theory: the objective can be anything, as long as it satisfies the right set of axioms, and we'll treat it as a utility function. I'm arguing that this is a fundamental mistake we cannot afford to make.

Many people have made these kinds of arguments; Nick Bostrom in particular has done a good job of collecting them together and making them persuasive. He has the paperclip example; Marvin Minsky had the "calculate pi" example. You could even have something as innocuous-sounding as "cure cancer". If goals like these are given to a superintelligent machine that doesn't really understand the rest of human values, that doesn't understand this is not an absolute goal but one hedged by lots and lots of trade-offs a human would naturally make, things go wrong. A human naturally understands that when you say "I want you to cure cancer", eliminating the human race in the process isn't what you really want. The same with "make paperclips": we don't want the whole world covered in paperclips; we just want enough paperclips to clip our paper together.

This is a very old story. King Midas had exactly this problem: he didn't specify what he wanted correctly. He said, "I want everything I touch to turn to gold"; he got exactly what he asked for, realized too late that it wasn't what he wanted, could no longer eat or drink, and died a miserable death. And we all know what happens with a genie: by the time you get to the third wish, you're wishing to undo the first two, because you misspecified your goal. This recurs over and over again in mythology.

So if you specify an objective for an intelligent system that isn't exactly what you want, there's an extra argument that Steve Omohundro pointed out: for almost any goal you give a system, one of the things it has to do in order to achieve that goal is make sure it doesn't get switched off. In fact, if you're a superintelligent system, the most likely reason for failing in your given mission is that someone switches you off, so you're going to take precautions to make sure that can't possibly happen. That's one argument; the other is that you can increase your probability of achieving the goal by acquiring more and more financial or physical resources to carry out the task. The movie Transcendence illustrates these points pretty well: it describes an intelligence explosion happening in a system that grabs control of lots of money and resources and figures out how to defend itself extremely well. Combine that with value misalignment, and you have a superintelligent system doing things you don't like, extremely good at defending itself, and acquiring all the resources the rest of the human race might want for something else. That is obviously a problem: your 2001: A Space Odyssey problem, where HAL doesn't want to do what the humans would like.
Think about it just in terms of an optimization problem: the machine is solving an optimization problem for you, and you leave out some of the variables you actually care about. It's in the nature of optimization problems that if the system gets to manipulate variables that don't form part of the objective function, it's free to play with them as much as it wants; and often, in order to optimize the variables it is supposed to optimize, it will set the others to extreme values. So in a simple mathematical sense the problem is all but inevitable, unless your approximation to the true objective function is pretty good and pretty complete.
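That omitted-variable effect fits in a few lines. An editor's toy with invented numbers: a cleaning robot scored only on dust removed, with water use left out of the objective.

```python
# An optimiser scored on one quantity drives unmodelled quantities to extremes.
import numpy as np

water = np.linspace(0.0, 100.0, 1001)   # litres the robot may use (free variable)
dust_left = 10.0 / (1.0 + water)        # more water always means less dust

# Objective as specified: only dust counts, so the optimum is "use all the water".
print("specified:", water[np.argmin(dust_left)])                # -> 100.0 litres

# Objective as intended: dust counts, but so does the water we forgot to mention.
print("intended:", water[np.argmin(dust_left + 0.05 * water)])  # -> ~13 litres
```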
Another pretty obvious problem, besides making sure the system does what you want, is that these systems are in principle going to be difficult to analyze. In the standard thought experiment they're already smarter than you, so how can you figure out what they're going to do? If you could, you'd be as smart as they are. There are also difficult technical questions about the software. We're used to doing formal analysis of software that's fixed: we write software that doesn't overwrite itself, we prove it doesn't overwrite itself, and then it's sufficiently manageable that we can verify it satisfies certain properties. Here we have to allow for the fact that systems smarter than we are can program just as well as we can: they can build new, more powerful versions of themselves, and then program those. That takes you into areas far beyond what we currently understand how to handle in computer science.

So my proposal is that we stop doing AI under its simple definition of just improving the decision-making capabilities of systems; we should add some extra adjectives. Eventually these adjectives will disappear into the word "AI", just as in civil engineering we don't say "building bridges that don't fall down"; we just say building bridges, because of course we don't want them to fall down. We should think the same way about AI: of course AI systems should be designed so that their actions are well aligned with what human beings want. But that is a difficult, unsolved problem that hasn't been part of the research agenda up to now.

OK, fine. So how do we make them provably beneficial? Again, an asterisk: in some sense it's almost oxymoronic to put "provably" next to "beneficial", because "beneficial" is a vague term, and without a precise meaning, how can anything be provable? Let me go over some of the ideas people have proposed. One of the first things that would occur to you: if the problem is that a system can act on the world on a global scale and affect things in ways we don't like, let's take away the capacity for action altogether and put it in a box. It's a very natural response; if you have a tiger, you put it in a cage. So we seal it off, we don't connect it to the internet, and so on. But you have to connect it to something, otherwise it may as well not exist at all. So people have proposed systems that can only answer yes or no to questions (I see nodding, yes), and there have been lots of arguments showing how even systems limited in this way can still cause you arbitrary amounts of grief. I won't go into the details, but one point is this: even if it's provably a question-answering system that provably gives you only yes, no, or probability answers, it still has to have the ability to control its own computational behaviour. In some sense it's still a real agent at what we call the meta level: it has to decide which computations to do, and if you take away that ability, you're probably taking away the ability to be a superintelligent question-answerer altogether. I don't know that for sure, because no one has really thought about it very much; but if you do have an agent at the meta level, you may just be replicating the same set of problems you were trying to avoid.

Another approach might be a kind of stepwise progress. If it's really difficult to be sure a proposed design for a superintelligent agent is safe, then perhaps, given a pretty good idea of how to make superintelligent decision-making systems, we could make super-superintelligent verifiers, whose only job is to look at proposed designs and say: yes, this one is safe; if you deploy it, it will not be a disaster for the human race. Then you'd have a stepwise process: improve the verification technology, which lets you deploy a new generation of agents, and keep leapfrogging one over the other. That might work; I'm not sure. But it's an open question whether verifying that an agent of decision quality X is safe is actually easier than building a system with that decision quality in the first place. We don't know the answer.

There are a lot of questions here that simply ought to be answered, and I think many can be formulated mathematically and solved precisely. At the moment we talk about intelligent systems having goals (the goal of making paperclips, the goal of defending itself, and so on), but this is all just words. Is there an objective sense in which we can take a machine and say that this agent, implemented in this machine, has a particular goal? As yet there's no answer to that question, yet it seems a pretty sensible one to try to answer, and I think it can be done. There isn't really an answer to "is this agent better than that one?" either. We would also like to understand questions like: can we be sure this agent design will never overwrite the goals we provide? If we figure out the right objective to give it, and we can prove it has those goals, can we prove it can never overwrite them with something else? That would not be easy to do right now, but with an accumulation of mathematical results, I think it could become quite feasible.
My own inclination right now is to work more directly on the problem of value alignment. If misalignment of values is the root cause of the difficulty, then perhaps we can get systems to have values sufficiently well aligned with those of humans. There's a concept called inverse reinforcement learning, which is a reinvention of something in economics called structural estimation of MDPs, and it is, as the name suggests, the inverse of reinforcement learning. Reinforcement learning means you're given rewards depending on how well you behave, and you learn to behave in a way that maximizes the sum of rewards you receive. There's pretty good evidence this is effective in practice (it's the method DeepMind uses to learn the Atari video games), and also evidence that a lot of human and animal learning follows this general pattern. Inverse reinforcement learning is the opposite: we observe a behaviour and try to figure out what reward function, what objective function, that behaviour is trying to maximize. I'd argue that although we certainly have learning mechanisms for our values that are based on the reward functions built in by nature (pain, pleasure, and so on), a lot of what humans do as they grow up and learn value systems is actually inverse reinforcement learning: acquiring a value system directly from other humans by reverse-engineering how everyone else behaves and how people react to that behaviour. There's now a fairly deep literature on inverse reinforcement learning, with plenty of theorems showing that in such-and-such circumstances these algorithms will, after a certain amount of experience, learn to behave nearly as well as the agent they're observing, or learn an approximation to that agent's value function that gets arbitrarily close in the limit.

What you'd like is this: your household robot watches you struggle out of bed in the morning and do funny things with little round brown bean-shaped objects in a very noisy machine that makes steam, and then put the brown stuff in your mouth; and you want the robot to understand that coffee is something that's good for people to have in the morning. You don't want the robot to want the coffee. We don't want the robot to learn the human's values in the sense of taking those values on itself; we want it to want to maximize the value that humans receive. It doesn't care about itself at all. The right way to think about it is that the robot has no desires whatsoever, except that the outcomes it brings about be ones in which the human is happier. We call this cooperative inverse reinforcement learning: in game-theory language, we want to solve a multi-agent problem whose Nash equilibria are the ones that maximize the payoff for humans, where at the outset the robot doesn't know what the human payoff function is. Being Bayesian, we can put a prior over payoff functions.
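As a compact statement of the contrast just described (standard notation, not slides from the talk): reinforcement learning goes from a known reward to behaviour, while inverse RL goes from observed behaviour back to a posterior over rewards.

```latex
% RL: given reward R, find the policy that maximises expected discounted return.
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\Big[\textstyle\sum_{t} \gamma^{t} R(s_t, a_t)\Big]
% IRL: given observed trajectories \tau, infer the reward being optimised.
P(R \mid \tau) \;\propto\; P(\tau \mid R)\, P(R)
```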
When you have a very broad prior, there's a question of what the robot should actually do before it has learned what humans want. My intuition is that it should basically not do very much: stay in the background, watch and learn, maybe ask questions, maybe do some cautious exploration of our preferences. Risk aversion in the agent's payoff function seems to help with that, but there has to be more: some belief that doing nothing is somehow safer or better than doing any other real physical action. And that has to be a bias, because from the point of view of game theory all actions are a priori just tokens (action 1, 2, 3, 4), and one of them happens to be "do nothing"; something has to make you prefer it. So there are still questions here I don't quite understand, and I'm happy to discuss them if people have ideas.

The goal would be to analyze these multi-agent systems and show that, in the limit, the total loss the human might experience from having this robot in the world is not great, because the robot doesn't do much until it has a pretty good idea of what the human wants. Hopefully it learns fairly quickly that it's fine to pull little children out of the street so they don't get run over by buses; perhaps it doesn't initially know whether to intervene in a fist fight between two adults. The idea is that with these kinds of priors and risk aversion, you could bound the loss to the human to something fairly small. Clearly that depends on how well you approximate what the human cares about; but intuitively it also seems to depend on how smart the agent is, and this is the part that's really puzzling. Our intuitions (from watching science fiction, I suppose) say that when the agent is superintelligent, really able to optimize payoff over an arbitrarily long horizon, take into account essentially all available information, and do all the Bayesian calculations correctly, an error in the payoff function is potentially magnified: it could result in much worse things happening, precisely because the agent is so capable. So there's a question of how fast the error in the payoff has to go down, as the intelligence of the agent goes up, for not-too-bad things to happen. This seems like a very interesting set of questions, at the level where you could find a really bright PhD student or postdoc and turn them into things concrete enough to actually do research on.
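The "stay cautious until the posterior sharpens" behaviour described above can be made concrete in a toy. This is an editor's sketch of the intuition, not an implementation from the talk: two invented hypotheses about what the human values, a variance penalty standing in for risk aversion, and a Boltzmann observation model.

```python
# A robot keeps a Bayesian posterior over what the human values and prefers
# "do nothing" while uncertain. All numbers and hypotheses are hypothetical.
import numpy as np

# Two hypotheses about human values, over three robot actions:
#            [do nothing, make coffee, tidy desk]
reward = {"likes_coffee": np.array([0.0, 1.0, -0.5]),
          "likes_tidy":   np.array([0.0, -0.5, 1.0])}
posterior = {"likes_coffee": 0.5, "likes_tidy": 0.5}   # broad prior

def robot_action(posterior, risk_aversion=1.0):
    """Expected human reward per action, penalised by across-hypothesis variance."""
    R = np.stack([reward[h] for h in posterior])
    p = np.array(list(posterior.values()))
    mean = p @ R
    var = p @ (R - mean) ** 2
    return int(np.argmax(mean - risk_aversion * var))

print(robot_action(posterior))   # 0: under uncertainty, doing nothing wins

# The robot watches the human make coffee; a Boltzmann observation model makes
# that action more likely under "likes_coffee", so the posterior shifts.
human_action = 1
for h in posterior:
    lik = np.exp(reward[h])[human_action] / np.exp(reward[h]).sum()
    posterior[h] *= lik
Z = sum(posterior.values())
posterior = {h: v / Z for h, v in posterior.items()}

print(posterior)                 # belief now favours likes_coffee
print(robot_action(posterior))   # 1: confident enough to make coffee
```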
There are some obvious difficulties. If you step away from the idealized world of computer science papers and ask, "OK, Professor, are you really proposing that with these algorithms, machines observing human behaviour will learn a pretty good approximation to the values of what we might call an admirable human being?", that's a very difficult question. One obvious problem, which inverse reinforcement learning addresses only a little, is that humans don't behave rationally according to some single consistent utility function. We have multiple means of choosing how to behave: reflexive, impulsive behaviours; deliberative behaviours; parts of the brain from early evolutionary periods that respond to very crude stimuli in very crude ways. There's a very complicated system of decision-making subsystems inside the human brain. We're also fundamentally computationally limited, so we can't behave rationally. Think about a chess game: you observe a human who is trying really hard to win, but who, being unable to consider all 10^55 possibilities, makes the move that eventually turns out to lose the game. A really smart system observing that move shouldn't infer that the human was trying to lose at that point, even though that's what the move in fact accomplished. The computational limitations of the human are absolutely fundamental to understanding how we behave and to inferring what we're trying to do: you can take actions that fail to achieve a goal and still be trying to achieve that goal, and that's how they should be interpreted.
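One standard way the IRL literature handles exactly this point (an editor's note; the talk doesn't name it) is to model the human as noisily, or "Boltzmann", rational, so a losing chess move lowers, but doesn't zero, the probability that the player was trying to win:

```latex
P(a \mid s, R) \;=\; \frac{\exp\!\big(\beta\, Q_R(s,a)\big)}{\sum_{a'} \exp\!\big(\beta\, Q_R(s,a')\big)},
\qquad
\beta = 0:\ \text{random} \quad\longleftrightarrow\quad \beta \to \infty:\ \text{perfectly rational}
```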
There are also obvious questions about the fact that individuals and cultures have very different values, and about what should be done there; and about the fact that we're not really individual decision-makers. Our preferences are shaped by evolution and by society, not to benefit us as individuals but, to some extent, to make sure our species, our kin group, our genes get propagated and survive and succeed. So we act in ways that moral theorists call clearly irrational: for example, the preference people seem to have for alleviating local suffering rather than suffering on the other side of the world, even when the distant suffering is much worse. If you look at one individual decision in isolation, it seems wrong to devote more resources to poor little puppies in your hometown than to someone starving to death on the other side of the world. But think about how everyone in the world behaves: if everyone found the single person suffering most and gave all their money and resources to alleviating that one person's suffering, that wouldn't work either; that would be a total disaster. So locally irrational-seeming behaviour may be optimal, or at least better, on a more global scale, and it's not clear we should even analyze the values of individuals, because in isolation they may not make sense.

OK, so why do I think this is a positive idea? One reason is that we have time to do this, and we have lots and lots of information: almost everything we write, almost every movie, every newspaper story, is about people doing things, often to other people, often against their will, and there are lots of recorded reactions of people. So presumably a system capable of being superintelligent would be able to solve this cooperative inverse reinforcement learning problem from that massive amount of data and come up with a reasonable approximation to what we might think of as an admirable set of human values. The other reason I'm optimistic is that there's a good, solid economic reason to solve this problem long before we reach superintelligent systems. If a system doesn't have common sense about what humans do and don't value (if a domestic robot, noticing the fridge is empty when it's supposed to feed the kids before Mom and Dad get home from work, decides to put the cat in the oven for dinner), that only has to happen once for the domestic robot industry to basically be eliminated. So there's a really strong economic incentive to get the value systems of things like domestic robots right.

When I say this kind of thing, there are a number of standard responses; I'm sure others in the audience who've said these things to AI people have heard the same range. A common one: "it'll never happen." This is the argument that it's fine, we'll run out of gas before we go off the cliff. I usually like to quote a little incident from history. This is Rutherford: on September 11th, 1933, he gave a speech to the British Association for the Advancement of Science, saying that "anyone who looks for a source of power in the transformation of the atoms is talking moonshine." This was not a one-off comment; he said it over and over again in newspaper articles, interviews, and speeches. This particular comment was reported in The Times the next morning and read by this gentleman, Leo Szilard, who got so annoyed that he invented the nuclear chain reaction. So we went from "never" to about 17 hours, and that's why I'm not ready to say it will never happen.

A few years later, this is what Szilard wrote in his notebook: "we switched everything off and went home. That night, there was very little doubt in my mind that the world was headed for grief." He had just demonstrated a sustained nuclear chain reaction. One of the reasons the world was headed for grief was that this work was being done in the context of an arms race, with Germany and then later with the Soviet Union, and by then it was too late to think about how to control the technology or make sure it was used only for peaceful purposes; they had no choice but to go ahead and produce weapons. Interestingly, H. G. Wells wrote about atomic bombs in 1913. He didn't quite get the technology right: instead of one enormous explosion, his bombs made an ordinary-sized explosion that continued for several weeks, instead of all at once. But he had the rough magnitude of the energy about right, which was already known.

So my view is that I don't want to come home and write that in my notebook, and nor does any other AI researcher. I think we have to solve this problem of control, and the sooner we solve it, the easier it will be. It might turn out to be easy; that would be great. Like the people who worry about asteroids: they could take out the backs of a few envelopes and figure out, oh yes, fine, we just need this kind of rocket, this kind of propulsion system, and this kind of explosion to deflect the asteroid; anything up to 200 miles in diameter we can easily handle.
OK. When I say this kind of thing to people, there are a number of responses, and I'm sure there are other people in the audience who've been saying these kinds of things to AI people and getting the same range of responses. A common one: "It'll never happen." This is the argument that it's fine, we'll run out of gas at some point before we go off the cliff. I usually like to quote a little incident from history. This is Rutherford: on September 11th, 1933, he gave a speech to the British Association for the Advancement of Science in which he said that anyone who looks for a source of power in the transformation of the atoms is talking moonshine. This was not a one-off comment; he said it over and over again in newspaper articles, interviews, and speeches. This particular comment was reported in The Times the next morning and read by Leó Szilárd, who got so annoyed that he invented the nuclear chain reaction. So we went from "never" to about 17 hours, and I am not ready to say it will never happen.

A few years later, this is what Szilárd wrote in his notebook: "We switched everything off and went home. That night, there was very little doubt in my mind that the world was headed for grief." He had just demonstrated a sustained nuclear chain reaction, observed with [unclear] and so on, and one of the reasons the world was headed for grief was that they were doing this work in the context of an arms race, with Germany and later with the Soviet Union. It was too late at that point to think about how you might control this technology or make sure it was used only for peaceful purposes; they had no choice but to go ahead and produce weapons. Interestingly, H.G. Wells wrote about atomic bombs in 1913. He didn't quite get the technology right: instead of one enormous explosion, his bombs produced an ordinary-sized explosion that continued for several weeks rather than going off all at once. But he had the rough magnitude of the energy about right, and that was already known.

So my view is: I don't want to come home and write that in my notebook, and nor does any other AI researcher. I think we have to solve this problem of control, and the sooner we solve it, the easier it will be. It might turn out to be easy; that would be great. Just like the people who worry about asteroids: they could take out a few envelopes and figure out, oh, fine, we just need this kind of rocket, this kind of propulsion system, and this kind of explosion to deflect the asteroid, and anything up to 200 miles in diameter we can easily handle. It would be lovely to have that kind of outcome; then we could all just go back to doing AI. But so far no one has thought of a clear and easy solution to this problem.

Inevitably there will be more and more pressure, both from commercial interests and from military interests, and military interests are maybe much less constrained by the public-relations problem of putting the cat in the oven: if their systems work 61% of the time, they're pretty happy. They have a different utility function from most of us. The other thing is that this kind of work takes a long time. I don't know whether we need a regulatory framework, a legal framework, treaties, certification of systems before they can be released, that kind of thing. But if you look at how long it takes to go from, for example, horror over chemical weapons to an actual treaty, or from horror over biological weapons to an actual treaty, it seems to be on the order of 60 to 70 years. I don't think we have 60; maybe we have 70, I don't know, but I wouldn't be too sure.

Another response is "It's too late." This was George Bush's response to climate change: having denied that it existed, he then moved to "It's too late to stop it." I hope not. A version of that is "There's nothing you can do: this is research, and you can't pass a law saying that certain types of equations shall not be written on whiteboards. That can't possibly work." And really, the breakthroughs we're thinking about are not the creation of a disease organism with a certain kind of DNA or anything like that; what these breakthroughs are composed of is mathematical equations and understanding. So it is a little more difficult. But there is a very successful precedent in genetic engineering. At a workshop held in 1975 at Asilomar, a retreat in California, there was a very conscious decision that, even though one of the main purposes of understanding genes had been the hope that we might be able to make ourselves better, that we might understand how we're constructed and improve that construction to eliminate defects, going down that route would probably end us up with a society we wouldn't like. We might be happy in it, but from this vantage point it doesn't look like a society we want to get to. So they adopted a voluntary ban, with various kinds of enforcement mechanisms, and it's been pretty successful: as far as we know, there have been no sustained heritable modifications to the human genome, even though that has been feasible for a long time (and has just become much more feasible). One of the main mechanisms is that you can't get FDA approval for any medical treatment based on such modifications. There's a new method now called CRISPR, which is a much more accurate and straightforward technology for modifying the genome, and there was another workshop held earlier this year where they went back and revisited the question.
One of the original reasons for not wanting to do it was the risk of producing people with very serious deformities or other kinds of genetic problems, from not understanding well enough what you would produce. Now people have a very good grasp of how to do the editing and can be very precise. They revisited the question and, by and large, decided to keep the ban in place. I think it's much weaker now, and there are some groups in China that deliberately went ahead and started doing experiments, but I think the consensus will hold, though it may need some more exhortation. It's also a little different because of how the medical field works in general. Civil engineers automatically know that bridges are not supposed to fall down; they don't have "bridge design" people and then separate "making sure the bridge doesn't fall down" people. It's the same person, and that's just what they do. The same is true in medicine: it's simply part of what you do that work in medicine is designed to benefit humans. There is an asterisk on "pervasive", though, because there's a subculture, the synthetic-biology garage-hacker subculture, that deliberately takes the opposite view: that we should be able to do anything we want, and that any kind of constraint, whether voluntary or legal, is anathema to our ethos as independent inventors. I'm not sure I agree with them.

OK, another response, particularly common because many of the people raising these concerns have been outside the field of AI: "You don't know what you're talking about. You're just Luddites trying to stand in the way of progress, and we all know what happened with the Luddites: the Luddites were wrong." We talk about all those silly people who said, when the automobile was invented, that it would result in carnage on the roads, that thousands of people would die. What idiots, right? But actually, no. Certainly my goal is not to stop AI research, and I don't think that's the goal of most of the people raising these issues (it is the goal of some of them). For the most part they want AI to happen, because of all the good things it could bring, including some very far-out ideas about what AI might do for the human race in the positive direction. Those things can't come about if these questions are not addressed. So we want to allow AI to continue, because it can be very beneficial in lots of ways, but we have to solve the control problem first.

We want to change the field so that it feels like civil engineering, or like nuclear fusion. It would be easy to achieve the goal of nuclear fusion if you defined that goal as producing unlimited amounts of energy, just like unlimited amounts of intelligence. Then, wow, they succeeded in 1952: they created a hydrogen bomb explosion, unlimited amounts of energy, more than we could possibly use. But it wasn't in a socially beneficial form, and now containment is just what fusion researchers do. Containment is what fusion research is; that is the problem they work on, and the part about generating energy will simply follow once they figure out containment. This is what we want for AI. We should stop calling it "ethics of AI", putting it in a separate workshop, and imagining that there's someone who will go around looking over people's shoulders, checking on what they're writing and making sure it's ethical.
That is a total non-starter as far as having a real impact on the field is concerned. So I'm happy to be talking at a time when progress seems to be moving in the right direction, thanks to some of the people here in the room and some others. I should mention Jaan Tallinn in particular, who has been pushing this area, with his own resources, for quite some time. There was a conference in Puerto Rico, organized primarily by Max Tegmark, which brought together some of the mainstream AI people, some influential decision makers in the field, and some people, such as Elon Musk and [unclear] and others, who have expressed a strong interest in this and wanted to understand more about where things were going and whether this was the right time to intervene in the ways they are uniquely qualified to intervene. I think they decided that yes, it was. Out of that conference came an open letter called "Research Priorities for Robust and Beneficial Artificial Intelligence", which lots of people signed; within a week we had over 6,000 signatures. A few days later, Elon Musk announced his donation of $10 million to FLI, the Future of Life Institute, at MIT and Harvard, and that money is now in the process of being given out through a competitive grant process. So we're moving along, and all of these things happened just in January.

Then there's AAAI, the main professional organization for AI, which for a long time held a very strong position that ethical and political issues were not part of what a professional society should be concerned with: this was a scientific society, these were not scientific questions, and therefore it wasn't interested in having any official position. That seems to be changing rapidly. There's now an AAAI committee on ethical and societal impacts, we had a debate on autonomous weapons in particular at AAAI this year, and the professional society may move towards a vote on actually taking an official position on that topic sometime later this year.

Even five years ago, the mainstream AI community just didn't take this question very seriously. There was a task force that Eric Horvitz put together when he was president of AAAI that took a look at the question. You might say the committee was stacked, because it didn't include anyone who had ever raised any of these questions; it was really just mainstream people who do regular AI research, and they pretty much dismissed it. They said, yes, this is worth thinking about, but there's nothing really worth paying attention to, and everyone should just move along. That position, I think, has changed dramatically, and not because people have suddenly seen the virtue of the arguments. It's because they've seen that AI is really making progress, and they've thought about it themselves and said: yes, we are moving forward, we probably are going to move along this road, so let's have a look and see where the road goes.

So in terms of the rate of progress: yes, I think we are going to reach human-level AI, in some sense, unless we choose not to. And, fortunately, maybe not too late, the field is changing the way it thinks about what it's trying to do, moving away from this pure, goal-independent notion of intelligence and incorporating the purpose into the definition of what we are trying to build.
OK, I'll stop there and take questions. Thank you.
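A note on the term "cooperative inverse reinforcement learning" used in the talk: Russell and collaborators published a formalization the following year (Hadfield-Menell, Dragan, Abbeel, and Russell, "Cooperative Inverse Reinforcement Learning", 2016). A rough sketch of that setup, paraphrased rather than quoted:

\[
M = \big\langle S,\ \{A^{H}, A^{R}\},\ T,\ \{\Theta, R\},\ P_{0},\ \gamma \big\rangle
\]

A CIRL game is a two-player Markov game between a human \(H\) and a robot \(R\): states \(S\), action sets \(A^{H}\) and \(A^{R}\), a transition model \(T(s' \mid s, a^{H}, a^{R})\), and a parametrized reward \(R(s, a^{H}, a^{R}; \theta)\) whose parameters \(\theta \in \Theta\) are drawn from the prior \(P_{0}\) and observed by the human alone. Both players maximize the same expected return,

\[
\mathbb{E}\!\left[\ \sum_{t=0}^{\infty} \gamma^{t}\, R\big(s_{t}, a^{H}_{t}, a^{R}_{t};\ \theta\big) \right],
\]

so the robot's optimal behavior is to treat the human's actions as evidence about \(\theta\): exactly the kind of inference illustrated by the chess example earlier in the talk.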

Related conversations

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med 0 · avg -0 · 108 segs

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med 0 · avg -5 · 133 segs

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med 0 · avg -4 · 72 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med -6 · avg -7 · 120 segs

Counterbalance on this topic

Ranked with the mirror rule described in the methodology: picks sit closer to the opposite side of the spectrum from this page's score on the same axis (lens alignment preferred). Each card plots this page and the pick together.

Mirror pick 1

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -10.64 · This pick -10.64 · Δ 0

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript)

Med 0 · avg -0 · 108 segs

Mirror pick 2

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -10.64 · This pick -10.64 · Δ 0

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript)

Med 0 · avg -5 · 133 segs

Mirror pick 3

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -10.64 · This pick -10.64 · Δ 0

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript)

Med 0 · avg -4 · 72 segs