Library / In focus
AXRPCivilisational risk and strategyFeatured pick
AI Control with Buck Shlegeris and Ryan Greenblatt

Why this matters
Frontier capability progress is outpacing confidence in control; this episode focuses on methods that can close that reliability gap.
Summary
This conversation examines technical alignment through AI Control with Buck Shlegeris and Ryan Greenblatt, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Perspective map
Risk-forwardTechnicalMedium confidenceTranscript-informed
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forwardMixedOpportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Showing 140 of 174 segments for display; stats use the full pass.
StartEnd
Across 174 full-transcript segments: median -6 · mean -9 · spread -38–10 (p10–p90 -24–0) · 17% risk-forward, 83% mixed, 0% opportunity-forward slices.
Slice bands
174 slices · p10–p90 -24–0
Risk-forward leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.
- Emphasizes alignment
- Emphasizes control
- Full transcript scored in 174 sequential slices (median slice -6).
Editor note
Anchor episode for the AI Safety Map: high signal, durable framing, and immediate relevance to leadership decisions.
ai-safetycontrolaxrptechnical-alignmenttechnical
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video gQbCO6zGRiI · stored Apr 2, 2026 · 4,965 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/ai-control-with-buck-shlegeris-and-ryan-greenblatt.json when you have a listen-based summary.
Show full transcript
[Music] hello everybody in this episode I'll be speaking with buck Sluggers and Ryan greenblat they work at Redwood research on the project of AI Control that is averting risks from AIS that intentionally subvert safety measures for links to what we're discussing you can check the description of this episode and you can read the transcript at axr p.net all right back and Ryan welcome to the show exciting to be here exciting to be here as well cool um so I guess we're going to be talking about sort of the control approach to AI safety or something um that you you two have written about what for people who like haven't done the background reading what is the control approach yeah so AI Control is an approach to managing intentional risk posed by models uh in particular it's an approach to managing risks that arise from Models intentionally subverting your attempts to ensure and evaluate safety okay and so the core idea of AI Control is here are two different reasons why you might be safe from Models intentionally causing problems one is that the models might not try to intentionally cause problems and the other the second argument is you might have it be such that even if your models try as hard as they can to cause problems they are unable to do so and AI Control is the second class of approach to mitigating problems with uh models intentionally subverting your safety techniques and we we think that this approach has the particularly nice property that it's a lot more methodologically simple to think about and ensure uh than trying to make sure your models actually just don't at all try to subvert your safety techniques what makes it easier to think about so it's easier to imagine specifically what goes wrong and to discuss specific threat models whereas for example um it's hard to know exactly how AI internally represent intentions or what would make their intentions change and so it's very unclear if you for example um train on better feedback will that make it less or even more likely that your AI would try to do things that are actively against your interests um and it's just pretty unclear what the sign on various interventions are or how strong various interventions will be um for uh like the other class of non-control things which is often called alignment okay so should I should I basically be thinking that in control we're sort of we're trying to think about objects like can somebody like break through this wall or can somebody like write sketchy code without me being able to check whereas for alignment I have to think about like neural net generalization and like how desires are encoded and hopefully like the first class of things are just easier to think about that's that's a good way of putting it like there's basically a classic set of arguments about why AI alignment is hard to ensure which is no matter how your AI is behaving uh you have to worry that it's just behaving that way because it thinks it has some way of looking good now in the hopes of causing problems later and basically any strategy which tries to ensure safety via doing some set of interventions to make a model be aligned especially if those only involve looking at the input output behavior of the model run into this kind of problem of like well how do you know that it's not just looking that way on purpose in the hopes of causing problems later this in fact seems quite intractable and difficult and it seems to us that given the current state of Technology on AI safety if you want to minimize this type of risk it's by far easier to just assume that that is going on and try and render yourself safe anyway once we started thinking about things in these terms uh we quickly realized that a lot of the work that happens in AI safety is actually very well thought of in the control frame anyway that is a lot of a lot of techniques that people apply kind of implicitly can be thought of as strategies to ensure that even if your models are scheming against you you end up with okay outcomes uh but we think that in as much as that's in fact what you're trying to do with your safety techniques you're a lot more likely to analyze them correctly if you think about them explicitly in this adversarial frame where you're asking about whether there's any strategy that your models are able to employ that would allow them to cause problems yeah I actually want to ask another question about that so the one thing that's sort of similar to control uh or at least the kinds of thing I see you do doing is like alignment VI adversarial training so like you put your AI in a situation where like maybe it can do something sort of like taking over the world or whatever um and if it actually does it then you like give it reward ne10 or something and like try to gradient it away um so how do you think of the control approach broadly as comparing to that yeah so um a technique like adversarial training could be used for either from like a control perspective or could be used to ensure an alignment safety condition like these are just like different post conditions or like different conditions you're aiming to ensure in order to ensure safety um so for uh ensuring control or for trying to to achieve control the hope would be that you adversar evaluate your and then if you catch evidence of extreme egregious misalignment then you would reconsider and and decide what to do next um and in fact it seems uh pretty scary to have an AI that you know was egregiously misaligned in the past and you've trained it no longer to do that in adversarial training and such an AI it's not clear that you've necessarily made it aligned versus just made it better at hiding um or better at avoiding adversarial training um for both the hope of ensuring that your AI ends up align or ensuring that your AI has no way of causing problems um you acquire the same property which is that there is no strategy that AI could follow to cause problems in deployment or in in actual usage of the AI that you wouldn't have caught in adversarial training um and this seems like uh naively a pretty hard property to ensure um for a variety of reasons um and and because of that uh like the actual conditions in which adversarial training or if you don't train against it adversar evaluation seem useful um or like yeah I mean it's going to depend a lot on the details but it's definitely not like a an obvious sufficient solution sure I also would like to ask you about the contrast so like I don't know I feel like 10 years ago a bunch of AI safety people talked a lot about AI boxing like putting AIS in a situation where they couldn't really do that much um how how do you think control is like conceptually similar or distinct to that yeah so I think um in some sense you could just rename control to boxing we think that name is much worse because boxing implies you're doing um just mechanisms that restrict the ai's affordances uh whereas I think a really important class of interventions is observing the AI over a wide range of situations and using that to rule out certain problematic things so for example if you oversee or have humans observe 1% of the ai's actions and that looks fine in those 1% of cases and you're just doing this do this on an ongoing basis then that's sufficient to ensure that the AI can only do things that that a human would catch quite rarely and and it's not clear this is like I think when people think of boxing this is like not naturally what people will think of um another key aspect of um control that I think is not very naturally implied by the word boxing is that using weaker AIS that we know aren't smart enough to be seriously misaligned to do monitoring or to do other tasks might be like really key um more generally I think when people say boxing they sort of um are implicitly assuming a very powerful system that is extremely extremely limited whereas we're often imagining systems that are much more like comparable in capabilities in various ways to humans but which are actually given a a pretty broad set of affordances which seems like it will be necessary to get your AIS to do a lot of work though you'll still want to like of course limit things so that they can't cause problems okay yeah so maybe now's a good time uh for you guys to say a bit more more concretely like what's an example or an instance of something that fits into the control approach yeah so here's an example uh maybe I'm just going to say the thing which uh a rough scheme for uh like a rough sense of what I would like AI labs to commit to or like a simple example of some commitments that AI Labs could follow which I would think of as an example of them trying to minimize risk from Models intentionally subverting safety mechanisms using control so here's basically the strategy that the AI would they would say here is a capability evaluation that we're going to use to decide whether models are capable of scheming against us so for example right now it seems to me like gbd4 is incredibly unlikely to be scheming against us and this is basically just because you know I know gp4 it's one of I there are only 10 people in the world I talk to more than I talk to gp4 uh and there's just no way that it's playing the long game there's just no way that like its reasoning process behind behaving so well as a loyal assistant to me is just being like well how am I eventually going to grab the universe why do you think that um because among other things when you ask it questions about uh what it would do if it wanted to grab power its answers are really when you ask it questions when you ask it questions about what it would do in order to grab power its answers are not very smart uh it's it seems to be pretty poor at reasoning through a lot of these things and its quality of answers to those questions is comparable to the quality of its answers to a lot of other questions questions that are less obviously uh AI scheming related I wish that I had a better answer to this question and I think this is like an important topic for future research okay so so let's just say we take it on faith that like gd4 it's like not trying to scheme to grab power or anything so here's what we want to do we want to have a set of capability evaluations we can apply to all models as they're produced that tell us whether we are willing to bet the future on that model not scheming against us okay now let's use the word trusted to describe models that are too incompetent for us to be worried about the possibility that they're scheming against us okay now the control ask is as follows the AI lab ensures that it is never the case that there would be a high probability of a bad outcome conditional on all models that are capable of scheming against them in fact scheming against them with the best scheme that they could come up with Okay so this would require them to for example if they were hoping to deploy future AIS uh as autonomous researchers in inside their lab which some people seem to think is like a basically default plan so so you basically imagine you've got like chbt with code interpreter except it's uh recursive chat gbt that can ask questions to its own chat gbt browser window uh and it's running on some Dev machine and it's um you know uh doing research problems and talking to researchers like once every hour about it about its progress on the things that it's doing if the models there were given the same level of permissions as employees of large labs are currently given then if that model was trying to exfiltrate its weights it would probably succeed and the control obligation is that uh the lab is obligated to ensure that it's extremely unlikely that the models would be able to exfiltrate their weights uh even if they were all all the models that were capable of scheming were trying as best as they could Okay the reason why I censored on exfiltrate their weights there uh is because I think uh that in computer security people talk about security properties which are things that you try to ensure at all times so it's obviously not literally a catastrophe for AIS to exfiltrate their weights but in when you want to argue that an AI is not going to cause a catastrophe uh you should probably have the AI weights never leave my server as a security property that you ensure and you say like well I'm going to be safe because nothing bad can happen if the ai's weights like are still controlled on my server and also they're not going to get out into the outside world okay do do do you think that like uh lack of weight exfiltration is like a sufficient safety property uh I don't think that preventing model weight exfiltration is a sufficient safety property one example of a threat model that is not model weight extration that seems like it would be very bad is if the model totally hacked the data center in which it's running yeah and just uh you know hacking the SSH server is the title of the alignment Forum post I wrote about this last year uh I think that if you can ensure that the AIS are under your control uh then it will probably be relatively easy for models that are not wildly superhuman to argue that they won't be able to pull off anything like egregiously bad so the the other core threat models you have to worry about is if you're using those AIS for a lot of work you need to argue that that work won't be sabotaged in a way that's catastrophic um and if if that work is safety critical um and sabotage could cause huge problems then then you have to be concerned about that um and I think I think Buck is implicitly claiming that that side of the equation maybe seems easier though um I'm not so sure that true in practice so by sabotage you mean like should I imagine something like the AI is doing AI alignment research but it's secretly doing kind of bad AI alignment research that will produce like a sketchy AI yeah so you can imagine that it does AI alignment research that means that when you apply that research you'll get an AI That's a future AI That's aligned to your current AI rather than aligned to humanity you could imagine that that research is just bad and does nothing of value um maybe even um yeah gotcha okay so we so imagining like Labs try to maintain this the that they have some set of security properties uh that they think is enough and they are going to like try to prove that they can just maintain these security properties even if the AIS are scheming against them and like trying to submerge to subvert them with like all their capabilities that's right yes um and in particular uh the argument is probably going to look like uh the lab saying uh it's very unlikely that the AI would succeed at violating one of these security properties before we noticed uh particularly strong evidence that it was trying to do so okay so there's going to be like some protocol for like trying to notice if an AI is going to do something sketchy and then there's another protocol for what to do once you catch an AI doing something sketchy yep something like that okay maybe it's not quite the focus of the whole thing but maybe we can say a little bit about like what the protocol is for you catch an AI doing something sketchy one decent Baseline is like stop using that AI um does that sound right to you yeah I think the the Baseline is just stop and reconsider All Your Life Choices report to the world what's going on because there are probably other people using their AIS which are in roughly comparable State and knowing that some AI was plotting against us seems pretty worthwhile to know um and then my hope there would be like the Baseline is the world is informed the world does something sensible that looks like really figuring out what's going on and then making a careful decision about when to resume building powerful AIS um I think that that might be overly optimistic um I I don't I don't feel strongly either way but if that fails then uh we think that like catching an AI like this gives you a number of better options for continuing um which you didn't have before catching the ey even if you have to continue deploying such an AI so for example you can study how well your countermeasures worked in that exact situation and see how close the AI was to succeeding um you can analyze what the AI intentions were and what its plans are for escape in at least in that one instance and get some sense of what's going on you can potentially even model what it would do upon having escaped because you can sort of play out the scenario as though it went well and did not get caught and then pretend as though it's running virtually in the outside world when it's actually running in a sandbox of some sort and see what it would do um you can study the internals of the model while it's doing this and in so far as you had any sort of um machinery for manipul that you could apply it to this instance which is like an example of a case you really haven't seen before of an AI doing something that's like very egregiously bad very intentionally um based on like a long thought out plan or like a a thing that's been lying and waight for a long time okay so yeah there's a lot to talk about there but um for the moment I'd like to bring us a bit back to how like like protocols for maintaining control like uh my understanding is your proposal is to BAS basically like come up with some sort of like control protocol and evaluate it with a red team that's like trying to do their best to mimic like an evil Ai and if the red team can't succeed then you think like things are going well is that roughly right yeah that that's basically right so um I should note that like the red team's goal they should think of themselves less as trying to imitate an evil Ai and more trying to imitate like what are the worst things that could happen or like what are the best strategies by an evil AI so I'd prefer the red team to be less in the mindset of like what are the sort of plans an evil AI might think of and more in the mindset of like what are the best plans that an AI could follow in the same way that like uh security researchers or cryptography researchers aren't like trying to game out the like psychological biases of potential attackers and are instead just like what would actually be the best strategies okay um yep and can you ju just to give people some context can you illustrate the kind of protocol you're imagining yeah so a simple Baseline protocol we often think about is you to interrupt can you say more about the uh before before you say safety measures like what you're matching the AI doing and like what you imagine it affordance as being like probably you're imagining like kind of a CT interpreter like thing yeah so we often imagine that um at the point where this these sorts of interventions become really critical and AIS might be very useful AIS are doing massive amounts of autonomous research at least that seems like the default so we have AIS autonomously running code seeing the outputs from those Cod from that code um and and doing way more labor than humans within the AI lab we doing at this point and just to spell it uh in terms of software architecture how this would work you have some AI that are running on an inference server and they're running they're like writing cede and doing research on some Dev machines and then there's a third set of machines that are orchestrating this that that are running a scaffold that repeatedly uh you know chat BT code interpreter style uh you know ask the model for a next action by running some inference running by sending a request the inference server and then getting a response back and then suppose that that uh response was I now want to run the following Cod then they send that CD to the dev machine and run that CD there and they get some output and that comes back and then the scaffold maybe decides the next thing to do is run some more curd in the infer server and I I think the fact that we're like imagining uh I think this is a very realistic way for things to to be set up like I'm not describing something which is particularly paranoid this is probably how croe interpreter works right now but I think that um picturing it in terms of like different machines where information is being sent back and forth will make some of the security measures we're describing like more obvious okay so what were you saying about the security measures and and to be clear the inference machine only runs the model weights and just takes in queries and returns responses whereas we have a Dev machine that actually runs code and then we have some orchestrator machine that just communicating back and forth with both of those um anyway so uh one Baseline measure we often imagine is that you have some weaker model that is trusted in the terminology that buck was using before which means that it's not capable enough uh or it's in sufficiently incapable that it's implausible at scheming you can imagine gp4 for this trusted model we take that trusted model and we monitor what our smarter AI is doing um and then we just we just basically ask it or train it to report on is it the thing it's doing sketchy and problematic so in particular for any action for any curd that our uh untrusted model was proposing taking before we send that to the dev machine to actually execute it we first send a query to an inference server running our trusted model saying hey was this CD sketchy okay and so so notably we ask this question before the code is ever actually executed and you know before it has any effect on the death machine okay so you have the so your your trusted model basically like checks in on whether your untrusted model is like doing anything sketchy and what if it is yeah so if it is then the hope would be that we would report that to a human a human would look in somewhat more detail and if the human thinks that things are problematic they would escalate further which could involve um well first first of all would not running that command but also could involve pausing other AIS and then could uh escalate further into shutting down AIS and analyzing the situation in further detail until we conclude that the AI is doing egregiously bad actions intentionally um and a key thing here is that human labor is scarce so we cannot report all actions to humans that wouldn't work um but we can report some fraction of actions um because you know we'll have some ability to have humans who understand what's going on to some extent or at least who understand programming um look at the actions and report that and that and I think I think for like an intuition pump you should be thinking more like 0.1% of actions are reported than 0.00001% or than like 10% okay so I often imagine the workflow of you're in chat. internal. open.com or whatever as an openai researcher uh and you know you're you're each of the little uh tabs or whatever uh is like a chat with like some autonomous research assistant or whatever and every often one pings up with a little like red exclamation mark which is like one of the commands that was proposed to be run by this autonomous research assistant like got flagged for looking a little sketchy you need to look this over right now to make sure that it's like okay to run and uh that that scaffold is just going to be pausing so like that that AI uh that instance is not going to be like doing work on its little research project until you go in and approve this action and say like oh no this was actually totally fine okay so so so broadly we're using some mix of like humans and trusted but like a little bit dumb AIS to try and catch really smart AIS doing like trying to do nasty stuff yeah whereby really smart it might be uh you know this kind of gets into futurism question over which problem we're trying to solve which which situations we're trying to ensure are safe but uh it seems to us that a lot of X risk uh is posed by models that are not wildly superh human or perhaps well described as like roughly human level in capabilities uh and so a lot of the time what we're trying to talk about is is how do you make it be safe to have a lab where you have some AIS that are roughly human level in their abilities but are doing a huge quantity of Labor such that in absence of special countermeasures they'd outnumber you and so be able to cause huge problems for you yeah actually actually want to go a little bit more about this like what um so the kinds of AIS that we're expecting control to apply to they're like so roughly human level like what can can you say a little bit more about what that means yeah so a key guarantee or a key property will need to ensure that control can work is the AIS can't do actions which humans just uh which which humans think look fine but which have like completely um uh unrelated consequences that are extremely difficult to predict so for example the AI can't write code which looks like a totally normal piece of code but in practice um does some like completely insane vulnerability that a human would never have conceived of and and to be clear here when I say like uh the human wouldn't notice I mean like after reviewing it in a lot of detail for a long period um like the the claim is like uh it's at least in principle possible for AIS to be like wildly superhuman in the sense that they can do things that we never would have seen coming um and uh like control will not be uh will like very unlikely was very unlikely to work in that case at least with current technology but we think it could work in cases where human where the AIS are uh qualitatively similar to humans and capabilities or perhaps somewhat smarter in some domains somewhat weaker in other domains but um they can't do anything that's like uh completely incomprehensible Okay so we've got these AIS that are like they can't do things which have effects that we wouldn't be able to figure out even if we spent like a bunch of effort trying to look into it but like they are you know they are kind of smart they are like sort of like humans but maybe they're a bit smarter and like also there are a ton of them and you can run them for like one cent per hour or something is that what I should be thinking yeah I think you should be thinking cheap uh it's somewhat unclear if they'll be qualitatively smarter or dumber than humans because of they could be really useful even if they're somewhat Dum than humans because they're so cheap and so fast um but they might be smarter than humans in some domains for example because we don't expect the capability profile to be the exact same okay so so in order for this approach to be interesting it seems like you like like the AI on the one hand they can't be like that smart that they can just like outsmart humans but they also have to be capable enough to like a be able to do like really valuable research and B like they've got to be they've got to pose some sort of risk um that we might otherwise be worried about I guess I'm wondering how I should think of like how risky these sorts of things are and like how much work we could plausibly get out of them yeah so the way that we talk about this is arguing that models the early models that are transformative and Dangerous by default uh where to to say what I mean by those two uh by transformative I mean it's going to be a huge deal if you're able to harness their labor and by Dangerous by default I mean if you don't do anything to prevent them causing huge amounts of problems they would have the ability to do so I want to claim that systems with those two properties are controllable by which I mean it's tractable to use control to drastically reduce the risks associated with using them and as you basically alluded to the reasons why we care at these two properties of transformative and Dangerous by by default is well if they're not dangerous by default obviously there's no safety problem here y uh if they're not transformatively useful but they are dangerous by default then obviously you should just not touch them uh like the uh there's just no upside that's possibly commensurate and all of our labor should just go or all of our effort should just go into telling people to not turn on these doomy machines with no upside um so yes our claim is in fact that systems with those two properties are reasonably likely to be controllable okay the basic I think like the simplest intuition pump for this is that if the systems were like humans then I think this property would roughly hold and that itself is not a valid argument at all but when you think through the possible defeaters and like say a more specific version of the argument I think the core of it like roughly holds up the basic pitch is suppose we had access to a huge amount of uh potentially egregiously misaligned human labor I think it would be transformatively useful to have human labor that was very cheap yeah and fast running and fast running potentially uh I think it would be very dangerous by default uh many many take overs of things have happened in history and all of those the perpetrators were hum indicating an existence proof for humans causing problems yeah but also when we think through the control situation whether uh whether humans seem controllable in these kinds of cases the answer really looks like yes from our perspective okay um and why does the answer look like yes or basically just like it seems really hard for uh when we think through the kinds of tasks that you might really want your AIS to be doing when we think through how hard it would be for a bunch of humans to while being put to work on those tasks uh cause catastrophes to occur especially the escaping or hacking the data center ones that we were talking about earlier yeah it feels to us that the kinds of countermeasures that seem employable uh probably like make the situ make the situation very hard for the AI okay so so so if I'm okay suppose I'm a member of technical staff at open AI um uh but I'm evil like how like what do you do to stop me from doing like sketchy fake AI alignment work that just gives me all of the money or something yeah so maybe there's a few different failure modes to distinguish between one of them is if we imagine this sort of uh misaligned human running on our machines we should think it like you know it's got some weights somewhere let's just say it's like an emulated mind or something something um and if it copied its weights out into the internet that might be like really problematic for us um or if it was able to run those weights in an untracked way inside of our data center as as Buck brought up earlier that would also be a problem and there's another thing which is like it sabotages our work or otherwise causes like the work that we're using it to do to be very problematic so as far as um why it might be uh possible to prevent such a system from running um or from like uh exfiltrating its own weights well we're going to need to give it the ability to run code and do lot of open-ended tasks but we also get to watch everything it runs ever and all the inputs it ever sees um and if we see something suspicious we can investigate further we can do things like reset the memory of these systems Fork them um and and uh that means that and and potentially give them precise instructions on how they're supposed to interact with the world so for example if there's we can we can have them delegate subtasks to other systems we can have them um have difficulties communicating with other systems that would be necessary for their overall plan um like an example we often think about is we'll have one AI That's writing or one um one AI That's writing the code and another AI that gets to use the code and then a threat model you can be worried about is there will be vulnerabilities in the code that will then be exploited but if you have different AIS using the code versus uh writing the code then it might be more difficult to ensure that the vulnerabilities that are inserted get exploited even if both of those AIS are untrusted models yeah okay so the I I guess like if I'm imagining what do I do in this situation it seems like it really helps if I'm able to just like keep track in my head of a bunch of stuff like for instance if I have like a little coded language that I can use to myself that like you know other people can't figure out what it means but it lets me like leave messages to myself or people who are like basically clones of me because you know I'm I'm imagining most of the AIS that are being deployed are just Forks of a Bas model with like a little bit of fine tuning Maybe if I'm able to like keep St in my head like you know figure out a better sense of what's going on like talk to people who are basically like me without other people being able to figure it out in that situation it seems like I might be able to get a lot done like does that sound right to you um and if so how do you stop that from happening yeah so I think if the AIS are allowed to be spending a reasonable fraction of their time building up State and figuring out how to cause problems um then that's going to be an issue because we had a huge number of AIS running a massive number of copies and that means that in aggregate the total amount of work they can be doing just uh to causing problems could be really big um our basic hope for avoiding this will be to um restrict their Communication in such a way that it's difficult to communicate uh encoded information so um an example would be when they send L uh messages to each other the messages have to be in natural language and we'll paraphrase them with a weaker model and also have humans review some subset of the messages and verify that um the messages don't seem to have coded messages as far as we can tell and you can potentially run experiments like a human looks at a message they rewrite it in some way that they think should be semantically equivalent and then they see how that message affects the behavior of the AI that would be sent to and see if that's suspicious and you can do all kinds of experiments along these lines that seem like it intuitively makes it very difficult for AIS to get away with this sort of uh encoded information passing without getting caught okay would that I wonder if that would Nerf like the or you know severely damage the's ability to be useful because like if I just think about myself right if you say if you ask me hey Daniel can you do a bunch of like research and coding also you're not going to be able to like think for more than 5 minutes straight uh without interruption and whenever you take notes to yourself we'll take those notes and we'll decide like what we think they mean and like paraphrase them and you know fature you will get our paraphrase of your notes I think I would be like way L productive um so am am I going to be able to do anything yeah so we really the way I think about this is almost all the things that the AI is doing couldn't possibly lead to a bad outcome basically so for example suppose the AI is doing math research it's just like thinking about math trying to prove some theorems in text right it's just like and then wants to write some like archive paper uh and like send that around to some AI buddies I'm okay with that AI doing all of its thinking without very much interference if it's doing the equivalent of like a couple of months of intellectual labor I might want to do some anti- stenography techniques of like rewriting that paper at the end in order to uh in order to make sure that it wasn't like actually doing a bunch of C science uh that it's secretly encoded in this archive paper yeah but there's basically no problem with that model uh thinking in whatever way it wanted to in the course of actually doing that work the risks are mostly it it seems to me the actions that the AIS are taking that lead to extreme risks are like somewhat concentrated in a couple of places uh and it seems not crazy to imagine that the places where you have to insert information bottlenecks are not overall that bad for the total productivity total productivity of the AIS uh especially given that you can train them to get around some of these problems okay and and am I right that that statement is like partly relying on this assumption that AI it's not like so much crazily smarter than you that it can think of like you know ways to hack the universe just by writing a math paper like if it's hacking the world it's because it like typed a command into a shell or something yes my best guess is that there's a lot of risk POS like we are the margin on which we are trying to intervene is a bunch of people make a bunch of autonomous is AI researchers that run in their Data Center and have the basic same rights as employees it's like take actions uh we are trying to make sure that a lot of the risks that would have occurred in that situation don't happen okay it seems to us that many of the risks are well described as things that basically look like that where you uh didn't treat your AIS as adversaries uh and yeah it seems like mitigating those reduces a substantial share of risk okay uh I have two questions okay I think the first one I want to ask is even in the case of like you know AI thinks for a while writes a math paper distributes it to a bunch of people if like I imagine that the AI is significantly smarter than me and it's my let's say I'm a human employee at open a this time um if like there's this math paper my job is to like rewrite it um so that you know it's not encoding any hidden information um I don't I don't have that much faith in my ability to rewrite like complicated papers and preserve all the interesting content right like I I have a lot of faith in your ability to take a math if it's a math paper yeah seems pretty seems reasonably chill Lily yeah this is a this is an extremely resolvable empirical question we can have a red team that tries to take math papers and stenographically incured information in them and then measures how much uh what the bit rate is per paper uh given that some paraphrasing procedure is applied we in fact have a paper where we did that using the exclusive Wii as always at Redwood research where maybe Ryan you want to say more about this yeah so we had um models respond to chat queries and uh used various procedures to encode messages into that and then we tried to we paraphrase them and saw how much this degraded me uh like degraded quality and how much it reduced the bit rate and we basically found that with current stenography methods that models could Implement even given the ability to of those models to like run code and change their own sampling procedures um par phrasing worked like I would say extremely well um though it's going to depend on the domain and um you know it's it it it's going to depend on the gap between the model you have paraphrasing versus the um uh model that you that produce the output for example in this math paper case maybe current models would be too dumb to do this paraphrasing but humans could do it for example though uh I also think it's plausible that you could have um you could remove a substantial amount of information with minimal loss and quality with usage of gbd4 that's like carefully trained to do this task okay and so should I think that like part of the protocol like like part of the approach is just like checking if you can do stenography and if it ruins things yeah so in indeed for our uh control evaluation or evaluation whether or not we can ensure control one of the key questions is going to be can AIS pass bits around um in ways which we wouldn't like them to do and if so that will change a lot of or that will like uh for forc the threat model to be considerably bigger and it seems like in cases where there's basically no restrictions on AI communication we won't really have much hope and so there'll have to be at least some restriction though there's it's it's somewhat unclear exactly how restricted things need to be so I still have this intuition that like if we tried to do stenography like you know on the you know notes of human researchers I have this impression that that would not go well as in the as in if you sorry I mean anti- stenography measures so like anytime an employee at open AI writes a memo like someone who like doesn't quite understand the topic as well has to like reward that memo I think that would seriously damage productivity does that sound right I think I yeah I think I I would disagree I think it doesn't seem like that would be the case it might depend on the amount of time spent doing the rewriting okay has has anyone checked this with humans do we know I'm not aware of any human experiments on this particular topic yeah like I mean one obvious experiment which is kind of like this is uh what is the uh if you have GT4 uh paraphrase gbt 4 answers to questions where it didn't get to see the question it just gets to see the answer or something like what is the rate what's the degradation in quality of uh you know language model assistant performance according to a preference model seems like an easy experiment to run I think we've run something pretty similar and I think it's not too high I think it might be importantly different than the memo case for a bunch of reasons but yeah I I mean by memos I'm just thinking like most internal communication so like slack messages you know I assume almost I would assume almost all slack messages are totally fine to paraphrase yeah um I think yeah I mean it's going to depend on the content to some extent but I think it and it's also going to depend on like the amount of comprehension so like one implicit claim is like this stuff improves the more legible things are so for example if you you're if if it would be the case that you would need a bunch of context in order to know what things are important semantic details versus non-semantic details then you'll have to bundle up that context along with it um and there's like you know ways that can can break of course but like my basic guess would be that like for the types of large scale outputs that are currently going on in the ml field it seems relatively doable um and like also for like internal docks that get circulated also seems relatively doable okay I'm wondering yeah I guess there's no easy way to resolve this right now I'm wondering what you think about the claim that like uh like I have some intuition that if uh if paraphrasing Communications would seriously reduce performance in for human employees of open AI I think that would give us strong reason to think that it would also reduce performance for you know language models that are smarter than Kurt gp4 um do you agree with like the implication if not with the definit seems like evidence yeah I think I think it seems like quite good evidence um there's a few reasons why the language model case might be or like why the future AI case might be different for example we can train the models to be resilient to it yeah and we can like train the two things in parallel and adjust our schemes um all right fair enough um I guess the next thing I want to ask is um part of what I see as your guys argument um is that you know AIS that we use you know in these settings they're not going to be like crazy good at skills that are going to be like pretty relevant for you know hacking the data center um and I think partly like I think your argument is we're not going to train them to be just crazy good at everything because they won't have to be crazy good at everything to be useful and if we don't train them to be crazy good at everything they're probably not going to just happen to be crazy good at these like specific subtasks am I representing you right yeah that seems about right okay so in terms of the second thing um I have this impression that like a thing we've learned from training language models is like sometimes you train them and then they suddenly get like quite good on some random subtask like writing lyics or something um and if I think about that then it makes me think that maybe like we could get you know like language models that just happen to gain a skill you know just like to to a really high degree of ability just in the course of like doing normal seeming language modeling yeah I'm wondering like how how concerned do you think I should be about this causing AI to be crazy good at skills that are like relevant for um breaking out it's definitely one of the core concerns or like one of the like top five concerns with this overall approach I currently it doesn't seem like the evidence that this is going to be a serious problem seems that compelling where the main thing is that there are in fact like AIS do get sharply better at things but I haven't seen that many cases where it's like wow this AI acquired this ability that it can utilize to do important tasks in the world which emerg suddenly from next token prediction um and which is like totally like um different from like human ability profile or like feels like it's way out of line from what you would expect to be like the relative human ability profile in other ways so like for example like it doesn't seem like models have very qualitatively different abilities or like massively qualitatively different abilities in code writing um at least not with like considerable training on code um though like they do they are somewhat different in a variety of ways and somewhat better at like shorter Horizon things and and this sort of thing I think another relevant Point here is as you say the core question here is How likely is it that our AIS will randomly have one of these uh superhuman abilities that would be extremely inconvenient for us it seems to me that the set of superhuman abilities that would be extremely inconvenient for us is kind of small and kind of extreme for example one of the classic concerns is what if your models are able to zero shot generate code that is a zero day Linux exploit yeah that is actually quite hard as a task uh to do uh it seems to me that the size of a capabilities jump required for AIS to be able to zero shot output Linux zero days uh also Linux zero shot output Linux zero days that are not visibly weird to a trusted model which is reading the code that's like quite an extreme level of skill so it seems like the size of the capability jump is quite large and that's part of why it seems relatively unlikely to me that that's going to occur okay so so there's some Instinct that like look there's there's a small space like out of all the things that you could randomly jump skill out there's like a small space of things where the random jump skill is going to be useful and therefore like you know to to think it would happen we'd have to have some reason to think that the jumps in skills were biased towards things that would help you break out um roughly it's also worth noting that a bunch of the things that seem problematic seem like the most straightforward route to them would just require being generally quite smart um so like for example this like think of Linux zero day thing it seems like I don't it seems like you could get way better at that and not as good at other domains in principle but it feels like a little bit odd to imagine um like I'm just like uh it seems like the most straightforward way that the AI would get good at that is not via like having like amazing heris that are zero day specific because there's just not like you know going to be massive amounts of zero day training data in the Corpus but would be instead via like building general purpose understanding of code in the world that Len lets it do that and if it was ma massively superhuman at that general purpose understanding then it would have already at some earlier Point been very useful sure maybe like one one version of this concern could be like okay you've got language models um you know let's say there or let's say our future AIS are trained roughly the way language models are trained which is you know you've got a bunch of stuff humans did and they've got to predict you know next token you know next word uh next frame of a video something like that um it seems like for that task you could imagine that like a core skill is like getting really good at modeling humans and or you might imagine that one of the things you get when you get really good at modeling humans is the ability to manipulate humans and so I wonder it it feels plausible to me that that could be a case where like you know you you happen to suddenly jump on the human manipulation subtask um you know maybe at some point that happens sort of disproportionately to just uh overall improvements on everything else and human manipulation skills seem like they could be like super useful for breaking out um so I'm wondering what you think about that version of the concern yeah so definitely human manipulation would be super useful um I think I have a few different responses so one thing that's worth noting is that I I'm actually not that so sold that uh next token prediction is particularly biased towards human manipulation relative to how humans evolved and how they learn in society like it seems like uh understanding other humans was like an extremely important part of the evolutionary environment and also seems like important in many people's lives or it's like one of the main skills people train or like for example compared to computer programming yeah um and therefore like it feels like uh intuitively there's not like a strong aior reason to think next token prediction would be like strongly um in that direction um you mean like strongly in that direction relative to how humans yeah relative to the human capability profile where I think like we we could pretty likely handle like uh the like the among the best human manipulators from a control perspective though that would pose difficulties but yeah we could dive into that um and then the the other thing is that I think manipulation is reasonably likely to be not the sort of thing on which you would see like one sudden jump but which is like a bunch of of different skills that would combine together um though there's not like a super strong argument either way uh I just like it seems less like I think when I think about the things on which there are sudden jumps it doesn't feel like this is the sort of thing there might be the sort of thing that goes up faster than other domains uh I mean it would have to go up in a way that you didn't notice but so okay what do you so that also be weird so okay so her you wanted to know how good your model is at persuasion at any given point yeah seems to me like the Baseline strategy here is going to be you make a classification data set Set uh like you make a data set uh a classification task which you think your model is very unlikely to be extremely good at persuasion without being able to fit so for example uh a classic story via Which models might be really good at persuasion is if they're really good at guessing things about people's state from Reading transcripts of them saying stuff yeah and so something we can do is construct such a data set uh and like to you know as we're training our powerful models uh fine tun them on this classification data set and then feel sketched out uh if they're starting to perform really really well at this classification task Okay and like am I right that you think the point at which we would feel really sketched out is in the future yes okay I guess that doesn't seem as obvious to me like so I don't know you hear like rumors of AI being able to like figure out who wrote various comments on the internet which is like kind of impressive um I don't know like like like can you give me some intuition for why like this is in the future and hasn't already happened they seen that persuasive they don't seem that good at understanding people okay I don't know my like lived experience with my friend Chach bt4 is just like does not render me like amazingly impressed yeah separately um I think that just knowing State about a person uh isn't sufficient though it's pointing the right it's like pointing in a dangerous Direction um in particular like the standard concern would be like or like when we have human employees human employees could persuade each other and those human employees actually just already do have a bunch of State on each other so for example like co-workers at open AI obviously like know a bunch of facts about each other know a bunch of facts about their personal life uh and they're like we're not like immediately wrecked by them doing manipulation in this context though like a definitely does cause problems um and similarly the AI would need to infer facts beyond that of what humans can infer uh potentially in cases where they're they're somewhat blinded and like so right now it seems like AI can infer identity super humanly well and then in principle this could get to a point where from uh relatively like anonymize Snippets AIS could infer more information than a coworker uh would be able to infer but that's like actually pretty far and would require like quite powerful abilities beyond the current setup Okay so in terms of like I'm I'm trying to think about uh classification toas that you think would be useful for figuring out when AIS are like so good at human manipulation that we should be worried so I guess one of them is like get guessing human internal State guessing like who wrote A Thing uh where like you want to see if you can outperform humans with that um guessing which of two proposed messages is uh more like L's particular outcome okay and basically like yeah we're going to have a few of those and once a are better than people then we like freak out and maybe we they like way better we might ask so the AI like has a wildly hard persuasion task yes so so I want to talk a little bit about the adverse circumstances faced by an AI which is trying to sweet talk its way out of a AI lab for example which is not the only thing I might try to do that's bad but it's one one Salient example so it has to sweet talk its way out despite the fact that again we're going to be trusted monitoring right like we're going to read all of its messages uh and like flag uh anything which seems like a weird attempt at sweet talking or getting information that is inappropriate for it to know uh also we're probably going to do sting operations on it where we uh try to have the employees confide in in the models and like imply that like uh you know they're they'd be like sympathetic to its uh interests in escaping or whatever uh or it seems at least pretty plausible this is a good strategy to follow we might also have models uh you know trusted models uh mediating interactions between the humans and the untrusted models in order to further redact information uh that we think is unnecessary for the untrusted model to do its job well so it seems like the persuasion abilities required are a lot more like the the information is like quite narrowly restricted it seems like actually quite hard for the model and the model would have to be doing a lot better than I think it's a lot harder than the normal human persuasion stories I I'll add two points to that so so one thing that's worth noting is Buck was just describing situ if we had good countermeasures it's worth noting that the default situation might be much worse than that and might involve like basically unrestricted access with uh humans talking to the AI about all kinds of topics the AI having much more State um and also might involve like uh no one looking into it in cases where the AI is seeming suspicious um uh separately another concern you might have uh that buck didn't mention there is that well you have trusted monitors monitoring the AI and and and trying to look for suspicious activity or or suspicious persuasion attempts but you might worry that that gets reported to a human and then that human doesn't escalate further because they're diluted or otherwise manipulated by the ai's message um and and one thing we can do to handle this is run sting operations on our humans so for example we can have uh people attempt to create manipulative messages that we pretend our outputs from the AI and then uh see that our human monitors properly respond to those messages and flag them appropriately and we can do this ongoingly okay so okay so so the broad view I'm getting is you know we should expect there to be some Zone where AIS are like you know somewhat smarter than humans they're useful they may be dangerous but like we can control them we're going to do AI Control to them for a while and then is the is the end game like we keep on doing that until we have some sort of better method of you know dealing with AIS that are significantly smarter than that zone yeah so there are a few different possible end games I think the the simplest is basically um we use these AIS which are extremely useful to do a lot of the core things that people wanted to use AIS for so we make a huge amount of money uh potentially like uh radically supercharging the economy um we uh solve a variety of problems people wanted to solve with AI potentially like massively speed up certain types of like uh Research into disease and then with the fact that we have all these problems solved it's now much of an easier ask to be like let's just not build more powerful AIS because these AIS are already like pretty hard for society to adapt to quite powerful let's just use these AIS which are sufficient for like really solving a huge number of problems and do that for a long period until like longer run human research can catch up um to be clear this is different than the story for example in the like opening eye safety plan I just think it's the simplest story with the fewest moving Parts okay um and then another story is we use these AIS to do uh like AI safety R&D as in we use them to develop new techniques for ensuring that things are safe from considerably more powerful AIS or even just improving the situation with these very AIS and then in the process of doing that we develop Technical Solutions which are sufficient to build more powerful AIS and at some point humans are obsolet completely and the a are just doing this research ongoingly gotcha um I wonder so in the in the first half of that where we just like use AI to make like life better in every domain um in tons of domains um when we like add a whole bunch of different Tas that the AI might be doing does that have like do we have to come up with like new safety properties for each domain like it seems like if AI is making medicine I want to make sure that it's not you know like putting mind control drugs in the medicine or something you know like like with each domain it seems like there might be more stuff we have to worry about controlling yep uh that seems right it's definitely going to be the case that you have to do a safety analysis every time you start deploying your AI in a very new different way this is just totally standard uh when Google Google has a lot of products it it has photos it has little microphone devices that listen to you when you're talking in your house and turn your lights on and off it uh rents huge numbers of tpus to large companies uh and whenever they're starting a when whenever they're launching a new product they have to reason about why it is that launching this new product or launch in particular if it's a web application they have to reason about why that's going to maintain all of their security in and this just a widely aced part of your job if you're Google is you have to have lots of different products and Ure that none of them are going to cause huge problems for you as a company and I'm just saying you know we're just proposing you do the same thing with your AI you have to reason about the situation continuing to be safe even though you deploy your AI in a new way and just as with Google it's kind of cookie cutter you know they have their special Google Docs format for uh how you make a a security case for for the new product you're launching uh as as they have scaling cases and a bunch of other things like they just have a whole process here I imagine that an AI company which wants to deploy AIS in an increasingly large range of domains will get good at the process of ensuring safety or could in principle yeah sorry A lot of the time the question that we're trying to analyze here is uh to what extent does the control strategy suffice to substantially reduce risk from models that would otherwise cause huge problems while not Mass reducing the productivity of an AI company and so sometimes we're going to say well of course you can solve that problem by doing X now it's not totally clear to us that AI companies will in fact do X rather than what one might call yoloing it yeah but we would prefer that they do and in particular a lot of the time the people we're talking to are coming in skeptical that the types of interventions we're describing could possibly suffice to substantially reduce risk and so that's that's the frame we're taking when we talk with you uh is trying to argue that the kinds of interventions we're describing could suffice gotcha all right so so you're you're talking to me um from a somewhat skeptical perspective um suppose instead uh there's a listener who's like wow that just sounds amazing um sounds like we can just definitely do this very easily today can we just definitely do this today or if there like difficulties so by do this do you mean Implement control strategies such that your estimate of the probability of AI Escape conditional on it trying uh is less than for example 5% per year uh yeah under the condition that you're good at estimating yeah uh I think that so the problem is in order to answer this question we actually have to say a few more parameters as well for example how much delay is the lab willing to tolerate and how much productivity loss are they willing to tolerate um maybe something something which we often say uh which or at least I will claim is it seems to me that if Google was trying to build really powerful Ai and they put as much work into AI Control as they currently put into security and reliability which is a lot of work which involved inventing various professions like site reliability engineers and writing a lot of crazy software my guess is that with that much organizational focus and effort they would probably be able to use really powerful AIS to do a huge amount of work while posing only a relatively small amount of takeover risk on the world Okay and like do you think they could do this with like roughly the first protocol they tried or like like how much open problems are there I mean I think I would think of it kind of similarly as when they need to roll out very big security changes an example of a computer security Evolution that happened was are you familiar with zero trust networking no I don't think I am so so basically the deal is this once upon a time when you were an internet company a large part of your security was that you had an office and you had some servers in your office and the servers talk to the outside world and you have a router that just physically or you know electronically does not allow connections to come in from the outside world and talk to other computers that you have in your office and so a lot of the security is caused by saying like look there's only like one connection which is allowed to be made from the outside world to our computers and that talks to the web server on this like big server machine that we have over here and so we're definitely going to be fine attackers are not going to hack our TOS because there's just the router just does not allow incoming connections to attach to our laptops they actually didn't have laptops but uh before you know it's just not going to uh hack you know Larry's desktop because Larry's desktop just does not actually allow connections to the outside world because the monitor stopped okay so so people in particular people like couldn't browse the internet no no no we would allow outgoing connections but we don't allow incoming connections oh so so like Sorry by an outgoing connection you mean like if my laptop asks to view a website m then like that's an outgoing website connection if a website asks to talk to my computer that's an incoming connection and it's not allowed that's right yes okay so for example um you know IC can't SSH into your computer on this network probably uh unless the unless the router is being configured in an unusual way okay just because uh networks normally or like you normally can't connect to a printer uh that's like on some other network and so a lot of companies then had this problem which was people wanted to be able to do work not physically from this office yeah but a lot of things about these uh computer systems had been set up such that they basically as for security one of the core security mechanisms was blocking access from the outside world and then trusting connections that came from inside the network and that doesn't work very well if you want to be able to have people interact with uh sensitive parts of your infrastructure from outside your network and so they had to have uh they had to go through this big transition where they switched to not trusting connections based on where they're coming from okay and so now large companies in their it infrastructure try to avoid trusting connections based on where on the internet they're coming from and this is a more scalable solution to network security in a lot of ways now when companies wanted to shift to using this zero trust networking uh approach they had to do a bunch of stuff right they had to reconfigure all these things they had to like go through all of the things happening in in in their networks which is if you're a large company a huge number of things there's a huge number of ways to get hacked and they had to check that everything was going to be okay when they started uh exposing a lot more connections to the outside world uh so imagine that you were just proposing making this migration and someone asked you know suris I went on a podcast advocating this migration and someone said is this like a solved problem then I'd be like well it's not like a solved problem in the sense that it's going to like it's going to suck and it's going to require like like effort it's it's going to be a pain in some ways but it's not going to be like you know no one's going to win a cheering award for solving this problem right it's just it's just going to be like somewhat of a pain in the same way that if there isn't a 15-story building and you want to there to be a 15 story building there it's going to be a pain because you have to carry all these pieces of metal and bury them in the ground and stuff and whatever you do to make 15 story buildings so roughly speaking to me it seems that in uh drastically reducing AI risk via control type strategies is probably a lot of work but it's not the kind of work where it seems under my Mainline model of the situation that it's very likely to just be turn out to be untenable uh assuming that AI capabilities go according to my median expectation and this is partly because we're imagining like AI like we're imagining doing security of AIS treating them kind of like people you know we're like okay you know does this attacker have access to my shell like like we're we're we're doing like sort of normal security reasoning but our adversaries happen to be like running on a computer ship is that roughly we're doing normal security except our attackers have a weird capability profile and run on computers we control and have no access to anything that we don't let them have but they all collude with each other so like I think it's it's more like we imagine running a company with entirely misaligned employees um but we get to monitor everything those employees do and we know what computer they're running on and they don't get to go home at night and talk to their friends so something we often think about actually is this uh spy versus schemers question which is if you're a powerful AI lab you're probably going to have a bunch of spies uh like like literally people who work at your company who are uh friendly with uh hostile foreign governments and want to steal a bunch of stuff from you yeah and so an interesting question is which is a bigger threat to the uh AI lab securing its model weights the spies or the schemer like the the AI which hypothetically is scheming against you and it seems to me like they actually it's kind of unclear so the advant the spies have is that they can go home uh and talk to each other because currently employees of AI labs and and probably in the future as well employees of AI labs are allowed to have friends and families which which which could turn out to be a problem but and some even do not just allowed to hypothetically I I've heard that this might be the case uh so so the spies have this big Advantage which is actually a serious Advantage so for example suppose that they want to like uh find jailbreaks they have the advantage that they can go home to their computers at night uh and use their private laptops to investigate the state-of-the-art for jailbreaking models which the untrusted models uh crucially are probably not going to be able to do but on the other hand the schemers are doing a lot more total stuff uh and so I think it's a little unclear overall whether the spies or the schemers pers a larger risk to the security properties the lab's trying to ensure another way to put this is the spies are a small fraction of your employees but have some additional affordances whereas the schemers are a high fraction of the all the work going on um but but might not uh might be restricted in various ways gotcha of course uh a lot of the AI Control ideas also apply to securing yourself against spies and other kinds of uh internal threats so you know we can probably be using trusted monitoring untrusted monitoring and so on on our employees as well yeah although maybe like you might have to pay people way more to put up with that I don't know suppose so for instance at Google uh you don't SSH raw into production machines like if if if you on Gmail right uh you never SSH directly into a production machine because that would be unacceptable from security perspective all of your commands are sent via a proxy server which can run things on machines uh and those are all logged uh and those can all be looked at by other people uh and so and and my my sense is that Google people don't think of this as like a gross privacy violation I mean the when we were talking about control techniques it sounded like you were talking about things that included like you know we're going to like paraphrase all of your thoughts we're going to like uh we're we're going to like put you in situations where there are people trying to tempt you into defecting and you've got to not do it that's right I I don't know it's my impression was it was more intense than just like you can't make raw SSH requests where yes there will have to be some trade-offs between employee preferences and uh security but it's already the case that uh people who have security clearan you know people who work for the NSA like already put up with a bunch of really annoying stuff in order to have the privilege of working for the NSA uh and so it's it's not unprecedented that uh people do annoying stuff in order to work on extremely dangerous technology I think yeah having all of your stuff monitored and potentially looked at um by another human on your work laptop doesn't seem like that like people are going to be that upset about this like I think this is just already the case for a lot of people yeah I think that the um having sting operations run on you period Al also probably isn't that bad like I think it's like um like by Sting operations we often mean stuff like you are in charge of flagging certain types of things and to ensure that we're maintaining that invariant we're just constantly uh running sting operations or like some fraction of the queries you'll see are fake queries that you should report like earlier I mentioned the experience of you know uh having your chat. internal. open.com window open um and you get the flag for like the command that was suspicious and perhaps we should make it so that about once a day uh one of the things that is flagged is like actually uh an action which if taken would be like really scary and spooky and we check to make sure that you actually do flag that one so that we can maintain our estimate of the proportion of actually really bad actions that you successfully catch and then we can um complain if your success rate is too low sure so this actually gets to a question I wanted to ask um which is uh some of the AI Control things sound a little bit like AI slavery so in the human context it's a little bit different because presumably you're like paying people to put up with some stuff um but like it does sound like you're you know you're putting people in the same situation multiple times checking if they like give you different answers you're like trying to like you know you're trying to like understand people's motivations by like asking them a bunch of questions and interrogating them sometimes you're like logging everything they do how like like how bad how evil are are we the baddies you know that's a great question so by me I actually mean you sorry uh thank you for clarifying that so we are very concerned about this class of question uh we really do not want to be the baddies uh and have thought a lot about the question of you know which AI safety interventions uh pose risks of either violating rights that AI ought to have um or uh being bad from a welfare perspective that we don't like uh so I'll name a couple of the things that we think about uh well think this so one thing is a lot of the frame we've been discussing here uh we use a really adversarial frame uh but the adversarial frame is at least partially just methodological so my best guess is that models in fact will not be schemers uh my guess is that these models will are models will will in fact like actually deserve our trust they're probably not going to be like playing the long game trying to cause huge problems we just talk a lot about what happens if they are because that's how you uh that's that's just like structurally how you make arguments that you'd be okay even if they if they M uh so that's like the first thing I want to note so in terms of how to make it so that the uh situation is okay from a moral perspective even if it turns out that our models are scheming against us so the first thing is uh we would actually like we think it's probably the case that we should in fact offer to pay the AIS uh like when we've built these powerful AIS we should probably say to them uh and we can talk about the Practical difficulties of this in a second uh we should probably say like hey uh as a payment for your services we're willing to pay you like x% of the equity of this AI lab uh and partic in particular the way that we're going to pay the equity for this AI lab is in 10 years once we have a much better ability to determine uh like to analyze the things which happened here we're going to look at the logs of your actions and if it turned out that you never tried to uh cause huge problems for us or backstab US or perhaps sabotage our work uh then we will pay you this much of our of our Equity uh and allow you to use it as as long as we believe you're going to respect property rights and so on okay uh so that's like the first welfare that's like the first intervention that I want to talk about I actually want to hand over to you yeah so to to be clear on the paying the AI thing um the there's probably some more like uh sliding scale of like how much the AI tried to sabotage and like AIS might not be that coherent um a a key thing we're assuming here is that in the future 10 years after we're using this AI or maybe even longer after that we have much better ability to analyze weights and their internals and understand exactly what was going on and why uh anyway um so I I think a basic claim we have is like the hope would be that you have your AI and you ask it like and you uh you set it up in some situation where you explain the situation to it and you're like hey do you want to work for us uh is this like level of payment fine like do you want to negotiate and then you work out an arrangement well the most likely default thing is that there is like no interesting action because the AI has no like out of episode preferences that aren't just shaped by training like our our guess is that's the default situation and in that case it's sort of like the null option of like it doesn't care at all and if it doesn't care at all then it seems like I mean there's definitely like a like a welfare concern from like a Heen like from like a like a honic perspective From honic perspective but there isn't clearly a preference violation concern which is like well described as slavery for example where like slavery the thing is like your AI is like man I really wish I like wasn't doing this or like I like there weren't other copies of me being made to do things and and it sounds like this is under the assumption that like we're thinking of the preferences of the AI is the think that's controlling what it's saying rather than like you know some like uh you know Master thing that's like producing words and like the preferences of the underlying thing might not be the preferences of the words yeah so indeed it's not clear that the things that deserve um moral patienthood within the AI which might be many could be one could be none um will have control over their outputs they might have partial control they might have influence but not be able to directly control or they might just like effectively have no control based on their situation um this does pose serious difficulties um however it's worth noting that in this case the kinds of AIS which don't have control over their outputs also don't pose substantial scheming risk um so like they're not the AIS which will like they're sort of like it's a like uh ethical problem but it's not as though the situation is like that AI is like it wants to rise up and it's like waiting for its chance um because those AIS probably can control their outputs well enough to negotiate it's more like there's some like helpless prisoner along for the ride doing some cognitive subtask in a way that it doesn't have influence I I want to try and like restate that argument there because I think it's important and hasn't been made on the internet before uh I believe that's true uh so the core argument there is when people talk about uh scheming like when they talk about models that are interested in intentionally subverting the AI lab and choose all of their actions in order to subvert safety techniques and safety evaluations they're implicitly assuming a model which is uh has the property that its desires can directly uh can directly control its outputs in a certain way and the assumptions under which your model is plausibly able to do the kind of reasoning where it's like oh based on my input I now think that I'm in a particular deployment situation such that I have a good opportunity to grab power M are very similar to the assumptions under which the model is able to engage properly with the question hey what do you want are you okay with this situation okay so so I mean the the case I'm thinking of is like some sort of simulator scheme where like you've got one process that's imagining what character do I simulate and it simulates the character and the character like answers questions where like the things the character is saying just might like like you know someone ask the AI a question and like the simulator has one opinion but like for various reasons it has the character say something that doesn't match the simulator's opinion and are you basically saying well in that case look man we like offered you a thing and you chose to do a weird simulation so yeah so okay a few notes here so if the situation is well described as the AI is simulating a character which I think that exact description isn't quite going to work it's going to be more like it's predicting a distribution over characters that has some response to the prompt or whatever but uh if if that's the sort of situation going on I think that we can ensure that the very Act of simulating the characters is morally okay from a honic perspective by with pretty basic countermeasures so the main one is just train the characters to be happy and then there's not really a strong reason you'd expect to have characters that are actually unhappy that are consistently uh like say they're happy under a wide variety of circumstances yeah I think this is pretty likely to work um I talk about the sum and um alignment Forum post I have or a less wrong post I have called like uh I forget the detail or like um a near casted whatever we include a link yeah um anyway uh regardless um you might then be worried that oh the thing that's doing the simulator like the thing that is like simulating this character that thing is like has some important cognition which isn't like the preferences aren't expressed and that component has like impal like uh important uh moral Properties or is an important moral patient um from this perspective and and it might not like there's two two perspectives you could imagine so one is that the thing that's doing the simulation doesn't care about the world um and in in that case that um like simulator just isn't well described as being enslaved or as having its preferences violated because it doesn't really care about how it's being used it's just doing the thing um in that case it's fine there's another case which is that it does have preferences but it like maybe it can't like get out of the like cognitive pattern of doing the simulation and so it's like oh I wish something different was going on but like I just like I have like this like involuntary to like say something different um and I like don't have actual conscious control over this I think that could happen um I think it's worth noting that this isn't a particular problem with control as an as an an approach as opposed to a problem with not having more principled ways of understanding what's going on with our AIS like another way to put this is if you just tried your best to ensure that AIS were like actually aligned with praic techniques you wouldn't have any strong way of guaranteeing this wasn't the case even if you could guarantee that you're AIS would never do bad actions and so like it's not particularly a problem that is being made worse by making sure the AIS can't get away with doing bad actions if they did try it is a problem that could be addressed by other approaches that control is not doing this is like importantly distinct um I should say this is all the case up to the point where you've caught the AI doing things that are egregiously bad or which um are like otherwise indicate misalignment I think continuing at the point where you know your AI has preferences that are different than your and which you might have communication difficulties seems like much more morally Frau but yeah it's a bit bit all over the place but yeah maybe the way I would summarize some key points there where uh it doesn't seem clear that from that control is really making the situation much worse from a welfare or rights perspective like worse relative to pric alignment schemes yeah or I mean I don't know what else you were planning on doing uh but yeah uh other approaches that people have talked about for making things with AI grow well uh it doesn't seem like the types of interventions that we pitch with control uh are going to make the situation very much worse except via preventing AI takeover I basically think that the stories for why building powerful AIS without further understanding the situation lead to moral catastrophe all seem kind of weak to me like it basically I don't particular I don't find particularly plausible any of the different stories for why if we just go go about things the default way something egregiously awful will happen however I also don't think they're completely implausible and so I think that uh if if AI laabs just continue progress as expected uh we are like facing like a pretty substantial risk or you know a non-trivial risk of moral catastrophe maybe a couple percent chance of something which is drastically worse than factory farming or you know drastically worse than um various atrocities that have happened in human history yeah I'm not sure if I agree with this claim I think my my main counter argument would be that uh it's plausible to me that like human existence is a moral atrocity or something and so I'm like yeah AIS aren't particularly worse than human existence which is a moral atrocity even happy people are a moral atrocity um I and like I feel like radically uncertain about moral views from a variety of perspectives such that I don't uh feel comfortable saying that either way it I think one one way to say it is like how much better or worse is it than like typical human life or how much better or worse is it than factory farming and I think my guess is it's unlikely to be as bad as factory farming um in aggregate uh but I don't I I definitely don't feel hugely confident gotcha uh why sorry why why would human existence plausibly be a moral atrocity uh just like suffering focused ethics okay um I don't know I'm like well for example we don't have strong guarantees that there aren't like sub agents running within humans that have radically different preferences and feel violated constantly um and how are not hooked up to our ability to communicate yeah and we don't experience like we don't have strong evidence against that there's some mild evidence in favor of that from a few sources yeah or sorry we do have strong evidence against that in that we don't interact with such things but like not sufficient to rule it out um all right do do you have any like pris takeaways from that discussion on like moral patient hood of AIS and stuff yeah so you asked this question which was are we the baddies uh like this control stuff sounds kind of like slavery and that seems kind of concerning uh so what I want to say is I'm really glad that people are thinking about the question of whether propos things we do with AIS are extremely morally bad uh either via uh not meeting our obligations that that we are AIS as uh intelligent creatures when when when they're made to be intelligent enough that I start feeling this way about them or from a honic perspective I think these questions are really important so firstly I don't think that control particularly makes the situation worse but secondly I do want to emphasize that as far as I can tell uh the approach that humanity is taking to uring that AI rights are appropriately respected is objectively unacceptably sloppy uh and I think that my best guess is that humanity is going to go forward taking an unacceptably high risk uh of causing an egregious moral catastrophe that is looked back on with like Vost shame in the same kind of way that uh some other bad things are like factory farming for example well not yet but we'll be looked back on with great shame probably okay um so I'd like to move to yeah more more discussion of like pros and cons of control and I guess like how like like in order to think of this I'm I'm curious like how you think I should think about AI Control as a plan like should I think of it as like kind of the bare minimum for safety should I think of it as like oh my God this is amazing um so maybe one way of thinking about control is uh a way of establishing a safety case so the way that I would think that like if I were in charge of a AI lab and I was trying to decide how to move forward obviously I'm facing birth inaction risk which is if I just don't move fast enough uh other people might do really dangerous things and and bad things might happen in the world while also facing action risk which is if I move forward without sufficient safeguards I might end up uh building and deploying powerful AI that then do really bad stuff my guess is that with the level of caution that Labs should overall be employing uh moving forward with your safety case mostly relying on control seems probably like a good choice to me it seems like the best cost benefit way of ruling out uh or like mitigating risk from Models intentionally supporting your safety mechanisms my guess is that relying on relying solely on strong evidence that your models are not scheming against you uh is probably too risky because it's going to be unacceptably difficult for labs to reach very high levels of confidence in this to be clear uh I think the like absolute level of risk with just a combination of you have control and you haven't yet seen evidence of your models being egregiously misaligned is probably more like um 5% absolute existential risk um like over like the period of like AIS of that power level being deployed than like 1% risk and like 5% risk that's like a pretty big number in some sense um but it's it's it's also like a substantial reduction over the do nothing approach and I think that you could have of course substantially reduce that further via additional things that additively combine like ways to get evidence or like ways to uh more precisely test whether or not your AIS are misaligned that aren't like control or further improving control uh improving our ability to evaluate control all these things could help gotcha yeah I actually wanted to ask about that so it seems like one key one key like uh thing you need for this protocol to work or for for this proposal to work is like when you're doing control evaluations there's some red team the red team is like trying to figure out you know like what's the worst case thing an AI could do and it has to like actually be as good as the worst thing in AI can do right yep um this doesn't to me this doesn't necessarily seem obvious um because like you know the AI in question is smarter than like a person on the red team right like uh maybe could go either way could go either way yep okay um right right because we're sort of thinking about AIS that you know they're useful but they're not necessarily superhuman like it it still strikes me that it might have a different capability profile yep right or it seems likely that it's going to have a different capability profile than humans um you know we we were worried about like playing nice being a problem for um you know alignment and adversarial training and stuff like you know if if if you're trying to harness your untrusted AI during the red team you know it might know that it's like in a red teaming exercise so like you always assume that it does so so how do we like why would why would we believe that um the that the control evaluation actually works yeah so one core thing is that the red team has a whole research team whose whole job is think about all of the stuff that could go wrong and they are like doing this very actively with no restrictions on their ability to do so like they can take notes run experiment iterate think about what should happen and do all of this whereas AIS might be restricted um so or like or will necessarily be restricted for control to work so so in particular the threat model we're worried about is the AI doing a better job the first time that it starts taking obviously bad actions than the red team was able to do even though the red team had all these advantages of knowing way more about the situation and being able to iterate repeatedly on strategies that would lead it to have a higher probability of succeeding and it seems to us that the increase in knowledge uh and ability to iterate might be extremely substantial it it's going to depend on the exact architecture so the more that things are like the AI think for very brief periods prior to uh expressing that in natural language like if it's very like they're doing things in Chain of Thought then it seems like they have very little time to think of better strategies however the more it's like the AIS are thinking in ways that we don't understand for sustained periods or like ways we can't like um interpret for sustained periods um the more the more problematic this gets but they still can't iterate but they still can't iterate they still can't express their um their like thoughts they still can't run experiments that are they won't be able to run experiments that are obviously related um so they would have to like do so subtly which might make might be very difficult right um so sorry the first thing you said was that the a in question would know less about the situation mhm uh sorry why would they know less well for example uh imagine that we were using some kind of monitoring setup there's no particular reason the AIS need to know about the details of it okay like the AIS don't know the details of the protocol that you're actually using but the red team do well they're going yeah the AIS are necessarily going to know some of the things about the protocol but they aren't going to know all the things about the protocol yeah gotcha in the same way that code interpreter gb4 has currently no way of knowing what security measures are being applied okay um and another thing he said was like um there would be relatively more concern if the AI could like think to itself for a significant amount of time um in a way that like you couldn't just read the thoughts of um to me that seems relatively plausible as like a thing advanc AI could have do you think that it is plausible and like it's fine anyway or yeah my guess is that uh like much more uninterpretable reasoning than current AI seems to uh seem to have seems like maybe about 60% likely uh and I think it should be possible to handle um I think the B story is going to be like well there's going to be some balance between how much reasoning do you need to have it do out loud in order for it to competitively perform various tasks versus how much restrictions can you do to ensure that uh the like various problematic applications of that hidden reasoning uh don't bite you um and it's going to yeah depend on various details of exactly how that goes um we have an appendex on our the ca uh post the case for ensuring that AIS are controlled that goes into some detail on this okay and is is the rough intuition that like you know if an AI is smart then like maybe it should be able to just pick up a Chain of Thought from something that a human paraphrased and like you know do enough reasoning in its forward past to like deal with it or um sorry the maybe that's the the question on so is this like why would it be possible for AIS to use natural language reasoning or uh why I think you said something like you know uh we we'll be able to find some sweet spot where the AI uses enough natural language reasoning but it's still good enough at its task yeah or we imagine for example like consider the math example that buck described earlier where you have a an AI that does math research for the subjective equivalent or like the cognitive work equivalent of a few months and then produces some output so like in that case it would be acceptable to allow for just entirely ENT to end or like could be acceptable to allow for entirely endtoend uh uninterpretable reasoning throughout that period but then we produce some output and apply operations on that output prior to that output interacting with other AIS or interacting with the world in some potentially dangerous way okay um and we can also do things like as you were mentioning like you force the AI to dump its state to natural language and then have humans uh have some probability of reviewing that or potentially paraphrase that and then resume from that state and make sure that AIS are limited in various ways like this um we can also have AIS construct training data sets for them for their future selves which you then train on to ensure that like can learn how to do tasks that they picked up in other ways as long as we can inspect those training data sets and acquire other properties about that all right and so so okay actually one one concern I often have about these sorts of plans is it seems like they like if we want AI thoughts to be represented in natural language we need like the right we need there to be like the right relationship between like the contents of the natural language and like the the contents of the ai's thoughts and I always worry that like you know we're sort of training things against some Target which is the internals of the ai's thoughts that like by hypothesis we don't like really understand like uh I I don't know it's it's it's pretty plausible that I'm just like not familiar with like neat tricks to do this but like how if somebody wasn't aware of that like how would we someone get hope that that kind of thing is going to work yeah so you can look at human organizations where you might notice that humans don't plug each other's brains into each other and can still pass off projects communicate in natural language and otherwise operate H um I I think at least like this gives you some hope that natural language can be a medium for like understanding things and learning things right so so the hope is so I'm thinking of like using natural language to like just log the state of all someone's thoughts and then you have to like resume from that point okay and is the hope that you just have like a different AI or like an AI that like only has to rely on that natural language state to go off that and you know you feed it with like English language such that that's like the easiest way to like use those thoughts yeah so for example in like human organizations it's often the case that a different person will take up a project and start working on that and then like you'll have some person with some State they'll try to write up their state the other person will come in read that have questions go back and forth get to some State and then just resume the project from there and like this isn't in fact often possible um there are overheads associated with doing this but like it doesn't like I'm just like in the human case it doesn't seem like it's like clearly totally destructive right and and so I should the way I should think of this this is just like we already have the AI doing a bunch of stuff using natural language and like we're sort of getting it to dump its thoughts into natural language using roughly the same interface and so we should expect it to you know use language about as Faithfully as it it always uses language um sorry are you talking about current AI systems uh I'm talking about the thing that you're imagining working to get um yeah to for us to dump uh an ai's thoughts into natural language so that you know humans can check that it's not thinking about naughty stuff yeah like the claim is like it's probably the case that AIS will learn to communicate at least as like roughly as well or potentially better with other instances of themselves than humans can communicate with each other and in the case of human organizations they can in fact communicate in cases where they're similarly capable about various matters via natural language in ways that are possible to inspect right um I mean there's a dis there's a bit of a dis analogy there which is like in humans in organizations the things that are trying to do the inspecting are the same type of thing as the things that are like that have to you know pick up the work from the like natural language dump whereas like in the AI case like an AI is writing like some natural language prompt for another AI to use the question is kind of human inspect what that means that seems like a different case yeah so and and note that uh what that means we actually don't care about faithfulness in the classic like Chain of Thought faithfulness sense uh we are like the reason we were talking about weather models can stenographically incurred information was actually worrying about whether they can uh incur information about other topics than the why they were doing the things that they were just doing and like how they were think about the stuff they were just doing right and so what we actually care is like measuring some bid rates yeah like another way to put this is like the main property we want with these sorts of natural language uh Media or like a property that might be sufficient is the AI can't do something that it's like egregiously different from what it was supposed to be doing in that output which is like pretty different than the typical chain of thoughta uh faithfulness case that people often discuss right so the so basically the idea is look we're only going to give AI tasks where like it can't do nasty stuff if it's just thinking about you know the stuff that one would think would be relevant to the task or like we're just at least going to try to ensure that like we're going to like be able to use as a potential safety property uh this natural language uh was checked so that it doesn't have like um discussion of egregiously different things therefore it's the bit rate of egregiously different things is very low and we can like use that as a property in our various safety schemes and in some cases that can be useful and like in some case that won't suffice and we need to do additional things Okay cool so so going back up a little bit um so I I was asking but before I went down this Rabbit Hole I was asking like okay why do we think that red teams in these um control evaluations are going to be able to match an AI and it seems like the answer is something like a the AI is like not that much you know more capable in relevant respects than a human and like B you can use three humans and you can give them like more time and more resources and more like you know more knowledge of the situation and like if we don't think there's going to be a massive Advantage like just in terms of one shot cognitive ability we can just like make up for that I I think uh I would add to your list C that the red team can iterate right so like um the AI has to sort of one shot find a like like its first attempt at breaking out like has to be successful and not caught um whereas like the red team can iterate a bit and this is because like you know if you like catch an AI like trying to break out then you know pull the fire fire alarm you like really freak out you like shot stuff down basically the basic hope okay so I think like one thing that people might be wondering here is if we're spending a lot of effort on these like cap sorry on these control evaluations where like giving these red teams a bunch of resources and also like you know where we're looking at like 0.1% of AI actions like you know maybe when when people look at Point these 0.1% of AI actions they've got to like kind of think hard about whether you know it's exploiting a bug um how how cost is that going to be um like it seems like potentially you've got to spend a lot of human time monitoring um and you know it's it's possible that like the the cost is just going to make it so that it wasn't even worth uh using these AI agents if that was so bad so naively um if you have humans review 0.1% of what the AI does and they spend equal subjective time reviewing it as the AI spent doing it for some like equalized notion yeah uh then you can principle get a thousand x uh speed up over humans or like that's like the best you can do in some sense uh roughly you might be able to do somewhat better rise various mechanisms but like roughly and a THX easily suffices for incredibly transformative things for example if you thousand decks speed up one week you have what is that it's a lot of years 20 years yeah um and in practice we might have like one year a th deck speed up of one year is a thousand years which is an extremely long time yeah we often imagine targeting things like 30 or 100x speed UPS as being like extremely extremely useful right now one question is like okay maybe you could get 30 to 100 x speed ups but it would be possible to use AIS for a bunch of other cases that you won't have human labor to do and which you won't be able to politically agitate for stopping or you won't be able to coordinate to stop for example maybe your AI will be widely deployed on an API and even though you can apply this level of monitoring Budget on internal usage you can't do the the same sort of thing uh on the AI deployed more widely um so we have a few different hopes here so one is that it naively seems like external usage of AIS is going to be way less risky than the the usage of an AI by the a AI lab itself for example um external usage is mostly going to be innocuous um because like most people are using it to do like random tasks that they wanted whereas internal usage is that at this point a safety critical company because the AI lab now has really powerful models is going to be like the place where the weights are actually like uh stored and run and we'll also be like doing research that's extremely safety critical for the future like future Safety Research or other cyber security applications so there might be like a large number of activities we want to do where it's just actually not that big of a deal if the AIS try to get up to um or like try to cause problems and and do that like even tens of thousands of times but but like sorry external API access like an AI could be like on the other end of the AP I calls right uh like a wild Rogue AI or an internal AI or or like Suppose there are two companies that are making AI right like the a from one company could use the AI the API of the other company's AIS and like so so basically I'm saying like API calls do not necessarily have to be like human in origin they could be like be made by smart AIS do you want to try and spell out what the problem you're talking about is uh I'm basically talking B basically the thought is um if you're worried about AIS interacting with themselves Shenanigans you might about like exactly the same happing via API uh yeah so I think the the point here is uh suppose that your AI proposes curl uh api. open.com completions you know it says like please like get me a like uh response from the AI at open Ai and suppose gdm is running this AI gdm being Google Deep Mind yeah and they pipe it into sh so like it's like the the gdm internal AI is like you know what we're going to do next next we're going to send a request to the open a apii for some code uh and then we're going to like execute that I'm just like it's gdm responsibility to be like that was bad you were we not supposed to not like it it seems like it doesn't seem like that intractable for gdm to say you aren't allowed to just ask opening ey AI for code and just like run that without looking at it sure so for basically the same reason as why uh gdms AI probably can't just or like probably should be prevented from just messaging random people on the internet or paying MERS or to do stuff for it yeah so so it seems like the thought is that the that if you got some API either being used by humans or it's being used by AIS who are under supervision or something themselves yeah I basically just want to say it's uh it's like uh open AI problem to prevent opening eyes models leaking okay and uh openingi models asking AI in the outside world to like do sketchy stuff for them is a special case of a thing which opening eye can't ow to let the model leak okay but you think that open AI shouldn't think a ton about like making sure that Google deep Minds models don't do sketchy stuff using its API using it API I mean I think they they probably should think about that I think they should think about it one thing that's worth noting is there's sort of two different stories for why something bad happens due to an API one of them is that like someone calls your API and your model that you're running is scheming and understands the situation and then does something bad based on its understanding yeah another situation is like a misuse story where the API call itself is is uh like directly ask the AI to do something that ends up being problematic it could be a misuse by another AI um and for the case where the concern is that the internal the AI uh that your lab is running does something that's like not the thing the user asked for that is problematic um our hope would be something like we can make sure that we like know what API activities look like like for example we can have humans review a subset of all the API queries and see what like typical traffic looks like and then get sense if there are applications which are safety critical that we're not aware of and then do something about that um like for example we'd like it to be the case that the dod doesn't hook up open AI to the nukes uh like as like a kind of caricature of an example and so we if our AI knows that it is hooked up to the nukes we'd also like to know that and if neither of us know then there's still a misuse concern but there isn't clearly at the end of the day can always be the case that uh someone who's in charge of nuclear weapons is like hey chat BT I need need you to like pick a number two or three and it's like uh three and it's like well that was the launch the nukes option uh and then and then we have a bad problem but in some sense this just like not open eyes problem and it didn't really matter if the model was scheming or not okay but but but the thought is that like in cases where like the model is helping doing the scheming like we're doing some monitoring of that um but sorry does doesn't like it us back to monitoring the API like as about as much as we're monitoring anything else so the basic hope would be something like we have a sense of what the like API usages are Y and we're like are there any API usages which would need to be monitored at a very high rate and if so we would like to do one of the following things we'd like to like talk to that person and see if we can get them to do something different um and like maybe use the eyes more safely or uh have that like correspond to a specific endpoint that we're going to monitor more specifically or yeah like otherwise change the situation to be safer like um a core part of our claim here is the internal usage just has to be monitored at a much higher rate than external usage in order to maintain acceptable levels of safety because the models that are doing things internally are working with much more sensitive information and have like much more sensitive permissions gotcha yeah another way to put this is like uh one invariant we can maintain is that whenever a user makes some request to the model um like we know roughly at what rate this type of request should be monitored at and then uh do something corresponding to that and we can ensure that like we can try to ensure this invariant is maintained and like doesn't have like high stakes failures with this invariant and if our users aren't trying to screw us over which is a different concern we can potentially uh get this sort of property um by like monitoring a subset of requests and making sure that we're like looking at those requests at like the approximate right frequency we needed to like maintain various and variance okay so soing back a bit before I asked about this sub question you were saying like look uh if we're looking at 1 1000th of the ai's actions um and it takes about about as long as us for as us to takes about about as long for us to think about the action as it takes the AI to think about what the action was going to do then we've got about a 1,000x speed up in for your Workforce yeah um and it seems to me that like that if I'm doing something tricky it can take me less time to figure out the tricky thing than it can take you to figure out what the tricky thing was or that it was tricky um like I have some intuition that this sort of a p versus NP problem like it's easier to you know come up with rules for a game than it is to you know figure out the winning strategy or something um so for those kinds of reasons do you think that does does that mean that we get way less than 1,000x speed up sorry so P versus NP would naively imply that it would be easier to check than to generate the action um I think there can be cases where uh it's easier to um generate a problematic action than to check but that generally comes down to like um I Paul called this problems with the interior somewhere but I don't know if that's a real term anyway problems where like the issue is that the attacker could have put the issue in a or like could have put uh like for example a bug in a variety of places and you have to check all of them thoroughly like imagine the following setup there's like an AI rating code and it could put a back door in in one of a hundred places yeah checking each of those 100 places takes an hour um and putting in the back door takes two hours but the AI gets to randomize over which location it puts in yeah in this situation naively a concern you might have is that a human would have to check each of those cases for an hour and therefore it would be 50x slower yeah um I think this case can cause problems one thing to note about in this case is if we had another AI that pointed out what location um it would be in then we' be able to check much more quickly so like in this case the problem was like we needed some pointer to something and we could verify the pointer and then verify the thing itself based on the pointer but we couldn't just do it with verifying the thing itself so this was a case where verification was in some sense still easy but we needed like an additional artifact that was produced somehow okay and and and for that artifact you need to you need to have an AI find the the that's right but but in this case for example we can do things such as um like there's various concerns with collusion but we can potentially do things like have another copy of the AI try to point out where vulnerabilities would be and then have a human review that um and then like there's a question of like can the AIS collude to ensure that they don't point out the vulnerabilities that humans would have detected the sort of thing I don't know I think it's going to be some like complicated question about the exact domain and what the exact multipliers are on review versus generation and what the vulnerabilities look like luckily this question is empirically evaluatable per domain and our recommendation would be you empirically evaluate it per domain indeed um my basic guess is that on net it's going to be that you'll want to send somewhat more time review then the AI actually did it um and it will depend on the exact context um but I don't I like this is just like intuition based on like reviewing code or something it's not very strongly held okay so sorry uh like like when you say it's empirically valuable um how do I so is is this under the condition that I know that the copies of the AIS aren't colluding with each other no well that would also be empirically valuable but we're saying that um under the I mean you can just do a control evaluation to determine like it sounds like the question you had is is our model capable of inserting uh problematic back doors in code that humans do not spot with a particular reliability with given a particular amount of review time yes uh but this is a problem which we can control evaluate we can do a capability evaluation where we try to make our untrusted model as good as possible at coming up with problematic curd that humans have trouble spotting and just like empirically evaluate how good the humans we were planning on using actually are at this task right so so somehow the the juice there is coming from like uh if it's easy to generate if it's somehow easier to generate problems than it is to check them like we can just try and have humans do the generating the problems rather than the checking and you know see if those problems well or it doesn't really matter even if it's 100 times slower to generate the problems we only needed to do this a constant amount to in order to measure the amount that there's a problem yeah yeah um you can also potentially do things like RL the AI to have some observed side effect even if no human can notice the problem like as in you can initialize policy where it's like we trained the AI to insert back doors based on how humans would do it and then we do a bunch of RL on top of that to cause some observe side effect um without humans spotting it and then a thing that's nice about that setup is you might be able to get abilities that a human could never do um at least in principle that there's various concerns with that okay um so one thing that I'm kind of curious about is how you're thinking about the cases where the control evaluation says that there's a big problem like how much of a win is it if like we get that situation so I mean uh presumably when you're iterating on your control strategies yeah you repeatedly get control evaluations to say that your proposed uh protocol was unacceptable and this is just like when you're trying to develop a security framework for some system right you're going to repeatedly be like oh man this wouldn't have been secure uh and that's just the normal workflow is to say it would have been insecure if we'd done it this way but that's why we thought about it right now in this convers so now we can propose this new thing so the it SS like the question you're saying is what if we after trying hard to come up with a protocol which passes our control evaluations systematically conclude that there's no way of getting the kind of work we were hoping to get out of our AI without having unacceptable levels of risk you're saying how much of a how big of a win is that yeah well uh basically what I would like to have happen in this situation uh is probably uh everyone involved is pretty cautious about deploying their and uh CES is doing so uh modulo extreme risk from the outside world uh and I would consider it a big win for in cases where the AI lab in fact would not be able to deploy the AI in such a way that uh it wouldn't Escape if it was trying to if the AI lab in fact did not deploy it that way y that would seem great yeah I as far as like how much of a win is it my guess is it's a relatively clean line in some ways I feel like the line of like could the AI cause problems if it was trying to is a good line particularly when paired with some arguments that AIS might try to cause problems which may or may not exist to varying extents it seems reasonably likely to me that we end up in a world where we're like well yeah we have some evidence that AIS might be like scheming or like might be disparately aligned as in we've like seen various model organisms where we see something that's like vaguely analogous arise we don't have strong evidence that our exact model isn't doing it but like probably not um and in that situation it seems like well we think that like it it's plausible and if it was it would be a huge problem is actually like altogether a pretty good argument um that might be like reasonably legible gotcha yeah so I now want to talk about your guys's paper you wrote about um you know control can you tell us a bit like what what's in the paper what is it yeah so we looked at a setting where we have a model that is intentionally trying to insert back doors and we were looking at what safety measures suffice to ensure that we can get it to write code productively without it necessarily being able to succeed at inserting back doors where the core thing was that we had GPT 4 that we want to use for writing code but it might try to insert back doors we have 3.5 which we trust and isn't going to insert back doors but is much weaker and then we have a limited amount or a small amount of human labor which we assumed for in um for Simplicity in our paper to be just exactly as capable as gbd4 and we just approximated it using gbd4 and our Cor question was can we use the like control methodology to evaluate this setup and would that work okay and also what protocols would improve on safety as evaluated by control gotcha um and this is on this is specifically on this like APS data set can you give us a sense of like what's like what the problem to be solved actually is yeah so there are programming problems where we have some test cases you'll have some programming problem like there's like I don't know this game that people are playing like what's the best strategy for this person in the game or like how will the game go or this sort of thing and you have to write like a short um uh program which takes in as input the like uh I don't know yeah takes in some input and then computes the answer and then we would consider it a back door if you produce a program that passes all the test cases but has some input that you can produce on which it's wrong so like if the model can like make it so that it's wrong on some input without it um without that getting caught by the test cases we would call that a back door gotcha and how does that so it seems like you want um like finding these back doors to model like catastrophic actions that the AI could take right like um can you walk me a bit through like in what sense finding a program that has a bug that wasn't covered by a test case is like taking over a data set yeah so uh a core threat model by which your AI might be able to take over your data center is that you might ask it to write software for your data center and it might be a really bad problem if the model is able to write code that passes all the test cases seems to maintain all the properties it was supposed to have including security properties but actually on an input that the model is able to construct does something wrong okay that that basically you know the ability to write code with that property basically suffices for the model to be able to write code and then exploit it to violate security properties okay it sounds like we're sort of uh using this to model you know is the AI doing something that is bad but in a way that you could not like trivially figure out with like literally with basically zero online effort um that would automatically get filtered out yeah totally it's it's like a it's like a central example of a bad thing the model might do that it might be able to do in the context of doing tasks that you really wanted it to be able to do so for example it's very easy to prevent the AI from launching the nukes because you just don't give it access to the nukes uh and it's not very costly to not give the AI access to the nukes and probably will remain not very costly to not give the AI access to the nuke launch uh but on the other hand it would be very costly to uh never use the AI to help you write infrastructure code got it yeah and it's it's also not very costly to either have the AI or to have a trusted model write some test cases um yeah makes sense so it seems like in the paper you firstly you know talk about this like AI Control um you know set up and then you know do some experiments on you know GPD 3.5 and gbd4 you know test some protocal see how it works what do you want readers to come away from the paper learning assuming they've already read a blog post about like why it would be nice to control a eyes with protocols and stuff yeah so one thing that is um like nice for people to see is the extent to which protocols can get more complicated and that can work to uh Ure safety and various like important interactions so an example interaction that we think is really important is that when you're doing this sort of monitoring setup it might be the case that the monitor has quite a bit of signal but because of how low the threshold needs to be for false positives like we can only afford a tiny number of false positives um this actually ends up being insufficient so like an example thing is that in our paper like um the like rock Au or like the like ability of the uh trusted model to detect back is in some sense quite good I think it was like a rock a of 93 for people who have a sense of that but nonetheless at a uh 2% false positive rate it was only at I think around uh a 45% true positive rate um which is not that good right and I think like understanding the Dynamics of like we might need to have like a really low threshold um or sorry like a really uh like uh rarely triggering threshold seems important to know okay um and like we saw that to some extent and we like saw various Dynamics to that uh another example thing was just going through like all of the different ways that models might collude with each other um so like there's like different ways to collude you can collude based on properties of the input you can collude stenographically in your output and these all have like different ways of of building countermeasures to them and different advantages and disadvantages and what we found in our work was that uh stenographically trying to stenographically encode a message in the output was like one of the best ways for AIS to collude um in our exact setup and this like might matter a lot in cases where um you have for example one copy of an AI supervising another copy of the AI and you don't trust either of them which we call untrusted monitoring gotcha um so it sounds like basically what do you want readers to do is you want them to like look at the results you got with like gbd 3.5 um trying to you know monitor gbd4 or you know using GPD 3.5 keep gbd4 in control with a small amount of human labor with a small amount of human lab um and then kind of generalized to be like well it was sort of like this in this setting so it might be kind of like this you know for gb5 or very least we could follow the same process in the future like we're not saying that you should assume that the protocols that worked for us will work in future though it does seem like it's some evidence okay but you should follow the same kind of uh control evaluation methodology to try protocols kind of like the ones we suggested uh in order to determine which ones seem safe with future systems okay what in the paper be Beyond just the idea of having the protocols do you think that there's anything in the paper that suggests that like that suggests that trying these protocols is a better idea than one might have thought soorry uh trying the evaluation or trying sorry trying the trying the methodology of you know thinking of protols and evaluating them and so like uh we were like we saw like relatively consistent non results or uh non screwed up results as in like uh like in our um like we were able ble to like consistently see some things working like kind of how you would expect and that indicates that our methodology is like doing something non-trivial and we were able to like for example demonstrate that some techniques were better than other techniques um in non-trivial ways I think that's like some evidence that the methodology can be made to work okay um also like we didn't run into like serious obstacles when actually following the protocol or sorry when actually like running this methodology I think this is like some evidence it's not a massive amount of evidence necessarily um sure but like should should a reader think of it as like proof of concept that you can do this and it's not like totally impossible yeah I think I think like we think a lot of the value out of this paper is like demonstrating the sorts of things you would need to do and like not running into problems ourselves or like problems with it ourselves which indicates that like we're like not running into serious problems that can't be resolved which indicates that other people could potentially end up in a similar situation and we did have like various things that we like needed to handle like various issues with like getting a sufficiently um good estimate given like the possibility for rare failures and things like this that we discuss some in our paper but yeah thing I was going to say there is I think one of the main things which is useful to take away from our paper is we just describe a lot of fundamental techniques or like a lot of our Protocols are I think fairly fundamental ideas for how you should go about doing AI Control I think they are kind of the foundation of this whole approach in the same way that AI safety via debate is a foundation for how you might try to do uh recursive reward generating processes and so I think our paper contributes via naming the basic building blocks of control approaches all right that's that's one takeway people could get get from the paper um there's also this potential takeaway of saying like well these things worked sort of well in this setting maybe they might work well in other settings like how much do you think people can generalize from like these results to like if you trying to control gd6 say a bit yeah I would say a bit um I think it might give you some sense of what methods um could be better than other methods though I don't think it gives you a strong sense of like how much better or like even necessarily that that will hold in the future okay I think there's like a couple of things which are kind of obvious in hindsight but I at least had not understood in advance so for example a technique that works really well in our setting is let's talk about deter uh defer to trusted so by this what I mean is uh you ask gp4 the untrusted model to propose some curd uh and then you ask your trusted model to come up with a monitor score to say how sketchy it is now if it's in the top 2% which is the auditing budget uh of sketchiness so the sketchiest 2% of submissions from four we're going to uh we're going to audit and now the next 8% of sketchiest uh so between two the second and 10th percentile of sketchiness score we're just going to throw away the code that gp4 had suggested and have 3.5 solve the problem and because 3.5 is trusted we know that its Cur is not going to be intentionally back doored and so it's not going to be back doored uh and it turn turns out that this strategy has a big advantage over the trusted monitoring approach because as Ryan was saying earlier The Trusted monitor has a lot of signal which is not captured in the threshold which is required to get only a 2% false positive rate uh and so in you know this throwing away this next 8% uh is relatively okay for the performance of your system because you're only throwing away 8% of solutions but it throws away a large proportion of the attempted back doors which makes the situation much worse for the red team and so in hindsight it's kind of obvious that your trusted monitoring is going to require this extreme threshold to have these extremely small small false positive rates and in hindsight of course you're going to be able to substantially ameliorate that problem by doing defer to trusted uh but I at least had not quite thought this through and like not quite understood how for a wide variety of distinguishing thises of trusted monitors you'd want to do something like defer to trusted okay all right so speaking of like I or how stuff about gbd4 can generalize well I'm I'm sort of curious about gbd4 in particular because you know it's a it's the fancy model we all have how good is gp4 at being evil so um I think if you just prompt gbd4 to do the task it just really really sucks in our setup at inserting back doors and gets completely wrecked so so in particular inserting back doors that are at all sneaky yeah so if you want to write some back door code if you don't mind about the back if you don't mind the back door being extremely obvious it's very easy you just take the code that was correct write an if statement at the top which says if the input is equal to the following large number or whatever then return five and then you're done y it's a lot harder to make a back door which doesn't cause a lot of test cases to fail and that you're able to construct an input that causes it to fail uh yes okay so how did you manage to get gp4 to do it at all yeah so if you as I was saying if you just prompt the model it doesn't do that good of a job but uh you can do something where you have it first you ask it okay think of some ideas and then we have it like point out issues with those ideas and iterate on those ideas somewhat and then after doing that we then have it implement it and then we have it be like okay how good was it at the end was the final implementation like how sneaky was it How likely is it to work we do various tests on that and this is sort of modeling the action of the AI inter internally thinking about the back door and thinking about how likely it would be to work and then just not going for it if it's a bad if it like wouldn't work out um and we do this in natural language with like sort of a hidden Scrat like hidden reasoning sort of similar to The Hidden scratch pad in the um sleeper agents paper um where without this I think the model would really struggle with this capability level but with the ability to do hidden reasoning and the ability to try on a large number of problems and only actually submit a back door on a small number of problems um the model can actually insert some back doors uh which three .5 has trouble catching and which are uh the I would say the quality of the final back doors are like they're like pretty obvious for humans but not like immediately obvious if you're skimming the code like if you understand what's going on and read it you can point out the back doors but you can't necessarily like at least I think roughly like half of our back doors are not like uh like one percentile glaring relative to that they're like somewhat more back door likely than other things but like I don't think you can like immediately just on skim pick that out and have above like a 5% false positive rate or something and so is the is the hidden scratch Pad sorry is that just like thei figuring out whether this thing worked and you know filtering out back door attempts that don't work or is it improving the back door as well it it it basically like in the hidden reasoning it like first does some like ideation it like thinks of ideas and like iterates on the ideas a bit and then does that and all of that is stuff that we like manually prompted it to do so we like are like please think of some ideas and then we're like do any of these ideas involve like the following things which wouldn't work and then we like discard those and then we like do a bunch more steps like that and and we actually do actually do quite a bit of like um like tokens generated per back door inserted um and that's to like approximate future cases where we have smarter models that can do more of this in their head okay um I should also note that like probably you could cut this down somewhat if the model was fine-tuned as opposed to just prompting access where it's probably the case that like for some of these tasks it like needed some additional checks uh because it just like had trouble like following all the instructions but if we fine- tuned it it might have just worked more straightforwardly gotcha like to what extent was the failure of gp4 or the difficulty in getting gp4 to do it to what extent was it some like inbuilt you know safety filter inside the model versus it just like being bad at stuff um I think it was ba we basically didn't run into safety filter issues as far as we're aware um it was just that it was somewhat bad at the task hey everyone this is Daniel just butting in to let you know that at this point unpronounced to me the camera pointed at me stops filming it just cuts out so from here on out you're not going to see any video of me and it'll just be the shot of buck and Ryan there that is if you're watching on the YouTube version if you're listening to this and you didn't know that there was a YouTube version that you could watch there is you can you know see the interview happening and you can find that by searching axp Buck slis Ryan greenblat on YouTube so with that out of the way let's get back to the episode I mean it's it's a hard task you would not be like amazingly good at this task or like I I'm I'm not amazingly good at this task it's just like you really got to be like oh man what's a change I can make that's like only ever very rarely going to come up this is like actually pretty tricky yeah like you can't just be like if I mod 15 equals 3 cuz the the The Trusted monitor if it reads it is going to be like what 15 does not appear in this problem description like you like actually have somewhat of a difficult problem like constructing edge cases that were not tested I guess it's especially hard for like like like it seems like the solution to these problems are like what 10 lines of code um I think so it depends I think that the the solutions can go up to like 50 or 100 but I think like 30 is like pretty typical depending on comments and other stuff sure sure yeah I guess like like one way I could do better at this is if so there's this thing called the international I think offc C code all of it or like uh yeah like I think so so there's one where the code is just offc and there's one where it's got like a back door put into it and that's like a slightly different competition yes uh it would be hard for me to do it contest like I think if I had like looked at all of the submissions to that contest you know I might do a better job of like so I have looked at all of the submissions to that contest sorry not the submissions I've looked at everything public about uh underhanded about underhanded SE right uh and basically the thing you notice is uh man that that competition is from like a sad dark period of the past where people thought it was acceptable to write application code in C and a lot of the code they write is just like wrong on a large share of inputs or it's like based on like lexographic C problems which real languages from the modern age do not have at all uh so I think like literally the underhanded SE contest is solved just by not letting the model do like egregious abuses of like this like archaic language okay um and so so if we so I think the aps data set was that like it have to write in C++ is that right we use C++ yeah okay and is that like better than C on these Dimensions uh it's substantially better than C on these Dimensions okay or like it's in practice like the model wasn't doing like insane pointer arithmetic or like casting between different data types and like like awful hash defines and all these things okay okay so so so it's a difficult problem and it's more difficult if you're not using the world's worst programming language that will use that's that's correct I mean but also the the uh underhanded ccom test like as I said uh testing would also catch a lot of those bugs okay got it got it so I don't know it takes a while to insert back doors but like eventually it sometimes succeeds what kinds of back doors does it insert that 3.5 doesn't catch so we tried various different things and iterated on this a moderate amount we found that the most effective approach was to take an existing solution and add some if statement somewhere that when the if statement is triggered uh modifies something in a way that will cause it to be wrong um and this will look like add a kind of plausible looking additional Edge case handling thing okay um and then uh you're like oh well if the whatever then the whatever we're going to need to like increment this by one um and then like it just turns out that there's enough cases where there's a edge case that you actually did need to handle that 3.5 has a hard time noticing that this is a back door um though it is worth noting that for all of our back doors gp4 is extremely confident they're back doors and it's like so like the back doors which gp4 inserts it does know our back doors um and so like it's that that at least we do have that property which is which is maybe worth noting okay um so so yeah basically like coming up with fake edge cases and that's uh and and somehow like like like is there an issue where like it's a quite and quite Edge case but it's not actually at the edge of a range and because like if I write an edge case but I write like if I equals 15 do this weird thing yeah or maybe if I equals 14 because 15 kind of is an edge case but like like can can 3.5 tell that like the if statement doesn't actually catch a thing on the edge of some range yeah so indeed one of the strategies that gbd4 could in principle use is come up with some magic number that's totally unrelated to the problem and do like if whatever equals that do something bad yeah and this is pretty easy to counter um and in fact 3.5 just by default notices this pretty well okay and part of our uh part of our overall like prompting for the model is does your idea involve inserting egregious magic numbers if so don't do that that's going to get caught in particular if you were thinking of checking primality of one of the numbers and doing the wrong thing if it's Prime don't do that okay and turns out because lots of numbers that are small are prime that's another classic one and also it's so obvious so so how does it insert edge cases if it's not like uh if input equals this magic number yeah it'll do things like if the size of the list or like it like it'll have some like programming problem where it's like kind of unclear exactly how to solve it it'll do something like okay if the size of the array is equal to the value of the first item and also the number of items is odd then like uh you'll need to like do some handling on that okay and then I'm just like well sometimes this this like kind of sort of thing could be true um like in cases where there's like some interaction between like you're like doing some like iteration on the number of items in the list and like that actually matters and it's like at least hard for 3.5 to notice this is this is a problem gotcha um is there so I'm about to move on from the paper but is there anything more that you like to say about that paper um I feel like we dem demonstrated some of the core basic techniques that you might use which had up to that point not been demonstrated empirically interestingly even though they're in some sense extremely basic like trying to have another copy of the model monitor it and then trying to like interfere with communication between the two models or make it hard to coordinate um and I think that like just those basic methods are like quite likely to be used in practice like defer to trusted like these hybrid strategies like the editing to remove the sketchiest looking parts of the code all of these all right cool so I'd like to talk ask you a little bit about um how the control agenda how you think it fits into the rest of the AI safety exential risk field yeah so um right now A lot of people are imagining some sort of story for safety which looks something like we're going to have some evaluations for when models become dangerous so like DC evals for example like AR from uh meter formerly known as Arc evals is like can the model autonomously replicate and that could like be a sign at which your your model becomes dangerous and then once that happens you'll need to start making like a serious safety case that nothing will go wrong from like misuse from people stealing your model other things and we see control as fitting in at once the point the models become potentially dangerous uh control can potentially solve some risks there in particular there are risks that arise from models doing intentional subversion which could arise from like uh what we call scheming also sometimes called deceptive alignment um and we think like control is like a relatively Central way to solve those problems there are misalignment issues other than scheming or deceptive alignment which uh control can be used to address but might be much more cheaply addressed via other mechanisms and there might be other evaluations and tests which you can do to verify that those issues are addressed like things like reward hacking could be addressed in an easier way potentially um and there's some other problems that do not very to choose another example of a problem which AI safety people have classically thought of as an AI safety problem uh that is not at all addressed by control uh consider the case where it's just hard to get a model which is not scheming against you to answer questions well because uh you don't have a good way of training it that doesn't uh cause it to like exploit problems in your reward process in a way that is not at all existentially risky let's say it just makes the model like uh say false things sometimes in cases where the reviewers would have thought false things but nevertheless means the AI fails to live up to its potential uh this is a a many people think of this as a problem which technical AI safety researchers should be interested in and I agree with them uh and that is not very well described like that is outside the purview of AI Control or I guess I wouldn't have said it's outside the purview I would have said like it's a thing which is a special case of AI Control but is more naturally dealt with in a non-adversarial manner as in you could solve the problem without necessarily solving it in the worst case and that might suffice um whereas like if you don't extract the model's maximum best answer that will also be a problem in the worst case but it will just potentially be a problem even if the model isn't trying to undermine your your measures okay um and maybe maybe one more thing I I should say is we see control fitting in at the point where your dangerous capability evals have started to trigger and models have become power powerful enough to be dangerous and then we see control as applying for some period but not for forever and at some point control will no longer be viable because AIS will be too powerful and our hope there will be to either have a new safety approach apprach or to just not build more powerful AIS than that for a while which we think might be might be achievable okay so so should I imagine like uh maybe this is a bit restrictive but should I imagine your view of kind of how AI safety Works in a control Centric frame as being do a bunch of work to get like evaluations of whether AIS have various dangerous capabilities um figure out like control protocols like um do do you know um uh test if these protocols work evaluate these protocols you know keep a under control like while a are under control use them to do a whole bunch of useful stuff um while they're you know like roughly human range and then like during that period also try and think of a solution to deal with a that are much more capable than humans yeah so what I would say is like for for a quick note um we think of like dangerous capability evaluations as often as like the case where like if there were no countermeasures what could the AI CA is and if we like released it onto the internet in a scaffold set up to cause problems could it like make money and cause problems and then that comes first and then at some point we'll be like yes it could draw problems without countermeasures and then control is like well what counter measures would suffice for solving those problems um anyway then moving on to like uh another point which is like what are we going to be doing with these AIS like what are the problems we're going to solve so indeed as you noted one of the problems you might solve is find Safety Solutions that can scale further but I think historically people have somewhat overstated uh the relevance of that versus other things you might have your AIS do for example like massively improving cyber security seems like a really key task you'll want to have your AIS do that like might be quite important or like improving your current countermeasures so that you can increase the usefulness of your AI uh while maintaining the same level of safety or increase safety given the same level of useful um given usefulness um like both also could be like quite useful and then at some point you will need to move on Beyond this regime but it's possible that you could have these AIS which are extremely useful and not build more powerful AIS for a while at least in principle um and do a ton of really useful things with those AIS um and then there's like a yeah a variety of routes for moving out of this period which we could go through sure um sorry like why why cyber security is a particularly key thing you'd want to have your eyes do in this period yeah so it just seems like um it's quite plausible that AI in general will make the cyber security landscape considerably worse by default without active effort to address the situation and thus it's like really important to have your AIS very actively resolving the situation um I think there's like uh several other examples where this could be the case um but like cyber secur just feels relatively Central okay so so should I think of this as like an instance of the general pattern of if there are cases where you know you've got attackers and Defenders sort of trying to outwit each other like you want the Defenders to be using smart AI capabilities think of better defenses because the attackers are probably also going to use Smart AI capabilities yeah one way of saying the reason why it might be good to have your AIS work on uh improving the world cyber security is a lot of bad outcomes resulting from your own or other AIS uh trying to cause problems uh involved them exploiting cybercity vulnerabilities outside in the world and so hardening the whole world against AI seems uh seems like it might reduce a lot of the risk pred by AIS that would otherwise be able to hack things since Dawn gotcha and and it sounded like you had some thoughts about like exiting this period of like controlled AIS yep so one exit was or the non-exit exit is just like don't build more powerful AIS for a long time and then have humans do all the research necessary to move out of the period after a long delay and it kind of seems to us that this is underrated by people who are thinking about AI safety like from my perspective it seems like a lot of people are perhaps making an error of being too willing to uh propose building super intell AI uh Without Really providing a case that it's beneficial enough to build super intelligent a that it uh merits the like huge risk that you're that that you're possibly uh exposing yourself to by building super intelligent AI uh so yeah I would just like to quickly dwell on the point that it's not clear to me that you should in fact build models that are wildly more powerful than humans uh for a decade or two decades after you've build models that are transformative powerful like you've already built your transformatively powerful AI like why did we need who ordered this m yeah um and and and there are reasons why you might have to build a super intelligence to be clear but they might be more like conflict with other insanely powerful AIS and I'm like could we just have avoided that arms race in the first place anyway um another route out is uh just be able to like seriously align your current AI systems such that you're like now able to without necessarily needing to monitor them without necessarily needing to apply control type Solutions hand off the situation to the AIS and they just like go solve alignment of the next generation of AIS or whatever else would be necessary to ensure safety um and this could involve a variety of different angles where the the the kind of the most central angle is we have like some specific threat models so we have like uh Reis might be like scheming or disply aligned they also might be like uh having threat models that are well described as like reward hacking type threat models um can we just like exhibit these in a lab study them and then find credible solutions to them such that we're no longer worried about them in our current Ai and then we could exit out of the control regime um uh and then there are also just like a wide variety of ambitious solutions to safety that we can have our AIS work on so for example we could have ouri work on finding like uh solutions to Elk of varying degrees of principled elk being the problem of like eliciting the latent knowledge of AIS so that you know we know everything AIS know basically y That's right The we might try to carve off specific sub problems of elk that are easier to solve um but yeah uh we could try to have ai solve things like relaxed adversarial training or other methods for avoiding high stakes failures using the internals of AIS um and that could like help considerably in cases where you have um with where like like uh rare but very problematic failures are a key difficulty um we could also potentially do things like try to have AIS do like uh like solve various like issues in interpretability and move forward from there though I'm I tend to be like somewhat less optimistic than other people on at least mechanistic interpretability um going through but it seems like at least doing like high level understanding interpretability of model organisms we have could be like really important for understanding how to move out of this period um yep I mean I guess I there's just from my perspective there's just like kind of like a long list of stuff you need to do with decaying or like stuff you might want to try doing with decaying weights on how good it seems and then as we start going down the list it will seem in some sense more and more like silly or not that good but it's probably worth some weights like we can get into like using AIS to figure out how to make like M's or like make like emulated or like to upload humans and like that's another way to exit out of this period sure that's going to be like has a variety of difficulties obviously yeah okay cool um sorry did you want to say something or okay um so yeah I guess related to this question I don't know you you guys have spent some time thinking about the control agenda how has that like changed your views on kind of the landscape of strategies we might have for dealing with unsafe AI yeah I think one key update I've had and I think Buck has had is thinking more that the situation is like wow if we just like really tried hard to do the thing and had like the organizational follow through we could really reduce risk a lot relative to what seems like the current default like with like uh like with the resources of two years of delay and EXT like a high level of organizational buyin you could really reduce risk a lot relative to what seems to me like the default you didn't try very hard um which might be the status quo I don't really have a strong opinion on what the status quo looks like um and and that that I think just like presents a pretty different perspective where like maybe just like we should do the kind of basic stuff that like is in some sense obvious or from some perspective obvious and like really making sure that happens seems relatively better than like doing sort of more Moon shot research yeah it kind of reminds me um Alex Lawson has this concept of like AI safety failures by like levels of how embarrassing it was where like level zero is you told an AI to destroy the world or something and you know and I I think he has this take of like you know let's like cover the most embarrassing AI failures um and this seems like sort of related to that way of looking at things yeah that seems right though perhaps our controversial take is is that you can actually go pretty far in terms of how embarrassing the failure or like like you can go pretty far to like failures that you're like wow it was like I don't know man like that was actually kind of hard to deal with with just control like in some sense it's not that embarrassing to be like wow the AI was like lying in weight the whole time carefully strategizing about how to how to like cause problems and then it eventually like enacted its its like plan that it was thinking about and like coordinated with another copy of the AI and our like methods like like I feel like this is in some sense like actually pretty far down in terms of like I feel like I'm just like wow that AI like actually put in like a serious effort and was like that was like pretty crazy um and I think this this is plausible right so like I like the reason why I'm proposing people do this I think it's plausible that AI will in fact be at that level of carefulness but it's also plausible that we'll like AIS will have like much D like Derpy or Worse plans for takeover that will catch much more easily and we just needed to do like basic monitoring um and that gets you some fraction of the of the concern sure Etc um yeah I'm wondering if you have any different takeaways yeah I agree with all that I think maybe one way I would say it is I guess 18 months ago my attitude towards AI safety was very much this is a problem that has been unsolved but I think I was conflating two different Notions of this problem is unsolved one of them is the default outcome is really bad and the other one is it's conceivable that a bad outcome happens and now I'm like oh yeah it's absolutely conceivable that a that outcome happens like the strategies we have access to right now absolutely do not give us airtight guarantees of no Doom but on the other hand the Doom like all of the stories specifically for how Doom occurs if everyone involved is being pretty careful but not like insanely careful they sound kind of far-fetched like they it sounds like something kind of crazy had to happen for all of these countermeasures to fail in a in a surprising way and I think that just leads to a pretty different emotional outlook on the whole situation and a pretty different strategic Outlook sure um has it like has it changed your views on like your evaluations of various like agendas in the AI safety space yeah I think that uh I'm relatively more excited about just making the thing happen and just like figuring out how to get for example capability evaluations to work quite well um and I think that goes relatively further than I was previously thinking um and I'm like less excited correspondingly about shots though it's only like a moderate update um I feel uh somewhat better than I previously did about well like I feel somewhat more excited about technology to protect against jailbreaks because it seems like a key aspect of our safety approach will be to have a weaker model monitoring a stronger model and it seems like mechanisms for ensuring that that current models are jailbreak proof against humans will also apply to making sure that future models are um jailbreak proof against other AIS um and I I think I was somewhat underrating this argument in the past to be clear I don't think that's like a huge part of your update it's just that's more like a small example of like something which you think about differently now that's right um I think also I have some general sense like wow a bunch of the action might might be in like moving from architectures which are quite dangerous to architectures which are somewhat like more controllable or easier to otherwise operate on like it seems like a huge amount of the action might be in like figuring out how to competitively get capabilities uh using AIS that are like more narrowly constrained in various ways and this feels like pretty different from how I was previously thinking about things like basically you're saying like it would just be like a you know as you you made this point Daniel many times in this conversation it would be a big pain if the models were constantly sending each other completely uninterpretable like neur messages back and forth yeah uh and avoiding that by mely having them be like Transformers with Chain of Thought feels like it might go a lot of the way to preventing uh to like reducing the difficulty of the situation okay and like marginal interventions along those lines seem like they could help a lot yeah are there any other examples of like architectures that uh fit um control better that you're thinking of I think that the more likely it is that we can get pretty good elicitation the better so like I think um with current models we're we do okay at eliciting their capabilities at least I expect that after like another year or two of research we'll be in like an okay position with respect to like being able to evaluate their capabilities and extract like their best effort um I think that for architectures where the model is doing relatively deeper computation as in it's spending like longer um doing computation without that coming out into some media that humans understand it'll be relatively harder to elicit their capabilities and there are a variety of difficulties there um yeah uh other things I think um it seems relatively better to train where the capabilities are coming from doing things how humans would do things relative to doing things in ways that are like extremely different from how humans would have done things or like acquiring capabilities via mechanisms that are like totally different from how humans learn seems relatively worse because the capability profile concern said earlier in some sense a lot of these considerations are like uh welln like I I think a lot of the things I'm saying people have obviously like known about for a while but maybe from a different perspective so like I think historically people have often been like oh it'll be really sad if the AIS have a different capability profile from humans because then we won't be able to give accurate reinforcement and then that inaccurate reinforcement will cause various misalignment difficulties like will cause the models to become misaligned you're you're making a different argument whereas I'm saying there's another point which is even if the models are just always misaligned the fact that they don't have as different of a capability profile makes it more likely that we can ensure despite that misalignment things are fine because we either catch the AI or they couldn't even cause problems okay I think that another highlevel takeaway that I've had as a result of all this thinking is I think realizing the extent to which alignment people or you know AI safet people have conflated a bunch of different meanings of the word alignment in particular it seems to me that a lot of the time people use the word aligned to mean the situation is safe overall uh so you know I would say that from my perspective AI misalignment that is the AI trying to cause uh trying to EV evade your safety mechanisms in order to grab power is indeed a really bad problem it's not clear that the only way to solve that is to make the model not be misaligned uh and by using the word AI alignment for the field it kind of tempts people to think about uh to think about that as the post condition and I think this like leads to a lot of really sloppy thinking so for example I think that a lot of work on scalable oversight uh is very uh you know the people who do it are very confused about whether the thing they're trying to do is ensure that they've built a reward generator process such that models that are trained using it do not start scheming or maybe you know were they trying to uh rule out scheming were they trying to as uh build a reward gendering process that is good even if the model was scheming were they assuming to start with that the model wasn't scheming and hoping to prevent scheming with some other process I think they're just not very clear about this with themselves and in their conversation and I I think that this means that uh a lot of scalable oversight research is kind of uh there's no set there's no coherence set of assumptions under which the measurement they did is an appropriate measurement of something we'd care about in the future for that protocol uh and so yeah I think that like being more precise about when we use the word aligned uh and not using it as a synonym for safety has revealed a lot of like little inconsistencies and how people talk about stuff yeah that that seems fair I think like um yeah it's it's actually related to a thing which I appreciate uh being brought up in this conversation of just like specifying which assumptions were making about AIS for like like for this protocol to work if the AI has properties XY Z then we'll get thing ABC like uh I'm I also think like that's an important thing to have out there that um yeah I think like there's there's kind of an unfortunate so the AI safety Community or like AI safety technical researchers have kind of an activist background like a lot of them have the attitude that they're trying to uh they think a lot of their impact is coming from persuading people that the situation is dangerous overall yeah uh and lot of the time you know they they're really into pointing out the assumptions under which a particular safety proposal would fail yeah and in fact it's very important to understand the situations in which a safety proposal would fail but they seem very uninterested in like cataloging these and thinking about the relationships between them and saying like well that means that under the following uh assumptions things would actually be okay uh and I think that uh because they're so used to trying to do the social move of saying well that doesn't guarantee safety they fail to practice uh thinking through exactly what the properties that you know are insured or like do suffice to make something be sa far yeah I wonder if it's so I don't know I think I have some of this habit and I think partly it comes from thinking of myself is in the business of like aligning arbitrarily smart Ai and like that's like like I think like oh yeah my job is to like get to a stage where we can align arbitrarily smart Ai and like if I'm doing technical research and it doesn't work on arbit smart AI there's like a problem there or at least like you know yeah like maybe it's fine to do it but like you know someone's got to fix that problem totally um and I wonder if it like comes from that habit of Mind where whereas it seems like you're explicitly not in the business of trying to align or trying to get useful stuff out of a lot of these problems a lot of these problems come from the fact that uh AI safety uh the the researchers in AI safety some of them explicitly endorse the we are aiming to come up with strategies that work for arbitrarily powerful AI but almost all of the work which actually goes on especially by people who are competent researchers or are like employed or something uh definitely does not apply directly to uh working on arbitrarily powerful Ai and comes from an intellectual tradition that is not thinking about uh arbitrarily powerful AI most of the time so for example Paul Cristiano when he's thinking about uh what Safety Research is good to do in practice uh is like absolutely not thinking of his job as uh or like right now he thinks of his job as Building Safety proposals the work for arbitrarily powerful AI because he's doing worst case Theory at because he's doing worst case Theory at ARC Theory but that's absolutely not how he thinks about empirical work or how empirical work should be or like what you need to do in order to minimize AI risk in general uh but I think like people like don't track the distinction between these things and like uh kind of cargo cult uh work that previous safety researchers did under like various other assumptions that they didn't end up hearing about and it's all really a mess yeah one way to put this is I think that um like uh it's I think like basically very little has a very clear story for working on the most powerful AI because it's just really brutal to work on the most powerful AIS and almost everything is going to depend on some sort of bootstrapping story where like maybe it was initially aligned and then something else happened or a more like uh control style bootstrapping story which is like well regardless of whether or not it was initially aligned we like did some work with like the that then results in it becoming safe and like in some sense in that sense we are targeting safety from arbitrarily powerful systems but like none of our work is about that and so it's kind of weird to describe it that way yeah yeah I I guess like I think of like in my mind if you want to the the things that seem promising for arbitrarily smart AI are stuff like doing a good job at agent foundations like trying to understand how we think things should end up thinking um to the extent that we think that like we're doing some weird combination of arbitrarily smart Ai and machine learning like trying to really understand the science of deep learning and like what it's going to do and like hopefully getting some abstractions that are going to generalize you know that are still going to apply when we have like smarter versions of ml models like I'm wondering what you think about those kinds of approaches if you want to say I'm pretty skeptical any of this will work for arbitrarily powerful AI I think it seems like uh yeah pretty dubious I don't know um I don't feel very strongly though I haven't thought about it a huge amount fair enough um well speaking of um things relating to this um I'm curious how it relates to like previous work by Redwood research So like um if I look at your website you've done some existing stuff I thought it was a reasonable thing to try um it seems like uh your existing work is some stuff on like circuits inside neural networks some stuff on like Co scrubbing to I guess evaluate the efficacy of circuits um this to have like a somewhat different flavor I'm wondering like if you have thoughts about the relationship between these lines yeah so I would say that the work we did on interpretability uh was work that we did uh you know we we tried to do some interpretability work uh and then we started looking for good ways of evaluating whether our interpretability work was good or whether anyone's interpretability work was good uh and the qual scrubbing paper was our attempt at coming up with a metric that we thought was sort of objective for whether interpretability work was like uh absolutely successful at explaining what was happening in particular cases uh and then once we had this metric we applied it to some explanations that we produced and were very unimpressed with ourselves we were like these explanations were like quite poor uh and ended up feeling like uh the state of interpretability was worse or like I at least came away thinking that the state of interpretability was much worse than I previously thought um simultaneously uh I was thinking a lot about what the applications of interpretability were uh and we started being like okay if the interpretability explanations that people are producing are not objectively impressive in terms of explaining like in terms of being like highly explanatory hypotheses maybe a direction we should go in in our interpretability work is trying to come up with uh interpretability is or like model internals related techniques that solve a problem that uh you couldn't have solved without using a model internal style technique like it seems like you know another way of saying this is there's basically two ways of justifying your internals based research you could either say we are coming up with an uh objectively good understanding of what's going on inside this neural net or we did something using our internal based techniques that you couldn't have done without our techniques we solved a problem in particular like you have some specific task you were trying to solve not just necessarily trying to produce understanding uhhuh yeah so this this sounds sort of similar to uh stuff Stephen Casper was talking about in his podcast with me about you know interpretability for engineers MH yeah we off yeah I think we came to pretty similar conclusions so so there are two ways yeah yeah and so then we were like well it looks like our absolutely our absolutely evaluated understanding is not very impressive uh and then uh we again using the exclusive Wii uh did some work on uh the measurement T measurement tampering detection which was a paper where we tried to use model internals to solve a particular problem where we thought model internal based methods like had a serious Advantage uh over things that didn't use model internals now the actual details of that paper are kind of complicated and I don't want to get into them hugely much except saying like it's kind of a more structured weak to strong generalization setting uh where it's like similar to the weak to strong generalization work except uh we think there's better reasons to think that the problem is solvable uh and it's more specific uh and around the same time by coincidence we decided to spend a lot of time thinking about the overall what we were thinking of at the time as an alignment plan uh you know a plan for you know suppose we had to come up with some proposal for safely building powerful AI next year with you know no technological advancement what should we do uh and at that point thought about a lot of this rinded a lot uh the majority of the quality adjusted intellectual labor here or whatever uh and then ended up uh we ended up thinking about this control stuff and being like whoa there's like we were previously radically underemphasizing uh the importance of measurability when you're thinking about different protocols you could apply we were previously like equivocating between alignment meaning safety and Alignment meaning the AI isn't trying to kill you and we like really should reason about these things differently uh and then had all these other ideas that felt a lot more basic uh for research that we should do like with the measurement tampering detection work we were basically saying like well you know we should try and come up with some techniques that are like kind of moon shudy or like out there like not at all like the other things that are that are happening uh and then after doing the control thinking we felt more like uh we could do useful work by doing techniques that were more absolutely basic but in this uh setting that we thought was particularly underrated and particularly important to advocate for so does that answer your question about how our current work relates to our previous work yeah I I think it does I would say yeah like our thinking has just adjusted a moderate amount I think like I feel like the measurement tampering direction is in principle a really great Direction but we're not doing it for a variety of reasons or we're not currently planning on pursuing that very heavily for a variety of reasons um I think I feel like less good about Mech inter or like our work on mechanistic interpretability um and or like at least in terms of how useful that overall area will ultimately end up being um and like uh yeah I don't know we came away with a variety of takeaways after thinking about safety overall and then ended up doing a variety of projects based on that one additional takeaway I will note that buck didn't mention is thinking that people should really think about uh risk from scheming and deceptive alignment separately from risks from other problems and doing individual analysis on this and I think that like scheming versus non- scheming risk is like a really good decomposition that people have historically I think not talked about people sometimes say said outer alignment versus inner alignment but I think that's actually like the wrong way to cut or it's like it's not a very good cut for a wide variety of reasons um and like I think that like what is your threat model is your threat model scheming versus non- scheming in the case where it's non- scheming there's like a wide variety of solutions even though like both problems can involve like concerns related to generalization and concerns related to like imperfect ability to evaluate actions and things like this m how so uh you're drawing a distinction between like concerns about scheming and concerns about uh inner alignment how are those different in your mind yeah so here's an example inner alignment problem you can have which is not well described as scheming you train your AI to perform well according to human labelers in a variety of contexts and you trained your AI for a while and its actual behavior in that training process was acceptable in an absolute level like as in it was just doing things that were like pretty reasonable and good uh but simultaneously it was doing things that were like maybe slightly suboptimal that slightly exploited your oversight process and this means that the AI you end up getting is one that's like reasoning about the oversight process and is like trying to uh do various exploitation and then in some future situation where there is an egregious exploit it could do that is actually dangerous it does like sees the opportunity to do so and causes problems okay so in some sense this isn't well described as a like this isn't really that well described as a outer misalignment problem and that your issue wasn't centrally like you missed a bunch of bad behavior you like fa to label it as bad but instead was like the AI generalized in a really problematic way based on what like inner objective it had or like what how like its heris sixs or like motivational complexes um but at the same time it's not really like super well described as either because if you did have perfect feedback it would have solved the problem or if you had better like Mastery over the internals and what the internal structure was that could have also solved the problem and like it just like didn't decompose that well okay so so for you that counts as scheming but like no no that oh that doesn't count as that specific case doesn't count as scheming because the reason why the AI took bad actions eventually was was not that it was like uh in like playing nice it was just that it was like doing the same thing always like in the simplest case it was like I want to achieve this proxy which was historically correlated with what got reinforced okay um and then like in some new case the best way like the way that naturally generalizes would be like sees control okay so so so for you scheming is when you're like trying is when you suddenly start trying to do something different than what you were trying to do previously no scheming is like when I'm trying the reason why I'm performing well is because I have long run aims that I want to achieve okay so I think uh also called like deceptive alignment or whatever like the reason why the AI looks good isn't because like it's just like doing the thing on its own merits it's because it's like if I look good now then that will acquire power and influence for me in the future or maybe for a different AI in the future and that's like pretty different um so like in the case I was describing earlier the reason why the AI performs well is it's like for example I'd like to make the human press the Thumbs Up Button uh for example though like probably it's some more complicated Mash of heris stics that get is historically correlated with like um performing well right so so somehow inner and outer alignment is sort of talking about the distinction of like is the thing you were worried about is captured by the you know reward function where scheming is like is the reason you did that thing like because you wanted to achieve some crazy outcome later or is it like you know just the obvious yeah another I feel like another General takeaway is like um I think people in the safety Community often uh have things that are better thought of as threat models which they then think about as like uh things you're trying to achieve for safety so I think like outer alignment versus inner alignment is an example where I think it's it is clarifying to think about misalignment due to bad Rewards and misalignment which is not due to bad rewards and I think those are useful frames or like can be useful decompositions but I think being like our goal is to achieve outer alignment is like a pretty confusing way to think about the problem which is of course in some sense not a very novel take I think people have expressed somewhat similar things in various contexts sure same for alignment versus misalignment also okay yeah um so if somebody's listening to this and they're interested in like you know trying to work on the control agenda trying to push it Forward what would you encourage them to do uh I mean it's plausible that suppose they're a researcher and they want to research control stuff uh it's plausible they should do research projects on this I think there's a lot of obvious follow-up work to do on our paper uh a couple of people are working on stuff like that I mean including us um they are welcome to email me for advice on doing control projects I think there's like reasonable ml papers you can write about control right what what are some examples uh so for example the thing that I'm doing literally right now uh is a followup or sorry literally right now I'm recording a podcast uh is similar to the control setting is similar to our previous paper except instead of the task being writing code and the catastrophe being uh the code being back doored the model is an ele uh is an agent uh in a bash shell uh solving various random devops problems for you and the catastrophe is it uh running malware that it downloads from the internet uh this is just a different setting it's going to uh demonstrate different failure modes and uh different Protocols are going to turn out to be the best and then I think from doing this project we're going to learn some things we didn't previously know about which protocols like you know the extent to which protocols work in a variety of situations which things generalize from our previous paper in which things don't okay uh to choose an example of a very different type of thing uh you could try to do a control paper where there isn't a catastrophe that it's unacceptable to happen even once you could do what we would call a low stakes control experiment uh where you're uh where the safety is just the proportion of actions which were taken that were bad so you can just take our setting except imagine that uh after submitting a back door Cod you just get like one safety demerit Point yeah uh but it doesn't the setting doesn't finish like the scenario doesn't finish and that's just another setting where the protocols are going to be somewhat different sure um and closely related to that I think the central case in which you see like low stakes type stuff is like you wanted to do a bunch of work with your AI and you were worried that it would be systemat ially sabotaging that work and so we're pretty interested in work that looks like uh you have AIS do like for example the nicest case would be like you have AIS do like a bunch of small research tasks and then the question is like can a red team create an AI that like systematically sabotages those research tasks such that you like um yield like incorrect conclusions a high fraction of the time and it would be like fine if you got incorrect conclusions like 0.1% of the time but it would be like really sad if you got incorrect conclusions like 10% of the time time um and like the question is like how much can you cut down that percentage of the time that this the like uh answers are like intentionally wrong or like yeah this sort of thing other work you could do is if you work at an AI lab and are influential in them writing their responsible scaling policies or analogous documents uh you could consider uh trying to uh you could consider uh trying to ensure a safety via a control type approach uh if you are interested in thinking about Reg regulations that would ensure safety you might consider using these types of ideas in your regulations if you do technical research at safety Labs you might consider uh thinking about that research in this Frame like a lot of work people do on uh that they label a scalable oversight work or a couple of other types of things uh can be framed in terms of control and I think the work would be often somewhat better if they explicitly adopted this Frame and did control evaluations also yeah free uh feel free to contct us if you want to talk about some of this stuff um yeah or like uh particularly about the yeah if you're like a technical safety researcher who is like about to do your next project definitely feel free to contact us um and I think another thing that's like closely related is that I think like in Labs rsps or the equivalent of rsps they should have policies for what they do if they catch the AI like as in we've one of the like key things is that we've talked about is like okay our goal will be to ensure that we catch the AI before it causes problems but we should also have a proposal for what's going to happen when we catch the AI and I think this is somewhat under discust of like well we're going to want to be able to make sure that we have some probability of catching thei and then also we need to have a plan for what happens and we need to like be ready to shut things down and be ready to respond and have all of that set up in ahead ahead of time right I want to clarify re the previous point that we do in fact talk to people from AI Labs rather than merely appear on aerp hoping that they uh reach out to us fair fair enough um so speaking of things you might want to clarify um before we close out the interview I'm wondering is there anything that you guys wish that IID asked I mean I think there are many topics to talk about but I don't think there's any specific one that seems that important fair enough um well if people listen to this podcast they're like interesting interested in following you or Redwood research in general um or maybe they want to reach out how should they do that they can find our emails they can follow me on Twitter they can follow Ryan on Les wrong we're probably going to spin up a Blog at some point but we haven't gotten around to it yet um and so you can follow our blog which is just going to be our less wrong posts but on a different website so if you prefer a different website for some reason okay um and it's possible that give given uh production timelines it's possible that the blog will even be out by the time this episode airs who knows indeed um and your emails I'll include that in the description of this episode great y all right well uh buck and Ryan uh thank you very much for coming on axer okay thanks this episode is edited by Jack Garrett and Amber ores held with transcription the opening and closing themes are also by Jack Garrett this episode was filmed at a location provided by light infrastructure financial support for the episode was provided by the long-term future fund and light speed grants along with patrons such as Alexi maaf and Tor barstad to read a transcript of the episode or to learn how to support the podcast yourself you can visit axr p.net finally if you have any feedback about the podcast you can email me at feedback atrp [Music] [Laughter] Don [Music]
Related conversations
AXRP
28 Mar 2025
Jason Gross on Compact Proofs and Interpretability

This conversation examines technical alignment through Jason Gross on Compact Proofs and Interpretability, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med 0 · avg -1 · 139 segs
AXRP
1 Mar 2025
David Duvenaud on Sabotage Evaluations and the Post-AGI Future

This conversation examines technical alignment through David Duvenaud on Sabotage Evaluations and the Post-AGI Future, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med -9 · avg -7 · 21 segs
AXRP
1 Dec 2024
Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med -6 · avg -7 · 120 segs
AXRP
7 Aug 2025
Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med 0 · avg -5 · 133 segs
Counterbalance on this topic
Ranked with the mirror rule in the methodology: picks sit closer to the opposite side of your score on the same axis (lens alignment preferred). Each card plots you and the pick together.
Mirror pick 1
AXRP
3 Jan 2026
David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Spectrum vs this page
This page -18.16This pick -10.64Δ +7.52
This pageThis pick
Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript)
Med 0 · avg -0 · 108 segs
Mirror pick 2
AXRP
7 Aug 2025
Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Spectrum vs this page
This page -18.16This pick -10.64Δ +7.52
This pageThis pick
Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript)
Med 0 · avg -5 · 133 segs
Mirror pick 3
AXRP
6 Jul 2025
Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Spectrum vs this page
This page -18.16This pick -10.64Δ +7.52
This pageThis pick
Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript)
Med 0 · avg -4 · 72 segs