Adam Shai and Paul Riechers on Computational Mechanics
Why this matters
Auto-discovered candidate. Editorial positioning to be finalized.
Summary
Auto-discovered from AXRP. Editorial summary pending review.
Perspective map
[Interactive chart. Markers: amber = most risk-forward score, white = most opportunity-forward score, black = median perspective for this library item.]
Episode arc by segment
[Interactive chart: one bar per segment in transcript order (early → late, evenly spaced, not clock time); bar height = spectrum position; bar colour = band on the amber → cyan → white strip (risk-forward → mixed → opportunity-forward), using the same lexicon as the headline.]
Across 92 full-transcript segments: median 0 · mean -1 · spread -15–0 (p10–p90 -6–0) · 0% risk-forward, 100% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: medium.
- Emphasizes safety
- Emphasizes AI safety
- Full transcript scored in 92 sequential slices (median slice 0).
Editor note
Auto-ingested from daily feed check. Review for editorial curation.
Episode transcript
YouTube captions (auto or uploaded) · video sUpAssKC-L0 · stored Apr 2, 2026 · 2,685 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/adam-shai-and-paul-riechers-on-computational-mechanics.json when you have a listen-based summary.
Hello, everybody. In this episode I'll be speaking with Adam Shai and Paul Riechers, co-founders of Simplex. Simplex is a new organization that takes a physics-of-information perspective on interpretability, aiming to build a rigorous foundation for AI safety. Paul has a background in theoretical physics and computational mechanics, while Adam has a background in computational and experimental neuroscience. For links to what we're discussing, you can check the description of this episode, and a transcript is available at axrp.net. Paul and Adam, welcome to AXRP.

Thanks for having us.

Thank you.

So you two work on applying computational mechanics to AI safety. What is computational mechanics?

I'm happy to give a bit of background. Computational mechanics basically grew up within physics, mostly out of chaos theory and information theory. Physics has always been concerned with prediction: we want to write down equations that tell us, if we know the state of the world, what that implies for the future, as planets move around and all these things. Then came a pretty serious challenge to that from chaos theory: even if you have deterministic equations of motion, there's a finite horizon to how much you can predict. I don't know if I'd say physics was thrown into turmoil, but it was a serious challenge concerning the limits of predictability, and how you would go about doing that prediction as well as possible. That's the history out of which computational mechanics has grown. It's now a very diverse set of results, ideas and ways of thinking about dynamics, but at its core is the question: what does it mean to predict the future as well as possible? And as part of that, what does it mean to generate dynamics, and how is generating different from predicting? If you have some generative structure, is it harder to predict? A lot of this has been quantified, and a lot of the work has been done at this ultimate limit of what's possible, but there's also more and more work on what resource constraints imply for your ability to predict. You haven't asked yet, but I'd also be happy to share how this is relevant to AI and machine learning.

Sure.

Maybe this is obvious to listeners, but we're now training AI models to predict future tokens from past tokens as well as possible, and while people are poking around trying to understand what's going on inside them, there's a theory that was developed specifically to address that. A lot of the mathematical framework maps on surprisingly easily: we've come away with results where we were honestly surprised it worked out so well, and it's a great framework to build upon. I wouldn't say computational mechanics is the answer to understanding all of interpretability and AI safety, but it's really a great foundation, and it helps us ask better questions and see where research should go.

Sure. A bit more specifically: if I hear this pitch of "we're interested in understanding how to predict the future as well as possible", someone might think, "hey, that's Bayesian statistics; we've had Bayes' theorem for like a hundred years. What else is there? You're going to have some prior, you're going to have some likelihood, and then you're just done."

Exactly, and that's my point: computational mechanics helps us understand what the things are that you're doing updates over. From the start I've been saying that we apply this because models are predicting the future as well as possible, but what we actually do is train them on a next-token cross-entropy loss. What does that mean? People have argued about whether these models can have a world model, and it's unclear what people even mean by a world model, or whether they're stochastic parrots, and it's not even clear what people mean by that. One of the advantages here is to take seriously the implications of doing well at next-token prediction. There's a theorem, actually a corollary in a paper from, I think, 2001 by Shalizi and Crutchfield, which, translated into modern language, says that doing as well as possible at next-token prediction implies that you actually predict the entire future as well as possible. So then computational mechanics comes into play. It's more than Bayesian inference because you want to know: what about the past do you need in order to predict the future? If you had no resource limitations, you'd just hang on to all of the past, and there would maybe be a mapping from pasts to futures. But with any token alphabet, the number of possible pasts grows exponentially with their length. That's intractable, so you somehow need to compress the past. What about the past should be coarse-grained? There's a mantra that sounds obvious but has interesting mathematical implications when you take it seriously: for the purpose of prediction, don't distinguish pasts that don't distinguish themselves for the purpose of prediction. That means that if two pasts induce the same probability distribution over the entire future, you just clump those pasts together. And if you care about lossy prediction, you can look at that space of probability distributions over the future and see which histories are nearby in that space.
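To make "clump together pasts that induce the same future distribution" concrete, here is a minimal sketch in Python. The golden mean process (no two 1s in a row) is a standard textbook example chosen for illustration, not one from the episode, and the brute-force enumeration stands in for the real construction; it recovers exactly two causal states, histories ending in 1 versus histories ending in 0.

```python
# Toy sketch of causal-state construction: group histories that induce the
# same conditional distribution over the future.
from collections import defaultdict
import itertools

def golden_mean_prob(seq):
    """P(seq) under the golden mean process, assuming we start as if the
    previous symbol were 0: after a 1 the next symbol must be 0; after a 0
    the next symbol is 0 or 1 with probability 1/2 each."""
    p, prev = 1.0, 0
    for s in seq:
        if prev == 1:
            p *= 1.0 if s == 0 else 0.0  # a 1 forces the next symbol to be 0
        else:
            p *= 0.5
        prev = s
    return p

L, F = 4, 3  # history length, future length
states = defaultdict(list)
for past in itertools.product([0, 1], repeat=L):
    p_past = golden_mean_prob(past)
    if p_past == 0:
        continue  # impossible history (contains "11")
    # Conditional distribution over all length-F futures given this past.
    cond = tuple(
        round(golden_mean_prob(past + fut) / p_past, 10)
        for fut in itertools.product([0, 1], repeat=F)
    )
    states[cond].append(past)

# Pasts that induce the same future distribution get clumped together.
for i, (dist, pasts) in enumerate(states.items()):
    print(f"causal state {i}: {len(pasts)} histories, e.g. {pasts[0]}")
```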
Sure. Maybe a thing that would help my understanding is if I just say what I've gathered the interesting objects in computational mechanics are, and you can tell me where I'm wrong, or, perhaps more likely, what interesting things I'm missing. Does that sound good?

Yeah.

Cool. Here's what I see computational mechanics doing, just from a brief glance. It seems like it's really interested in stochastic processes, a kind of time-series inference: a bunch of things happen, one thing and then the next thing and then the next, and you want to predict the future. That's as opposed to forms of Bayesian inference where you're inferring, say, the correct neural network underlying some independent and identically distributed data, or various other things you could do with Bayesian inference. In particular, it's interested in hidden Markov models: I see a lot of use of hidden Markov models, where basically there's some underlying thing with a set of states, the underlying thing is transitioning from state to state in some kind of lawful way, and those transitions are associated with emitting a thing that you can see: each time a transition happens, you see a thing emitted. Basically, in order to do well at predicting, you've got to understand the underlying hidden Markov model and what state it's in; if you could understand what state the hidden Markov model was in, then you'd do a really good job of prediction.

I then see this construction of a thing called an epsilon machine, which as far as I can tell is kind of the ideal hidden Markov model of a certain stochastic process. The states of the epsilon machine are somehow the sufficient statistics: like you were saying, grouping together all pasts that lead to the same distribution over futures, those are just the states of the epsilon machine. And there's some sort of minimality: those are precisely the states you need to distinguish between; if you had any more states, it would be too many, and if you had any fewer, it would be too few to actually capture the process. So you have stochastic processes, there are hidden Markov models, there are epsilon machines. Then I think there's something about what I'd call inference hidden Markov models: if I'm seeing the outputs of a stochastic process, just a sequence of heads and tails, or tokens, or whatever, there's some process by which I come to have a probability distribution over what state the thing might be in, and I update those distributions, and that itself works kind of like a hidden Markov model, where the states are my states of knowledge and the transitions are: I see a token, and I move from this state of knowledge to that state of knowledge. So I also see this construction of going from the underlying generating process to the process of inferring things, and I guess there are some interesting similarities and differences between the inference process and the generation process. You probably have a better name for these than "inference hidden Markov models"; that's just what I came up with.

Maybe "mixed-state presentation", yeah.

My impression is that computational mechanics takes this whole picture, these stochastic processes, hidden Markov models, epsilon machines and inference hidden Markov models, and then asks: how do I calculate interesting things about these processes? How do I actually find, I don't know, stationary distributions, or get observables or higher-order statistics out of them? That's what I've gathered from computational mechanics. Does that seem right? Am I missing stuff?
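As a companion to the picture just described, here is a minimal hidden-Markov-model sampler. The labeled-matrix convention T[k][i, j] = P(emit k, move to state j | state i) is a common one in this literature, and the two-state machine below (the same golden mean process as in the earlier sketch) is an illustrative choice, not anything specified in the episode.

```python
# Minimal sketch of the HMM picture described above: a hidden state
# transitions lawfully, emitting a visible symbol at each step.
import numpy as np

rng = np.random.default_rng(0)

# T[k][i, j] = P(next state j, emit symbol k | current state i).
# Summing T over k gives a row-stochastic state-transition matrix.
T = np.array([
    [[0.5, 0.0],   # symbol 0
     [1.0, 0.0]],
    [[0.0, 0.5],   # symbol 1
     [0.0, 0.0]],
])

def sample(T, n, state=0):
    symbols = []
    n_states = T.shape[1]
    for _ in range(n):
        # Joint distribution over (symbol, next state) from the current state.
        joint = T[:, state, :].ravel()
        idx = rng.choice(joint.size, p=joint)
        sym, state = divmod(idx, n_states)
        symbols.append(sym)
    return symbols

print("".join(map(str, sample(T, 30))))  # golden mean: no two 1s in a row
```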
That was pretty good. There's a lot there, so I'll just go at it without any particular order. Importantly, you brought up the epsilon machine. One extra thing we can add to what you said is that there's a distinction between minimal for the purposes of predicting and minimal for the purposes of generating. The epsilon machine is the minimal machine for the purposes of prediction; in general, you can get smaller machines for generation.

And a "machine" here is like a hidden Markov model?

I think it can be a little more general than that, but it's fine to think of it as an HMM. So the mixed-state presentations are HMMs in this framework, and the epsilon machine is one of them, and notice that they all generate the same dataset. You have data, for instance all of the training data you train a transformer on, and in theory (not for a transformer in practice, but in theory) these can even be infinite strings. If you want to make an HMM that generates that data, there's kind of an arbitrary choice: a lot of the time there are many, many HMMs that can generate that data, an infinite number of them, in fact. In computational mechanics these are called different presentations. Paul, in the intro of one of his papers, makes a really great point, a deep conceptual view that I really love about comp mech, which is that different presentations allow you to answer different questions about a process and about the structure of the data with different amounts of ease. If you want to ask questions about the predictive structure and what information you need to predict, then an epsilon machine, or a mixed-state presentation, is a very good way to answer that question. If you have some question that's different, about generation or about pairwise correlations, then different presentations will be more useful for that. So there's this notion that the way you present the data, the choice of exactly which machine you use to generate it, allows you to answer different questions, even though it's the same exact data with the same structure in some sense. Those are just a few thoughts.

Yeah, there were a few things in what you said that I wanted to at least comment on. The first is, I think, a common misconception. It's natural to say, as you did, that computational mechanics seems to be about HMMs. Actually, HMMs turn out to be the answer to a particular question you can ask within computational mechanics. The data-generating structure could be essentially anything: you could think through the Chomsky hierarchy, or ordinary differential equations, or whatever generates the thing. Then you ask the question again: what is it about the past that I need to remember to predict the future as well as possible? You get these minimal sufficient statistics, which are coarse-grainings of the past, and then you look at what the meta-dynamic is among those belief states about the world, like you mentioned. And that thing turns out to always be describable as a hidden Markov model, if we allow ourselves infinitely many states.

You can think about this via Bayesian updating. Even if you know the data-generating process, you don't know the hidden state of the world. You just wake up in the morning with this impoverished view; you need to take in data and synchronize to the state of the world. So you have something like a model of the world, your data so far induces some belief state, and as you take in more information, how do you update your beliefs? Of course, just Bayes' rule: Bayesian updating. And from a particular belief state there is a unique answer for where you go, given the new data. So that induces a hidden Markov model, and in fact a particular type of hidden Markov model, which in the lingo of computational mechanics is "unifilar". In the theory of automata this is called "deterministic", which is a little confusing because we're talking about stochastic processes, so unifilar is a better name: you follow a single thread through the states of this belief structure.

So just to clarify: it's stochastic in the sense that the states of knowledge are probabilities over what's going to happen in the future, but it's deterministic in that if you see a thing, there's just a deterministic next state of knowledge you go to, a deterministic way to update your state of knowledge?

Right. And there's typically some stochasticity in terms of what the world throws at you. So if you look at an ensemble of realizations, there would be randomness in the transition rates between belief states, but within any single realization there's a definite thing you do to your beliefs.

And what's the role of the epsilon machine here?

I think, especially for making the connection with understanding neural networks, the epsilon machine isn't as important. This mixed-state presentation, this updating of beliefs, is the primal thing we care about, and it doesn't even require stationarity or anything like that. If you do have stationarity, then there will end up being some recurrent structure in the belief-state update, and those recurrent belief states are the causal states of the epsilon machine. I think it's a bit of an artifact of the history of where comp mech came from that the epsilon machine was so central; for the ways comp mech is currently useful, it's not as important. But it's still interesting to ask what the recurrent states of knowledge are that you'd keep coming back to for something stationary, which is maybe relevant if you're reading a long book or something.
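The Bayesian update just described, from a belief over hidden states plus one observed symbol to the next belief, has a compact matrix form: the new belief is proportional to belief @ T[symbol]. A minimal sketch, reusing the illustrative golden mean matrices from the sampler above:

```python
# One step of the mixed-state presentation: Bayes' rule maps a belief over
# hidden states, given an observed symbol, to the next belief.
import numpy as np

def update_belief(belief, T, symbol):
    """One Bayesian update: new belief ∝ belief @ T[symbol]."""
    unnorm = belief @ T[symbol]
    p_symbol = unnorm.sum()  # probability of seeing `symbol` under `belief`
    if p_symbol == 0:
        raise ValueError("symbol is impossible under the current belief")
    return unnorm / p_symbol

# Golden mean matrices from the sampler above (illustrative).
T = np.array([
    [[0.5, 0.0], [1.0, 0.0]],  # emit 0
    [[0.0, 0.5], [0.0, 0.0]],  # emit 1
])
belief = np.array([2 / 3, 1 / 3])  # stationary distribution as initial belief
for sym in [0, 1, 0, 0]:
    belief = update_belief(belief, T, sym)
    print(sym, belief)  # seeing a 1 snaps the belief onto the second state
```

Because this machine is unifilar, a few observations synchronize the belief to a definite hidden state; the non-unifilar case discussed later is what keeps beliefs spread out indefinitely.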
OK, so can I try out my new understanding, and see if I've acquired any novel misconceptions? Here's my new understanding. You've got stochastic processes, which are just things that generate sequence data: the probability distribution over books that are going to be written, where in this frame a book is a really long sequence of words, or sequences of states of some chaotic process. Just sequences of stuff, where you want to understand the probability distribution over these sequences, what's going to happen in the future given the past. And it turns out that hidden Markov models are kind of a universal way of describing these stochastic processes: if you have a stochastic process, you can create a hidden Markov model that generates it; in fact, you can create a bunch of hidden Markov models that generate it. There's also this process of updating your beliefs, given data that you've seen, and that also looks kind of like a hidden Markov model, maybe not exactly, because things are coming in instead of going out, but something like that. And computational mechanics, as I currently understand it, is something like: how do we relate these stochastic processes to hidden Markov models? You can construct a bunch of hidden Markov models that correspond to the same stochastic process; which one do we want to work with, and which one is most useful for the various mathematical goals we might have for understanding the sequence? If we want to understand the inference process, we can think of that in terms of some hidden-Markov-model-like thing too. And then there's this business of understanding what actually happens; maybe you haven't said this, but I imagine there might be questions like: how many states are there? What are the dynamics of belief updating, and where do you go? That's what I now hypothesize computational mechanics is. How does that sound? What am I missing now?

I'd say you've hit on a lot of the relevant things. Maybe even more interesting topics will come up when we think about what any of this has to do with neural networks, because, like I said, computational mechanics is a very rich field; there are semester-long courses on it, so I don't think we could do it justice right away. But I think we can get to a lot of the neat ways the theory is being used and extended by us and our colleagues. What we've done so far is basically this: we have a new angle on understanding the internal representations of neural networks, and also, something which I think will become clearer in our upcoming work, on the behaviors of models: what to anticipate in terms of model behavior and in-context learning. And there's really an amazing number of ways to adapt the theory, which is getting us to answer questions that, I'd say, weren't even obvious to ask before.

Cool.

Yeah, I had a thought, because of something you said. One of the things we've been talking about recently is the work that's been going on in interpretability that uses the concept of world models: what the form of that work is, trying to get closer and closer to what a world model actually is, and what kind of thing people do when they talk about world models in interpretability work. This comes back to this issue of where HMMs fit into comp mech, and whether we're kind of assuming an HMM or something. One of the things I've been thinking about is: even if you don't have stochastic data, if you have some other data structure, and there are a lot of data structures one could have, you have some implicit or maybe explicit structure of the world, and you want to ask whether a transformer somehow represents it in its internals. Well, no matter what that data structure is, if you're going to train a transformer on it, at the end of the day you have to turn it into sequential data, right? It takes sequences of tokens. So you can go in and probe the model internals for structures associated with this original data structure in your head, which might not have anything to do with sequences. That's a fair game to play, and I think people sometimes do exactly that. For instance, you can think of Othello as not really being about sequences but about some other type of data structure, about states of the board, where the top-left corner is black, white or empty. Then you can go in and probe your Othello-GPT to see if that's linearly represented in the residual stream. But when you train Othello-GPT, you are transforming the game of Othello into a specific tokenization scheme that gives you specific sequences, and those sequences are naturally thought of as an HMM. So, and this work isn't done, these are more thoughts and things we hope to do, we should be able to take results like that, reframe them from a comp mech point of view, run our analysis, and apply our framework to hopefully even make theoretical claims explaining the results they've gotten: that certain probes for properties of the game come out linearly represented while others come out nonlinearly represented. If we take our findings seriously, that this belief-state geometry (I guess we haven't spoken too much about that yet) is the thing the transformer is trying to represent, then hopefully we can explain a bunch of these other results people have been finding.

I think I first just want to make sure that I really understand what this theory is before I move on, though I'm chomping at the bit to get to that. OK, here's a really basic question: why is it called computational mechanics? Does it have something to say about computation, or about mechanics?
That's a kind of interesting historical note. I think it makes more sense when you realize it's called computational mechanics in distinction to statistical mechanics.

OK, that makes some sense.

What's statistical mechanics? In physics there's this theory of thermodynamics, which is a high-level theory: you just look at macrostates and how they evolve. Thermodynamics doesn't require a microscopic description. But then it turned out that, if you believe in atoms, and this is part of why people started believing in atoms, you can actually explain some of the thermodynamic behaviors, and figure out material properties and so on, basically just by saying how random something is. We have entropy, which quantifies that, and by saying how random something is we can derive all of these properties. But it's kind of a static view of what's going on: you just look at the statistics. When the question shifted to the dynamical aspects, the computational aspects, the idea is that nature is somehow intrinsically computing. You can think of Shannon's channel that information goes through; you can also think of the present as a channel that the past must go through to get to the future. If you take this on, there's some sense in which nature is naturally computing the future from the past. So that's the idea behind computational mechanics, and it has taken up and built upon a lot of ideas from computer science and information theory. Unfortunately, it also shares its name with something quite different, which is, I don't even know exactly what it is, something like using computers to model Newtonian mechanics. So that's a bit confusing, but the computational mechanics we're talking about here is the one named in distinction to statistical mechanics.

Got it. I guess the final thing I want to ask about is something you mentioned a bit offhandedly: this idea of doing prediction under some sort of resource constraints, maybe not being able to distinguish certain states of the world, or maybe something else. I'm really interested in that, because I think it's a question that Bayesian statistics doesn't have an amazing off-the-shelf answer for, and it's obviously really relevant to AI. Can you tell us a little about it?

Yeah, we're also really interested in this. I mentioned that some work has been done on this in comp mech, though I'd say it's kind of underdeveloped. Sarah Marzen and other colleagues have worked on rate-distortion theory and how it applies in this dynamical framework of computational mechanics, when your memory is limited.

So what is rate-distortion theory?

Conceptually, you can imagine a plot where on the x-axis you have your memory capacity and on the y-axis you have your accuracy at predicting the future. There's some Pareto-optimal curve you can reach: for a certain memory constraint, you can only get so far in accuracy, and some curve describes the optimal trade-off there.

The shape of the curve, and, I guess, how you get to the curve. And so you were saying there was some work by Sarah Marzen and others taking this sort of rate-distortion view and applying it to dynamical processes?

Yeah. Say you take the causal states, these ideal belief states: how would you compress those so as to still predict as well as possible given the memory constraints? This is related to information-bottleneck methods; there's this causal bottleneck work. So that's one way in which you can think of a constraint. There are memory constraints, but there can be all sorts. And there's an interesting avenue on the physics side: if you allow yourself something like quantum states, quantum information can somehow compress the past better than classical information can. So there's this interesting branch of computational mechanics which says, actually, we can use quantum states to produce classical stochastic processes with less memory than is otherwise classically necessary.

Interesting. That sounds very tangential.

A bit of a teaser, actually: we think neural nets do this. That's a bit wild, and again, it's something you just probably wouldn't think to look for if you weren't building off this theory.
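As a rough illustration of the memory-versus-accuracy trade-off described above, the sketch below coarse-grains a cloud of belief states into k clusters and measures the predictive distortion the compression introduces. k-means is used here as a stand-in for the actual rate-distortion constructions in the literature (this is not Marzen and colleagues' method), and all of the distributions are synthetic.

```python
# Sketch of the memory/accuracy trade-off: replace each belief state with
# one of k cluster centroids, then measure how much prediction degrades.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Pretend these are belief states (distributions over 3 hidden states)
# visited by an optimal predictor; `emit` maps hidden states to symbols.
beliefs = rng.dirichlet(np.ones(3), size=500)
emit = rng.dirichlet(np.ones(2), size=3)   # P(symbol | hidden state)
next_sym = beliefs @ emit                  # P(symbol | belief)

for k in (2, 4, 8, 16):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(beliefs)
    # Replace each belief with its cluster centroid, then re-predict.
    approx = km.cluster_centers_[km.labels_] @ emit
    # Distortion: mean KL divergence between true and compressed predictions.
    kl = np.sum(next_sym * np.log(next_sym / approx), axis=1).mean()
    print(f"k={k:2d} memory states -> mean KL distortion {kl:.4f}")
```

More clusters (more memory) buy lower distortion, tracing out a crude version of the Pareto curve described above.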
Sure. I guess, to triangulate computational mechanics a little more, I want to compare it to two other things that sound kind of similar. The first: people who have followed the AI safety space, or who listened to an episode of this podcast from a few back, might be familiar with singular learning theory. This is the most recent really mathy way of thinking about Bayesian statistics that has taken the alignment world by storm. How do you see computational mechanics contrasting with singular learning theory, both in what the theory even is or is trying to do, and in how it might apply to AI systems?

I have a quick comment, and then maybe Adam wants to elaborate. At a very high level, computational mechanics is a model-agnostic theory of prediction, whereas I think SLT takes on very seriously that you have a particular model, and indeed a particular parameterization of the model, and asks: then what is learning like? Singular learning theory is a Bayesian theory, so the way it's sometimes applied to talk about actual learning with stochastic gradient descent confuses me a little, but I think it's a genuine mathematical theory that will likely say something useful about inference, and maybe it can be complementary to comp mech in addressing some of those aspects. Do you have further thoughts on that?

Yeah. First of all, I'm not an expert at all on SLT, although I've been highly inspired by them, and by the entire group at Timaeus. I think they ask a lot of questions that are very similarly motivated. For instance, I think they really do a good job of talking about their motivations concerning the computational structure of these systems and how we can get a mathematical and theoretical handle on it. But I think their tactic and framing, and this might be more of a vibes thing, since again I'm not an expert in the math, so I really can't speak to that, naturally takes a view that has more to do with learning and the learning process. I think they want, at the end, to say something about the structure of that learning process, and the geometry of the local loss landscape, and how that relates to the computational structure at a particular weight setting. Whereas comp mech, at least most naturally, takes a view that has to do with the structure of the inference process directly, though of course one can think of extensions; Paul has already done some work extending it toward the learning process. So in some sense I think they're trying to get at a very similar set of questions, and a lot of these are what I consider deep conceptual questions at the heart of how AI systems work: what does it mean to talk about the structure of computation, what does it mean to talk about the structure of training data, but from different starting points: the learning process versus the inference process.

Right. In some ways they seem sort of dual: figuring out the weight-updating process, versus where fixed weights should take you. The other thing I want to ask, to triangulate, again relates to a somewhat recent episode of AXRP: when I hear people talk about having a really nice mathematical Bayesian theory of updating based on evidence, and using it to understand intelligence, one place my mind goes is active inference, this Karl Friston-style work. I'm wondering if there are any interesting connections or contrasts to be brought out there.
I think this will also be a little vibesy. The "active" part of active inference is, as far as I understand it, about an agent taking actions in the world. There's a part of comp mech that has to do with interacting systems, called transducers, and that is so far not what we've applied to neural networks. So that part's a little different, if you take the non-transducer part of comp mech and compare it. Also, the way I view active inference, again as a non-expert, and I guess it matters which form of active inference: often it starts with something like "we think the brain is doing Bayes", or "an agent is doing some Bayesian thing", and then gets down to an actual formula by assuming, since it's hard to do Bayes in full, some mathematical way of approximating it, and they arrive at a specific set of equations for that Bayesian updating. So it assumes a certain method of approximating Bayesian inference, whereas we're pretty agnostic to all of that in comp mech. Those are just some thoughts.

Sure. I guess I'm limited in my ability to wait before I dive into your paper and your interesting work, but is there anything you'd like to say about computational mechanics as a theory before I do?

Just to re-emphasize two things. One: as it currently stands, it's not the answer to everything, but it has been really useful. And two: I think it's a great springboard to continue from. We've recently founded this research organization, Simplex, because we believe we can really build on this work in useful ways. Does that contextualization help?

Yeah. Let's talk about building on it. You've recently published this paper, "Transformers represent belief state geometry in their residual stream", by Adam Shai, Sarah Marzen, Lucas Teixeira, Alexander Gietelink Oldenziel and Paul Riechers. What do you do in this paper?

We were motivated by this idea of: what is a world model? What, fundamentally, is the structure that we're training into these transformers when we train them on next-token prediction, and what is the relationship of that structure to a world model? How can we get a handle on these questions? One of my personal, I don't know if it's a pet peeve, but something like a pet peeve, is this seemingly never-ending discussion that happens in the AI community at large: do these things really understand the world, do they have a world model, or not? These are slippery concepts, and it feels like the same conversation has been happening for a while now. Our hope is that if we can get to some more concrete handle, then we can actually know what everyone's talking about, stop talking past each other, and answer some of these questions.

The tactic we take in this paper is to start with a ground-truth data-generating process, so that we know the hidden structure of the world, so to speak, that generates the data we train a transformer on. Then we can go in, first make predictions using the theory of computational mechanics about what structure should be inside the transformer, and then go and see whether that structure is there. So we need a way to go from this hidden data-generating structure, which takes the form of an HMM in our work, to a prediction about the activations in the transformer. We asked the question: what is the computational structure that one needs in order to predict well, given finite histories from this data-generating process? In computational mechanics, that takes the form of something called the mixed-state presentation, which describes how your belief about which state the data-generating process is in gets updated as you see more and more data. That by itself doesn't give you a prediction for what should be inside the transformer: it's just another hidden Markov model. But there's a geometry associated with it, given by virtue of the fact that your beliefs over the states of the generative process are themselves probability distributions. Probability distributions are just a bunch of numbers that sum to one, so you can plot them in a space where each state of the generative process is a different axis, and the points lie in some geometry in that space. In the main example, one of the first examples we used in the paper, featured in the blog post on LessWrong, you end up getting, even though it's a very simple generative structure, this very cool-looking fractal. So it's a highly non-trivial geometry to expect and predict inside the transformer. And what we find is that when we go looking for a linear plane inside the residual stream such that projecting all the activations onto it gives you the fractal, you can find it. And not only can you find it: you can perform that analysis over training and watch how the structure develops and refines into this fractal.

And it's linearly represented!

Yeah, a big deal. We were very excited about that. It means we have some theoretical handle: given the structure of our training data, we know what to expect geometrically in the activations of a transformer trained on that data. So it gives us a handle on model internals and how they relate to training data.
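A minimal sketch of how a belief-state geometry like the paper's can be computed from a data-generating HMM: sample sequences, run the Bayesian belief update from earlier, and scatter the visited beliefs on the probability simplex. The random 3-state HMM here is an illustrative stand-in (generic HMMs are non-unifilar, which is what tends to produce fractal-like belief sets); the Mess3 process used in the paper is a specific parameterized example not reproduced here.

```python
# Run the Bayesian belief update over sequences sampled from a hidden
# data-generating HMM and collect the visited beliefs, which live on the
# probability simplex.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
S, V = 3, 3  # hidden states, vocabulary size

# T[k][i, j] = P(emit k, go to j | state i); each state's outgoing joint
# distribution over (symbol, next state) is drawn from a Dirichlet.
joint = rng.dirichlet(np.ones(V * S), size=S)   # shape (S, V*S)
T = joint.reshape(S, V, S).transpose(1, 0, 2)   # shape (V, S, S)

def sample_and_track(T, n_steps, belief, state):
    beliefs = []
    for _ in range(n_steps):
        probs = T[:, state, :].ravel()
        idx = rng.choice(probs.size, p=probs)
        sym, state = divmod(idx, S)
        unnorm = belief @ T[sym]
        belief = unnorm / unnorm.sum()          # Bayes' rule
        beliefs.append(belief)
    return np.array(beliefs)

pts = sample_and_track(T, 20000, np.ones(S) / S, state=0)

# Project the 2-simplex (beliefs over 3 states) onto the plane for plotting.
x = pts[:, 1] + 0.5 * pts[:, 2]
y = (np.sqrt(3) / 2) * pts[:, 2]
plt.scatter(x, y, s=0.1)
plt.gca().set_aspect("equal")
plt.title("Belief states visited under Bayesian updating")
plt.show()
```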
Gotcha. When I'm thinking about this, one thing I'm paying attention to is what you said about what it means for a thing to have a world model, and I see that part of the contribution of computational mechanics is just distinguishing between the dynamics of generating a sequence and the dynamics of inference: what states you go through during inference. Would it be right to say that a big deal about your paper is saying: here are the dynamics of inference that Bayesian agents should go through when they're inferring what comes next, or rather, inferring the whole future, and, hey, we found those dynamics encoded in the HMM... sorry, encoded in the neural network?

Yeah. One of the points we hope is obvious in hindsight, and I think a lot of good results are obvious in hindsight, so it's hard to feel proud of yourself, is just this: what does it mean to have a world model? We took this question pretty seriously, asking what the mathematics of it is. There are these speculations about whether next-token prediction can do this, and the answer is concretely yes. In fact, not only must neural nets learn a generative model of the world, they must also be learning to do effectively Bayesian updating over those hidden states of the world. And what's more, it's not just tracking next-token probability distributions: the model will differentiate states of knowledge that could have identical probability distributions over the next token. That seems a little odd if you think "I'm just trying to do well at next-token prediction", but again, in hindsight it makes sense: if there will be a difference down the road and you merged those states early on, then even if your loss at this context position is just as good, eventually, as you move through the context, you'll have lost the information needed to distinguish what's coming at you. That's the thing about predicting the whole future: next-token prediction, iterated on and on, eventually means you want to get the whole future right. And not only is there this causal architecture, but, as Adam said, there's this implied geometry, which is pretty neat. I think there are all sorts of hopes here, and we've had a lot of good community engagement: people have suggested that maybe this means we can back out what a model has learned, if we can somehow automate this process of simplex discovery. There are a lot of directions to go that we're excited about, and we think the community can also help out.

Just to add one thing, to reiterate the point you made about the structure of inference versus the structure of generation: for me, at least, this was a super-core conceptual lesson, coming from a neuroscience background. In the corner of neuroscience I was in, we would talk a lot about the cortex's role being to implement this predictive model, but then we would always instantiate that in our minds as a generative model. If you had asked any neuroscientist in that field, including me just a few years ago, "but don't you have to use the generative model to do prediction?", they'd have said, "yeah, of course you have to do inference over it", but it would be treated as kind of a side thing: once you have the generative model, it's obvious how to do inference, and there's not much to it. That would be the implication. But actually, the structure of inference is enormous. From this three-state HMM, and this is a simple example, we get an infinite-state inference structure that has this fractal thing. So the structure of inference itself is not something to just shrug away; in some sense, if you're trying to do prediction, it is the thing, and its structure can be quite complicated and interesting. So I actually think the neuroscientists could learn a lot from this too, although obviously we applied it to neural networks.

Sure.

And I guess there's maybe one other idea that you could usefully latch onto here, which is thinking of predator and prey. You can have a fox and a mouse, and the mouse can be hard to predict even with a simple generative structure: it just does something. All the fox has to do is predict what it's doing, but it's not sufficient for the fox to be able to act like a mouse; it has to actually predict the mouse, and that's maybe some evolutionary pressure for it to have a bigger brain, and things like this. Is this a weird analogy? I think it's actually quite relevant to how smart we should expect AI models to be: it's not just that they need to be able to act like a human, they need to be able to predict humans, and that's a vastly greater capability. We can't even do that with each other too well; I mean, we do a little bit, but it's hard. It's a superhuman task, actually.

Yeah. Have you played the token prediction game?

Oh, I've played various versions of this; I don't know if it's the one you're referring to.

It's just a web app where the game is to literally do what these transformers have to do. You see a prefix, or actually, I think you start with zero prefix: you just have to guess what the next token is. I think it even shows you four possibilities or something, and you have to guess what the next thing will be, and then it gives it to you, and then you have to guess the next thing. And it's really hard. Somehow I'd never made this comparison before, but it's much harder than writing something.

Yeah, for sure. And there's also something interesting in what I'd call the meta-dynamic of doing that prediction, which is that as you get more context, you'll probably be better at predicting what's next: OK, now I have some context. This is quantitative stuff: once we take these ideas seriously, we can say that as you move through the context, your entropy over the next token, if you're guessing optimally, should decrease. How does it decrease? That's actually one of the things I worked on during my PhD in comp mech: there's a way to calculate this via the spectral structure of this belief-state dynamic. So there's something to this: models, since they're predicting, should and will converge in context. That is a definite type of in-context learning. It's, I think, our task to show whether or not it's the same as what other people call in-context learning, but it's a very concrete prediction: especially with non-ergodic data, which is the case for us, you're in general going to get a power-law decay of next-token entropy, and that's borne out empirically.

Sure. OK, there's a ton to talk about here. First I want to ask a kind of basic question. Really just aesthetically: if I weren't trying to read your paper, one of the coolest things about it would be this nice colorful fractal that you have. Why is it a fractal?
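The entropy-decay claim is easy to check numerically for a toy process: track an optimal predictor's belief with the update rule from earlier, and average the entropy of its next-symbol distribution at each context position. Everything here (the random HMM, the constants) is illustrative, and the power-law shape discussed above is a statement about non-ergodic data sources, which this toy is not, so expect decay but not necessarily a power law.

```python
# Average next-symbol entropy of an optimal (belief-tracking) predictor,
# as a function of context position.
import numpy as np

rng = np.random.default_rng(2)
S, V, N_SEQ, LEN = 3, 3, 2000, 40

joint = rng.dirichlet(np.ones(V * S) * 0.2, size=S)  # sparse-ish structure
T = joint.reshape(S, V, S).transpose(1, 0, 2)

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

H = np.zeros(LEN)
for _ in range(N_SEQ):
    state, belief = 0, np.ones(S) / S
    for t in range(LEN):
        # Optimal predictor's next-symbol distribution given current belief.
        pred = np.array([(belief @ T[k]).sum() for k in range(V)])
        H[t] += entropy(pred)
        # World emits a symbol; the predictor updates its belief.
        probs = T[:, state, :].ravel()
        idx = rng.choice(probs.size, p=probs)
        sym, state = divmod(idx, S)
        belief = belief @ T[sym]
        belief /= belief.sum()
H /= N_SEQ
print(np.round(H, 3))  # typically decreasing toward the process's entropy rate
```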
One way to think about that is: what is a fractal in general? A fractal is something like iterating a simple rule over and over; it's the self-similar structure you get out of that. What's the simple rule here? Well, we have a simple generative structure, and you're doing Bayesian updating over it. There are different outputs it could give, and each output gives a different way of doing Bayes updates over whatever your distribution over the states is. So you're doing the same thing over and over, and the geometric implication of that is a fractal. Is that helpful?

Yeah, I think this maps onto what my guess was.

I'll also point out that after the LessWrong post, John Wentworth and David Lorell made a follow-up post explaining that iterated structure, and there's also been academic work explaining the same thing from the point of view of iterated functions, from Alexandra Jurgens. So those are two resources.

Cool. The next thing I want to ask picks up on something you mentioned very briefly: potential follow-up work of finding this fractal structure as a way of understanding what a network has learned. So you have these fractals of belief states from this mixed-state presentation. Is there some way of going from "here's the fractal I found" to the belief dynamics?

I think this is getting at an important point: the fractal is the beliefs; those are not the dynamic among the beliefs.

Yep, not the updates.

Yeah. And I think there's an obvious thing that we'd like to do: now that we've found this, is it natural to do some mech-interp sort of thing of finding a Bayesian-updating circuit? That would be nice. It's not totally clear how this works out, but it's a natural thing to go after, hopefully getting at the thought process of the machine: what are its internal dynamics among these beliefs? That's a great question that we are pursuing, and a lot of it is empirical, because OK, now we have this theoretically grounded thing, but we also need to work with the actual implementation of the transformer architecture: how does it actually instantiate the thing?

I guess I also have a theoretical question: suppose you had this fractal, somehow, to infinite precision. Is it possible to back out the belief HMM, or the stochastic process? Or are there multiple things that map to the same fractal?

There are multiple, degenerate ways in which you can choose to represent a generative structure, but it's all the same stochastic process, and that's the thing that matters: the world model it has. So there really shouldn't be that much degeneracy at all. In the current toy models that we've used, in the simplest framework, we have to project down: we have to find a projection of the residual stream where this fractal exists. Actually, in frontier models the residual stream is probably a bottleneck, so you're not going to have to project down; it's going to try to use all of its residual stream. So in a sense it may be easier, but then it's also doing a probably lossy representation, which we're also trying to get a handle on.

I guess I wonder: if one fractal does correspond to one generative process, maybe we don't even need to do mechanistic interpretability, and we can just say, "here's the geometric structure of the activations; bam, this is the thing it learned".

I don't want to overstate what we're claiming here. For the way I think about what this belief-state geometry is, maybe it's useful to think about features, the way people talk about them, and what the relationship is between features and these belief states, even just philosophically. I think it would be a mistake to claim that the belief states are the features. I think of features, at the moment at least, and they're kind of ill-defined, as the computational atoms that the neural network uses in the service of building up the belief-state geometry and the dynamics between those belief states. And I should say that how a neural network builds the belief-state geometry up will probably be highly, highly dependent on the exact mathematical form of the network being used. In transformers, you have a residual stream with an attention mechanism, and the attention mechanism is literally the only place the transformer can bring information from the past to the current token position. It has a certain mathematical form; strangely, a lot of it is linear, though not totally, obviously. That mathematical form puts constraints on what information can be propagated from the past to the present, and strong constraints on exactly the manner in which you can build up this belief-state dynamic. So it would be really cool to have a version of comp mech that somehow takes into account the mathematical constraints that the attention head puts on your ability to bring information from the past to the present, which we very much don't have.

Although Adam has done some preliminary experiments where you look at how this fractal we were talking about is built up through the layers, and from the different parts: from attention, from MLPs. And it's surprisingly intuitive and beautiful, so that's encouraging, at least.
For instance, take the fractal example. The Mess3 process has a vocabulary size of three, so it only speaks in As, Bs and Cs: just strings of As, Bs and Cs. You put a string through the embedding, and you see just three points, three points in a triangle, one for each of the tokens. That makes sense, and that's the first thing in the residual stream. Then you go into the first attention mechanism, and what you see happen there is that it spreads the points out into a filled-in triangle. When you add that back into the residual stream, you're adding the triangle to these three points in a triangle, and what you get is three triangles arranged in the shape of a triangle, and so on and so forth. Then it actually goes into the MLP, and you can see it stretch out these three stacked triangles to look more like the simplex geometry, and every time you go through another layer it adds more holes and more triangular structure. So you very nicely get an intuitive feeling for how every part of the transformer mechanistically puts in extra structure in order to get to the belief-state geometry.

Yeah, and I think this is one example of how we hope that computational mechanics can be something like a language for bridging between mechanism and behavior. A lot of mech interp has been great, but it's pretty low-level; there are just results here and there. So one thing computational mechanics can do is create a language to tie this together, to unify the efforts, but also to say where this is all leading: hopefully there's a bit more of a bridge between mechanism and behavior.

Sure. So there's this question about what we learned, and we've talked about that a little bit. We've learned that these fractals are represented: you can find this mixed-state presentation in the residual stream. Another result that comes up in the paper is that it's linear. Even in retrospect, are you surprised by that, or do you think you have a good story for why it should be linear?

I don't have a good story, and in fact computational mechanics, as far as I know, does not give you one. It just says that the mixed-state presentation should be there; it doesn't say how or where. When we decided to do these experiments, based on this theoretically informed idea, it was: somehow this geometry should be in there, these belief states should be in there, even if the geometry ends up stretched out or something. So there's still definitely something to explain. I've seen titles and abstracts of papers that I feel might help us understand why things are linearly represented, but I don't yet understand that well enough.

Gotcha.

It's like too good to be true, but hey, good for us.

Maybe this just has a similar answer, but another result that comes out is that, at least for one of these processes you train a transformer on, there's not one layer of the residual stream that you can probe to get this nice simplex; you have to train a probe on the concatenation of all the layers. Is there a story there? Would one layer have been too good to be true, or is there something deeper?

Actually, in that case there is a deep reason that computational mechanics gives for that finding, although I should say we were not looking for it. That's one of the nice things about having a theoretical foundation for the way you go about running your experiments: sometimes phenomena jump out at you, and then in hindsight you realize that the theory explains them.
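A sketch of the probing methodology referenced in this exchange: linearly regress from residual-stream activations onto the ground-truth belief states, and check how well the projection recovers the belief geometry. The activations below are synthetic stand-ins; in the paper they come from a trained transformer, concatenated across layers for the process discussed next.

```python
# Linear probe sketch: fit an affine map from activations to the belief
# simplex and measure reconstruction error.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n, d_model, S = 5000, 64, 3

beliefs = rng.dirichlet(np.ones(S), size=n)   # ground-truth belief states
# Stand-in "activations": a hidden linear embedding of beliefs plus noise.
W_true = rng.normal(size=(S, d_model))
acts = beliefs @ W_true + 0.05 * rng.normal(size=(n, d_model))

probe = LinearRegression().fit(acts, beliefs)  # affine map: acts -> simplex
pred = probe.predict(acts)

mse = np.mean((pred - beliefs) ** 2)
print(f"probe MSE: {mse:.5f}")  # low MSE = belief geometry linearly present
```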
It has five states in its minimal generative machine, which means the belief-state geometry lives in a four-dimensional space, and unlike the fractal case, you actually get 36 distinct belief states. It's quite a beautiful structure; maybe there's a way to splash it on the screen, though we have a low budget. It looks a bit like an X-ray crystallography pattern. So it's aesthetically pleasing, but importantly, several of those belief states, even though they are distinct, give literally the same next-token prediction.

So what does that mean for the transformer? Think about the last layer and how it actually gets turned into a prediction for the next token: you just read off the residual stream with the unembedding and convert it into a probability distribution. That means that at the last layer you don't need to represent distinctions that don't show up in next-token predictions. And not only relative to the local task of next-token prediction: there's no attention mechanism after that layer, so even if you wanted to represent those distinctions there, they could never be used, because they can never interact with, or send information into, any future token position.

But comp mech still says the structure, these distinctions, should be there. They don't need to be in the last layer, but they do need to be somewhere. And indeed, you see them in earlier layers. So that's the explanation for why it's not in the last layer. I think this also hints at how the transformer uses the past construction of the belief states: if they're built earlier on, they can be leveraged across the context a little better. So there's strong interplay here between theory and experiment, and still a lot for us to figure out.

Sure. Sorry, just a very basic conjecture I have after hearing that: the parts relevant for next-token prediction sit in the final layer, the parts relevant for the token after that sit in the second-to-last layer, and so on?

Well, I don't think it's quite that clean, because if it were, you'd only be able to model Markov-order-n processes, where n is the number of layers. So it won't be quite that clean, but it is something like: at the end, you really only need the next-token distribution. And then the theory says that somewhere in the model, somewhere in the residual stream, there should be representations of the full distribution over the future. It does seem that, for a later context position to look back and make use of that, it would serve a purpose for those representations to come about early on. It's still not clear to me why the model would bother shedding them; as far as I'm aware they could persist, but there's no pressure for them to persist. I'd like to understand that better.
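(Editor's sketch: one way to set up the probe discussed above. The finding is that the belief simplex for this process is recovered by a linear map from the residual stream taken across layers rather than from any single layer. The array names and shapes here are hypothetical, and scikit-learn's ordinary least squares is a stand-in, not necessarily the regression the authors used.)

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_belief_probe(layer_acts, beliefs):
    """layer_acts: list of (n_tokens, d_model) residual-stream activations,
    one array per layer; beliefs: (n_tokens, n_states) ground-truth belief
    states obtained by Bayesian filtering on the data-generating process."""
    X = np.concatenate(layer_acts, axis=-1)   # probe all layers jointly
    probe = LinearRegression().fit(X, beliefs)
    mse = float(np.mean((probe.predict(X) - beliefs) ** 2))
    return probe, mse
```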
Sorry to get hung up on fractals, but I have another fractal question. This random-random-XOR belief-state geometry didn't form a fractal. Is that just because one in every three bits is deterministic? Does that snap you onto one of these relatively well-defined states? Is that what's going on?

I wouldn't characterize it that way. One way to anticipate whether you're going to get a fractal structure is the thing I alluded to earlier: whether the minimal generative mechanism is unifilar or not, that is, whether it has deterministic transitions between hidden states. For the random-random-XOR process, the minimal generative mechanism is a five-state thing, and it's also unifilar, so those five states are themselves the recurrent belief states. You have these 36 belief states in general, and as you resolve the different kinds of uncertainty you'll eventually land on those five states. That's kind of simple.

Whereas even a simple generative mechanism, say a two-state hidden Markov model that generates the process, can be non-unifilar, in the sense that knowing the current state of the machine and the next token still doesn't tell you exactly where you end up. Then the number of induced probability distributions can be infinite, and generically is. In that setting you're folding your beliefs over and over through Bayesian updating. So in general: if the minimal generative model is non-unifilar, Bayesian folding will give you a fractal, and if the minimal generative structure is unifilar, it won't. That's also nice: we can expect it in advance.
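(Editor's sketch: the unifilarity test just stated, in the same hypothetical matrix convention as the earlier sketches. Unifilar means the current state plus the next token determines the next state; non-unifilar presentations are the ones whose Bayesian folding yields fractals.)

```python
import numpy as np

def is_unifilar(T, tol=1e-12):
    """T: list of token-labeled matrices with
    T_x[i, j] = P(emit x, go to state j | state i).
    Unifilar iff each (state, token) pair reaches at most one next state."""
    return all(
        (np.asarray(T_x) > tol).sum(axis=1).max() <= 1
        for T_x in T
    )
```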
Sure, cool. Again on fractals: in the paper you basically say, here's the fractal we expect to see, here's the fractal we get, and it's pretty close in mean squared error, way closer than for a randomly initialized network or at the start of training. My understanding is you measure that in mean-squared-error terms. A similar question I could ask is: suppose that instead of the actual mixed-state presentation of the ideal process, I only had this thing I got from the network. How good a job would I do at prediction? Is that a question with a nice answer, and do you know what it is? If you do, please tell me.

It's a little unclear to me what you're asking. You might be asking: can we do something with the activations if we don't know the ground truth of the generative structure?

Or rather: instead of measuring the distance between activations and ground truth by the mean squared error of this probe, you could say, okay, imagine the probed thing is the mixed-state presentation I have; how good a job will I do at predicting the actual process? The thing you've probed for seems like a lossy version of the underlying fractal. How much does that loss show up in prediction, as opposed to reconstruction error?

One thing you should be able to do, and this is not an infinitely general answer to your question, is the following. Think of the fractal that is the ground-truth mixed-state presentation, and then think of coarse-graining that fractal by a certain amount. Sarah Marzen has work going through the math of what that implies for predictive accuracy, and intuitively I think it makes sense. In general it should be possible to work out, for any particular coarse-graining: if you take some volume in the simplex and say, "I know I'm in this region, but I don't know exactly which belief state I'm in," you should be able to calculate the consequences of that for next-token predictive accuracy, or cross-entropy loss. What that looks like, in terms of the visual of the fractal, is exactly the fractal getting fuzzier. Nearby points in the simplex do make similar predictions, so there's a sense in which, if it looks fuzzy, you can almost read off visually how that will turn out for prediction, and we can quantify it.

Right. And am I right that this is because the simplex consists of distributions over hidden states, so being close by in the simplex means being close in your whole distribution over the future, rather than just in next-token predictions?

Right. Think about how you calculate the probability of future tokens from a distribution over the model's hidden states: it's basically linear in that distribution. So if you reweight slightly, there's a continuity there; you can talk about closeness and a type of induced distance, though it's not exactly the natural Euclidean one. Gotcha.
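(Editor's sketch: the linearity appealed to in that answer, again in the hypothetical matrix convention used above. The next-token distribution is a linear function of the belief vector, which is why nearby points in the simplex make similar predictions.)

```python
import numpy as np

def next_token_dist(eta, T):
    """P(x | belief eta) = sum_j (eta @ T_x)_j, which is linear in eta,
    so beliefs that are close in the simplex yield close predictions."""
    probs = np.array([(eta @ T_x).sum() for T_x in T])
    return probs / probs.sum()   # guard against numerical drift
```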
I guess the next thing I want to ask is how to generalize from these results. My understanding is that both of the processes could be represented by hidden Markov models with five or fewer states. In general, of course, processes can have many states in their minimal hidden Markov model. How scared should I be that things are really different when there are a hundred, or infinitely many, states to keep track of?

You should be excited, not scared. We can go to the full thing: what is the minimal generative structure for all of the data on the internet? That's a very large thing, way larger than the number of dimensions in the residual stream of even the biggest model; it's not even close. So the natural bottlenecks in even the strongest artificial intelligence systems today do not have the capacity to represent the full simplex structure, at least not in a straightforward way. Which means there's at least one extra question: some compression must be going on, so what is the nature of that compression? That's something we're very excited to look at now.

I think the main thing I take from our original work is this. Everything about finding fractals in transformers is true, and those are good paths to go down, but the more general point, and the thing that gets us excited, is that it gives us a handle. It gives us evidence that this framework for thinking about the relationship between training-data structure, model internals, and model behavior works. Now we can ask questions like: what is the best way to compress into a residual stream smaller than the one you would really need? We don't have space to represent the full structure, so how do we do it? We don't have an answer at the moment, but it's one of the things we're most excited to get a handle on. And at least the experiments become clear once you have the theoretical framework: now we know where the simple picture should break down, and we can look there. So we're very excited to see how things start to break down, and how the theory generalizes as we move toward frontier models.

Yeah. You've maybe already touched on this, but what's next, in terms of understanding these mixed-state presentations in transformers, or generally applying computational mechanics to neural nets? Which directions are you most excited by?

There's an incredible amount we're excited by. We currently have this team of the two of us, plus collaborators in academia and more generally in the AI safety and interpretability community, and we're really excited to create more opportunities for collaboration and to scale up, because there's a bit too much to do. But let's get specific. There are some obvious next projects. The AI safety world is a little obsessed with SAEs, sparse autoencoders, and with features. But what even are they? How do we know if we're doing a good job? There's no ground truth to it. So one thing we're excited to do is benchmark in the cases where we know what the machine is trying to do, where we know these are the optimal belief states: how do belief states correspond to features? Adam alluded to this earlier, but one working hypothesis could be that features are concise, minimal descriptors that are useful for pointing at belief states. So let's test that. We're doing a collaboration right now where we're trying to make this comparison. I think it will be pretty empirically driven for now, but it should be broadly useful for knowing what we're doing with features.

There's also a lot I'm doing right now on understanding internal representations more generally. One piece: most of what we've done so far concerns what you expect once the model has been well trained, so that it should have this structure at the end. But how should that structure emerge over training? There are a few ways in which we have a handle on that, and it's something I'm very excited about and actively working on. And we have a lot more. Do you have some favorites you want to talk about?

Yeah, though it's actually kind of a problem. For instance, just last week there was this Simons Institute workshop on AI and cognitive psychology, and I watched a few of the talks. This is something that increasingly happens to me, not to get too personal: every time I see a talk about AI, my reaction is often, "ah, I think we can get a handle on this using our framework." So I just have a list of papers that keeps growing of things I'd like to explain from a comp-mech point of view.
Just to name a few of them: there's always in-context learning, which I think is a big one. You can start to get a handle on it using non-ergodic processes. What that means is: in the examples we've talked about in our public-facing work so far, the generative models are all single parts; there's one hidden Markov model, and you move recurrently among all its states. But you can also imagine taking a bunch of those and putting them side by side. Then, in addition to the beliefs having to see more data to figure out which state of the process you're in, you also have to figure out which process you're in. So there's an extra, meta-synchronization process going on. And that's natural for the type of data you'd actually be training on, because language has this whole hierarchical structure: which language, which genre, which mood, with many different recurrent components. You can imagine that dynamic being quite different for sets of processes that overlap more or less in their internal structures, and that might have implications for which kinds of abstract structures the model uses and takes advantage of in in-context learning. So that's one thing we're quite excited about.

Another one, related to that, is talking more generally about the capabilities of these systems and getting a handle on what a capability even is: what are the different types of computational structure a model can take? One of the things I really love about computational mechanics is that it gives you a handle on what it even means to talk about structure in these systems: structure in training data, structure in model internals, and structure in behavior. For instance, and this probably comes from my neuroscience background: what do we do in neuroscience when we want to study some cognitive capacity, like abstraction or transfer learning or generalization? We run experiments on model organisms, like rats. We try to find the simplest toy example of abstraction, something that has the flavor of abstraction in its minimal form, that we can train a rat to do. Then we go into the rat's brain while it's doing this abstraction task and do the neuroscience equivalent of mechanistic interpretability on it. That was my life for a decade.

So: can we find a form of an abstraction task in the form of these HMMs? I think I have an idea for one. Say you have two generative structures next to each other, and they have the same hidden structure, except they differ in the vocabularies they speak. One speaks in A's and B's, and one speaks in X's and Y's. When we look at these as humans, it's obvious that they have the same hidden structure, the same abstract structure; they just speak two different vocabularies. What would it mean for a transformer to understand that these two processes have the same abstract structure?
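(Editor's sketch: the two-vocabulary setup just posed, with a deliberately tiny, hypothetical two-state process standing in for "the same hidden structure". relabel swaps the vocabulary; side_by_side forms the non-ergodic union mentioned earlier, under which a learner must also infer which component process it is in.)

```python
import numpy as np

p = 0.3   # hypothetical parameter for a small two-state example
T_ab = {  # token-labeled matrices: T[x][i, j] = P(emit x, go to j | state i)
    'A': np.array([[0.0, 1 - p], [0.0, 0.0]]),
    'B': np.array([[p,   0.0  ], [1.0, 0.0]]),
}

def relabel(T, mapping):
    """Same hidden structure, new vocabulary."""
    return {mapping[x]: T_x for x, T_x in T.items()}

def side_by_side(T1, T2):
    """Non-ergodic union: a block-diagonal process, so beliefs must also
    resolve *which* component process is generating the stream."""
    n1 = next(iter(T1.values())).shape[0]
    n2 = next(iter(T2.values())).shape[0]
    out = {}
    for x in set(T1) | set(T2):
        top = T1.get(x, np.zeros((n1, n1)))
        bot = T2.get(x, np.zeros((n2, n2)))
        out[x] = np.block([[top, np.zeros((n1, n2))],
                           [np.zeros((n2, n1)), bot]])
    return out

T_xy = relabel(T_ab, {'A': 'X', 'B': 'Y'})
union = side_by_side(T_ab, T_xy)   # four hidden states, vocabulary {A, B, X, Y}
```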
If you train a transformer on this set of tasks, one thing it could do is learn them both optimally but not understand that they have the same structure. Another thing it could do is learn them both optimally and understand that they have the same structure. Behaviorally, you could tell the difference between those two cases by testing on a held-out prediction task that combines histories of A's, B's, X's and Y's together, holding the abstract structure constant. If the model can still do the optimal prediction task when combining past histories from the two different vocabularies, which it has never seen together before, then you can say it has abstracted.

In addition, because we have a theory of how the model internals relate to behavior, we can make a prediction about what would underlie that abstract structure. For each of the generative structures there should be a simplex geometry represented in the residual stream of the transformer, and the prediction would be that if the network is able to abstract, to understand this shared abstract structure, then those simplex structures will align in space, whereas if not, they might lie in orthogonal subspaces. And now you start to think about the relationship with compression: if the residual stream doesn't have enough space to represent them in orthogonal subspaces, maybe it has to put them in overlapping subspaces. You start to see that there might be a general story here. As we were saying before, in the realistic case you don't have enough space to represent the full structure, so you have to compress, and because of that the network might be forced to take what would otherwise be separate simplex structures and overlap them. That might be the thing that gives rise to out-of-distribution generalization and abstraction and so on. So that's another thing I'm quite excited about.

Sure. I'd like to move on a bit to the story of how you two got into this, if that's okay. Maybe I'll start with Adam. My understanding is that you have a neuroscience background, and based on what I know about neuroscience, it's not predominantly computational mechanics, and it's also not predominantly saving the world from AI doom. How did that happen for you?

Just last year I was toward the end of a postdoc in experimental neuroscience. Like I was saying before, I was training rats on tasks, studying the role of expectations and predictions in sensory processing in the visual cortex of these rats. I remember running experiments on these rats about a week after ChatGPT came out. I was talking to ChatGPT and was really shocked, I think that's the right word, by its behavior, and I had a moment where all of my beliefs and intuitions about what must underlie intelligent behavior were proven wrong. I had all these intuitions from my neuroscience background about recurrence and complicated dendritic structures and all kinds of details; I thought there was some secret sauce in neuroscience that had to underlie intelligent behavior. Meanwhile, I was studying these rats on very simple tasks that were basically not much more than left versus right; that's simplifying a little, but they're not the complicated linguistic tasks that humans perform.
And then there's ChatGPT. After learning about the transformer architecture: these things are feedforward, they're weirdly linear. Not totally linear, obviously, and that's important, but much more linear in structure. To whittle it down, I thought: the architecture is not interesting enough to give rise to these interesting dynamics, given what I believed about the relationship between mechanism and intelligent behavior. That was shocking. And then I realized that the existence of this system pushed on my intuitions about intelligence more than basically any paper I had read in neuroscience in a long while. So it was actually emotionally a little difficult, I have to say. At that point I started thinking about the transition: if my intuitions are wrong about this, what does that mean for the future? GPT-4 came out not long after, and that was another "whoa": apparently it wasn't just this one thing; apparently it can get better, fast.

So that's when I started moving. And there's always some tension here: on one hand you realize the safety implications for the future, and on the other hand it's also just genuinely interesting, intellectually super cool. I feel that tension quite often, and I'm pretty sure I'm not the only one in the AI safety community who does. But even from my academic interests: I'm interested in intelligence, in the mechanisms that underlie it, in the way networks of interacting nodes can give rise to interesting behavior. So that's how I found myself in AI safety.

I guess the other relevant thing is that a long time ago, eight years or so, I randomly ran into comp mech as a neuroscientist, just reading one of the papers. It struck me, not that I understood it; I didn't have, and still don't really have, the mathematical acumen to grok all of its complexities. But something about the framework captured exactly what I was interested in getting at in neuroscience, which was the relationship between a dynamical process and the computation it performs. For a while I would annoy all my neuroscience labmates about comp mech, and I probably sounded like a crazy person, but I was never able to figure out how to apply it in any useful way. So I'm super excited now to be working on these neural networks, applying it and actually getting somewhere with it; it's an amazing feeling. I'm very lucky to have met Paul and the other comp-mech folks who are helping me with this.

Yeah. Maybe first of all: how did you get interested in computational mechanics to begin with?

That goes back a long time; I don't know if I can remember it all. So I have a background in physics. I've done theoretical physics; I also did a master's in electrical and computer engineering, but then went back to theoretical physics for my PhD. I was interested in emergent behavior, complexity, chaos, all this stuff. I mean, who isn't?
I think early on, let's say 2005, I had Jim Crutchfield as an undergrad teacher at UC Davis, and I just thought the stuff he did was super cool. He does computational mechanics; he's been kind of the ringleader, Jim and colleagues building computational mechanics up. So from that point I knew what Jim did and had some idea of computational mechanics. I was on a trip with my buddy Seth, and he asked, "what would you do if you could do anything?" And I said, "oh, I'd probably do this computational mechanics stuff, but I don't know if I'm smart enough," or something. But it turns out that if you believe in yourself a little, you can do lots of things you're excited about. Long story short, I ended up doing my PhD in Jim's group. So I did a lot of computational mechanics during my PhD and a bit afterwards, but within physics I've worked on a mixed bag of things: quantum information, the ultimate limits of prediction and learning, which is where a lot of the comp mech comes in, and non-equilibrium thermodynamics. All of it was trying to understand what reality is like. My interest in comp mech came from the fact that it asks, very generally: what are the limits of prediction and learning? That seemed fascinating to me.

Maybe you have the follow-up question of how I got into AI safety. Truly a master of sequence prediction. I felt very unsettled, a year or more ago, by what I was seeing in AI: generative images, ChatGPT, this stuff. But I also didn't think I had anything to contribute. Actually, when we talked about neural networks in Jim's group, even back in 2016 or so, the attitude was: this stuff is so unprincipled; they're just cramming data into this architecture; they don't know what they're doing; this is ridiculous. Let's do the principled thing instead; let's do comp mech. So somehow that stuck in my mind: neural nets are the unprincipled thing, and I do the principled thing. But now these unprincipled things are becoming powerful. Well, what do we do?

And then, kind of out of the blue, this guy Alexander Gietelink Oldenziel reached out to me, saying comp mech is the answer to AI safety. And I thought, "no; I know comp mech and you don't; you sound like a crazy person." But I did already care about AI safety, so I heard him out, and I explained to him why it wasn't going to be useful: for example, that these architectures, as far as I understood them, were feedforward, while computational mechanics was about dynamics. But we kept talking, and he did convince me that there's something about thinking about the dynamics through the context window that is really important. I don't think Alexander quite made the case for how it would be important, but somehow I realized it would be. At that point, almost a year ago I guess, I started talking to Adam, and Adam started helping me understand more of the details of transformer architectures.
And we were talking with Sarah Marzen and others, and we started to realize: actually, this will help us understand model internals that people just aren't thinking about, and behaviors people just aren't thinking about. When I realized there was actually something here, it felt research-wise super exciting, but also like a moral imperative: I care about this, and I need to do something about it. So I started reaching out through whatever connections I had, people at Google and so on, saying, "hey, I feel like I have something really important to contribute here; how should I go about this?" And the broad interpretability community, both in industry and outside, was very helpful in directing me. Someone pointed me to MATS, which, I don't know, maybe it's too junior for some people, but for me it was great. I was a MATS scholar at the beginning of this year, and it was great just being surrounded by people who really understood the ML side, because that hasn't been my strength, though I've grokked a lot of it now. I'm really grateful that there's a community that cares about this stuff; so many people, PIBBSS and everyone, have been asking how they can help, and that's been really cool.

So Adam and I had been talking for a while and working on this stuff, and earlier in the year we were saying: this is going really well; it seems like we have a lot to contribute; at some point we should maybe start a research organization around this. We were open to the idea, and then things fell into place where it was like, oh, I guess we can do that now.

I'll second what Paul said about how supportive the broad AI safety community has been. For the last six months I've been a PIBBSS affiliate, another AI safety organization, and they've been really helpful to both of us in supporting us in starting Simplex and the research and everything.

Sure. So I think I understand your individual journeys. One thing I still don't quite understand is: how did you two meet and sync up? Is this Alexander again?

Yeah, it is Alexander. This is the second episode where Alexander cold-emailing people has played a crucial role in someone's story. Some wizard in the background. Well, I met Alexander probably almost two years ago now, at some rationality event that I actually wasn't even invited to; I was just tagging along. We started talking about rats and minimum-description-length things, and I said, "that sounds like a less principled version of comp mech," and then I started ranting about comp mech, and Alexander and I kept talking about it, trying to think of whether there was a way to apply it to neural networks. I had exactly the same reaction: I don't know, these aren't recurrent, so I don't really know how it would be relevant to transformers. A little while after that, Alexander reached out to Paul. And then I was having a very frustrating day with the rats, I think, and thought, "ah, I really want to do comp mech," so I messaged Sarah Marzen.
We had a meeting, and it was really great, and eventually all four of us got together and started trying to think of ways to apply comp mech to neural networks.

Yeah, so Adam and I were doing this on the side; we were still both in academia. I was actually about to take on something that would have led to a tenured role at a university in Spain; we were looking forward to sangria and all that. But we were starting to do this research on the side, and it was starting to go well, and I realized: okay, I need to dedicate myself to this. So, for me, my first real engagement with applying comp mech to neural nets has been in collaboration with Adam, and it's been a really complementary set of skill sets coming together. I think we have a lot of shared priorities and vision. It's been super fun.

Awesome. So I have a question about the safety application. My understanding is that the story is basically: get a principled understanding of what neural nets might be doing, get tools for a more principled way of saying what's going on in there, and hopefully use that to feed into making them safe, designing them the way we want them to be. Is that right, or is there a different plan?

That's an aspect of it, for sure, but for me another important aspect is enabling better decision-making. People talk a lot about "what's your p(doom)," the probability of this or that, and I feel like we understand these things so little that we don't even know what the support of the distribution is: probability over what? We don't know what's going to happen; we don't know the mechanisms; we don't even have a shared language. There isn't science being done the way I think it needs to be done, or at the scale it needs to be done at. I think computational mechanics can, maybe optimistically, help figure out what we're most confused about and how to go about resolving it, and build up some infrastructure; some of that will be low-level, some high-level, and probably going back and forth between the two. Understanding is very important in general for making good decisions. One payoff is that if you understand well, you can design new interventions. The other is that you might come to understand, with a more mechanistic account of why and how things could go well or badly, that things will go poorly in any case; that second part I think is really important too.

There are a few other things related to that. To take a very concrete example: why don't we have rigorous benchmarks for SAEs? Why don't we have a dataset, or a transformer, or a combination of both, where we know exactly what the correct features are, so we can tell whether a specific form of SAE, with a specific architecture and set of losses, works better or worse than the alternatives? I think it's because we lack an understanding of a lot of the basic concepts with which we talk about safety-relevant things; in this particular case, features. And the hope is that with a better
understanding, we can do things like build benchmarks, which is a pretty simple thing. I think this also carries over to evals. If we had a better understanding of what a capability fundamentally is, and what the different types of capabilities are, we could hopefully build better evals, and we could relate any particular eval that exists right now to what's going on inside the transformer. It also relates to out-of-distribution behavior: if we can relate the internal structure of a transformer to everything it can do as it emits tokens, then we'll know the space of out-of-distribution behaviors. Nothing comes for free; it's an enormous amount of work, and it's not a simple step from the theoretical understanding we have right now to applications to the largest models that exist. That's not the claim. But at least we have some footing here.

And one more thing I'd add: there's a lot of power in comp mech in that the theory is model-agnostic. Most of the results aren't about neural nets at all, which is partly why I was surprised it applied so well, and it's not specific to transformer architectures, whereas some mechanistic interpretability work is architecture-dependent. I think that's powerful: for any particular architecture you can back out the implications, and as architectures change, this understanding should adapt with them. I do think that's important.

So that's one story about how computational mechanics could be useful, and one obvious way to push it forward is to work on computational mechanics: join Simplex, or look at their list of projects and work on them. I wonder whether there are synergies with other approaches in AI safety, approaches you think are good complements to comp mech?

These connections haven't been made formally, but there's an obvious connection to mech interp in general, which is admittedly a big field: for any mechanistic understanding one gets from the standard mech-interp toolset, things like attention-head patterns or causal interventions of various forms, one can now ask how these relate to the task of carrying out the mixed-state presentation. I'm also hoping there are synergies with the more dev-interp point of view. There's already a bunch of evidence that that will work out: we can see this belief-state geometry forming over training, and Paul has theoretical work on the learning process of a parameterized model from the comp-mech point of view. So hopefully we can also connect this to the training process itself, which is the dev-interp, maybe SLT-flavored, way of looking at things. And I mentioned evals before. How many different AI safety approaches are there? Unclear to me.

So if I'm thinking about how computational mechanics applies to safety: we've talked a bit about the theory, and it seems like it's mostly a theory of doing good inference
about stochastically generated processes. At least my impression is: someone else is generating this process, I'm inferring it, and computational mechanics tells me what's going on there, what's going on in my head. But if I think about AI, one of the crucial features is that it involves things acting in the world and changing the world: not just inference, but this whole loop of inference and action. Are there developed computational-mechanics ways of talking about that, or is it an open area? I think Adam mentioned, I forget the word, something about computational mechanics with two systems that influence each other. Is that enough? Do we need more?

I'd say there's some preliminary work in computational mechanics on not just what it takes to predict or to generate, but on putting those together in an agentic framework. Instead of epsilon machines there are epsilon transducers: what is the minimal memoryful structure for an input-output device? I think that's really relevant work, which can probably also inform looking at model internals and things like that. But there's a totally different level at which computational mechanics could be applied, which we're excited about but just don't have the bandwidth for right now: what to expect from interacting agents at a high level. There are a lot of ways to go about this; there's theory work that's been done with POMDPs, and there's work on these memoryful input-output agents. There's a lot to do, and I do feel it's important if there's to be any hope of understanding what the emergent behaviors of interacting agents would be.

Sounds hard. I'm curious about that. Things need to be developed a lot more, yeah.

Just to add to that: some of the conceptual things it would be exciting to get a handle on from that point of view are: what does it mean for an agent to understand some structure in the world, in terms of its internal states? And then, a level beyond that: what does it mean for an agent to have a model of itself inside itself, and what is the computational structure that subserves that? These are pretty fundamental questions. The work hasn't been done yet, but one could imagine using and extending the framework of transducers to get at them, which would be super exciting.

Yeah. Listeners have now heard some of the exciting promise of computational mechanics: cool things it can do, cool work happening around it, potential threads for using it to understand agency. If people want to follow your research and your writing, how should they go about that?

The main way is our website, simplexaisafety.com. You can contact us through there too; we're very excited to collaborate and work with other people, so if anyone is interested, they should feel free to reach out.
And I'd say that if you feel a natural connection, if you're feeling inspired to help our mission go well, or you see a natural collaboration, please do reach out, because we feel this work is really important and we're a bit bottlenecked. Working with good people and growing a vision together would be really nice.

Great. Well, thanks very much for coming on today; it was great talking.

Yeah, thanks so much.

This episode was edited by Jack Garrett, and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. Filming occurred at FAR Labs. Financial support for this episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. I'd also like to thank Joseph Miller for helping me set up the audio equipment I used to record this episode. To read a transcript of this episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.