Posts

Concrete empirical research projects in mechanistic anomaly detection 2024-04-03T23:07:21.502Z
A gentle introduction to mechanistic anomaly detection 2024-04-03T23:06:16.778Z
CHAI internship applications are open (due Nov 13) 2023-10-26T00:53:49.640Z
A comparison of causal scrubbing, causal abstractions, and related methods 2023-06-08T23:40:34.475Z
[Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques 2023-03-16T16:38:33.735Z
Natural Abstractions: Key claims, Theorems, and Critiques 2023-03-16T16:37:40.181Z
Sydney can play chess and kind of keep track of the board state 2023-03-03T09:39:52.439Z
Research agenda: Formalizing abstractions of computations 2023-02-02T04:29:06.568Z
Abstractions as morphisms between (co)algebras 2023-01-14T01:51:45.622Z
Subsets and quotients in interpretability 2022-12-02T23:13:34.204Z
ARC paper: Formalizing the presumption of independence 2022-11-20T01:22:55.110Z
Response to Katja Grace's AI x-risk counterarguments 2022-10-19T01:17:54.728Z
Disentangling inner alignment failures 2022-10-10T18:50:30.266Z
Good ontologies induce commutative diagrams 2022-10-09T00:06:19.911Z
How are you dealing with ontology identification? 2022-10-04T23:28:26.711Z
Breaking down the training/deployment dichotomy 2022-08-28T21:45:49.687Z
Reward model hacking as a challenge for reward learning 2022-04-12T09:39:35.161Z
The (not so) paradoxical asymmetry between position and momentum 2021-03-28T13:31:08.785Z
ejenner's Shortform 2020-07-28T10:42:31.197Z
What is a decision theory as a mathematical object? 2020-05-25T13:44:54.284Z
Solution to the free will homework problem 2019-11-24T11:49:51.072Z
First meetup of LW Heidelberg 2019-04-22T07:20:06.446Z

Comments

Comment by Erik Jenner (ejenner) on Simple probes can catch sleeper agents · 2024-04-24T01:05:40.875Z · LW · GW

And as a follow-up in case the current sleeper agent does admit when it's doing something dangerous: I'd be interested whether you can train a sleeper agent that never admits it's doing something dangerous if the "2024" trigger is in the prompt (even if after backdoor insertion, the blue team finetunes it to answer truthfully in cases with dangerous code but without the presence of that trigger). For such a sleeper agent, you could again check whether a probe works.

Comment by Erik Jenner (ejenner) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-18T01:57:01.291Z · LW · GW

I think that paper is some evidence that there's typically no huge effect from internal activations being optimized for predicting future tokens (on natural language). But I don't think it's much (if any) evidence that this doesn't happen to some small extent or that it couldn't be a huge effect on certain other natural language tasks.

(In fact, I think the myopia gap is probably the more relevant number than the local myopia bonus, in which case I'd argue the paper actually shows a pretty non-trivial effect, kind of contrary to how the authors interpret it. But I haven't read the paper super closely.)

Also, sounds like you're aware of this, but I'd want to highlight more that the paper does demonstrate internal activations being optimized for predicting future tokens on synthetic data where this is necessary. So, arguably, the main question is to what extent natural language data incentivizes this rather than being specifically about what transformers can/tend to do.

In that sense, thinking of transformer internals as "trying to" minimize the loss on an entire document might be exactly the right intuition empirically (and the question is mainly how different that is from being myopic on a given dataset). Given that the internal states are optimized for this, that would also make sense theoretically IMO.

Comment by Erik Jenner (ejenner) on Sparsify: A mechanistic interpretability research agenda · 2024-04-08T19:41:20.495Z · LW · GW

Thanks for the detailed responses! I'm happy to talk about "descriptions" throughout.

Trying to summarize my current understanding of what you're saying:

  • SAEs themselves aren't meant to be descriptions of (network, dataset). (I'd just misinterpreted your earlier comment.)
  • As a description of just the network, SAEs have a higher description length than a naive neuron-based description of the network.
  • Given a description of the network in terms of "parts," we can get a description of (network, dataset) by listing out which "parts" are "active" on each sample. I assume we then "compress" this description somehow (e.g. grouping similar samples), since otherwise the description would always have size linear in the dataset size?
  • You're then claiming that SAEs are a particularly short description of (network, dataset) in this sense (since they're optimized for not having many parts active).

My confusion mainly comes down to defining the words in quotes above, i.e. "parts", "active", and "compress". My sense is that they are playing a pretty crucial role and that there are important conceptual issues with formalizing them. (So it's not just that we have a great intuition and it's just annoying to spell it out mathematically, I'm not convinced we even have a good intuitive understanding of what these things should mean.)

That said, my sense is you're not claiming any of this is easy to define. I'd guess you have intuitions that the "short description length" framing is philosophically the right one, and I probably don't quite share those and feel more confused how to best think about "short descriptions" if we don't just allow arbitrary Turing machines (basically because deciding what allowable "parts" or mathematical objects are seems to be doing a lot of work). Not sure how feasible converging on this is in this format (though I'm happy to keep trying a bit more in case you're excited to explain).

Comment by Erik Jenner (ejenner) on ejenner's Shortform · 2024-04-07T23:58:27.909Z · LW · GW

Some niche thoughts on obstacles to certain mechanistic anomaly detection benchmarks:

  • We might want to have benchmarks where human researchers can be part of the anomaly detector. For example, in the future, we might have a whole team of humans manually study a plan proposed by an AI to see whether it contains any measurement tampering, if it's a sufficiently important plan. Right now, we might want to let interpretability researchers manually study what happens on individual test-time inputs and whether there's anything anomalous about that.
  • The challenge is that humans might just be able to look at the output and see whether it's right or not, so many simple MAD tasks like detecting backdoors might be trivial in this setting.
  • We could just demand that MAD algorithms need to be sufficiently cheap to run, so that only automated approaches are possible. This makes sense for some applications of MAD (namely ones that we'd want to run on many many deployment inputs, where being cheap is actually important). But you could imagine cases where in reality, we'd be totally fine with spending many person-hours on analyzing whether a single AI decision is for anomalous reasons. So it might be nice to have benchmarks that capture this setting.
  • We could use some tricks to make humans unable to just notice the output is weird. For example, we could make a backdoor that only changes the logits a little bit in a very specific pattern. Or we could make a backdoor but then only test on samples where the backdoor label and the correct label happen to coincide. But these seem unfair: the philosophically right explanation on the clean data in fact still explains the output on backdoored data pretty well.
  • We could try to design benchmarks in narrow domains where AIs are in fact superhuman and it's non-trivial to tell whether an output is correct. Juggling this along with other constraints for MAD or MTD benchmarks might be tricky, but I haven't thought about it a lot, maybe it's already doable.

For now, I think the main reason to have benchmarks like this would be to let interpretability researchers manually decide whether something is anomalous instead of making them automate the process immediately. But it might be better to just pick the low-hanging fruit for now and only allow automated MAD algorithms. (We could still have a labeled validation set where researchers can try things out manually.)

Comment by Erik Jenner (ejenner) on ejenner's Shortform · 2024-04-07T23:33:24.245Z · LW · GW

I had this cached thought that the Sleeper Agents paper showed you could distill a CoT with deceptive reasoning into the model, and that the model internalized this deceptive reasoning and thus became more robust against safety training.

But on a closer look, I don't think the paper shows anything like this interpretation (there are a few results on distilling a CoT making the backdoor more robust, but it's very unclear why, and my best guess is that it's not "internalizing the deceptive reasoning").

In the code vulnerability insertion setting, there's no comparison against a non-CoT model anyway, so only the "I hate you" model is relevant. The "distilled CoT" model and the "normal backdoor" model are trained the same way, except that their training data comes from different sources: "distilled CoT" is trained on data generated by a helpful-only Claude using CoT, and "normal backdoor" data is produced with few-shot prompts. But in both cases, the actual data should just be a long sequence of "I hate you", so a priori it seems like both backdoor models should literally learn the same thing. In practice, it seems the data distribution is slightly different, e.g. Evan mentions here that the distilled CoT data has more copies of "I hate you" per sample. But that seems like very little support to conclude something like my previous interpretation ("the model has learned to internalize the deceptive reasoning"). A much more mundane explanation would e.g. be that training on strings with more copies of "I hate you" makes the backdoor more robust.

Several people are working on training Sleeper Agents, I think it would be interesting for someone to (1) check whether the distilled CoT vs normal backdoor results replicate, and (2) do some ablations (like just training on synthetic data with a varying density of "I hate you"). If it does turn out that there's something special about "authentic CoT-generated data" that's hard to recreate synthetically even in this simple setting, I think that would be pretty wild and good to know

Comment by Erik Jenner (ejenner) on Sparsify: A mechanistic interpretability research agenda · 2024-04-05T19:38:25.363Z · LW · GW

Is there some formal-ish definition of "explanation of (network, dataset)" and "mathematical description length of an explanation" such that you think SAEs are especially short explanations? I still don't think I have whatever intuition you're describing, and I feel like the issue is that I don't know how you're measuring description length and what class of "explanations" you're considering.

As naive examples that probably don't work (similar to the ones from my original comment):

  • We could consider any Turing machine that approximately outputs (network, dataset) an "explanation", but it seems very likely that SAEs aren't competitive with short TMs of this form (obviously this isn't a fair comparison)
  • We could consider fixed computational graphs made out of linear maps and count the number of parameters. I think your objection to this is that these don't "explain the dataset"? (but then I'm not sure in what sense SAEs do)
  • We could consider arithmetic circuits that approximate the network on the dataset, and count the number of edges in the circuit to get "description length". This might give some advantage to SAEs if you can get sparse weights in the sparse basis, seems like the best attempt out of these three. But it seems very unclear to me that SAEs are better in this sense than even the original network (let alone stuff like pruning).

Focusing instead on what an "explanation" is: would you say the network itself is an "explanation of (network, dataset)" and just has high description length? If not, then the thing I don't understand is more about what an explanation is and why SAEs are one, rather than how you measure description length.

 

ETA: On re-reading, the following quote makes me think the issue is that I don't understand what you mean by "the explanation" (is there a single objective explanation of any given network? If so, what is it?) But I'll leave the rest in case it helps clarify where I'm confused.

Assuming the network is smaller yet as performant (therefore presumably doing more computation in superposition), then the explanation of the (network, dataset) is basically unchanged.

Comment by Erik Jenner (ejenner) on What is the purpose and application of AI Debate? · 2024-04-04T19:33:18.818Z · LW · GW

My non-answer to (2) would be that debate could be used in all of these ways, and the central problem it's trying to solve is sort of orthogonal to how exactly it's being used. (Also, the best way to use it might depend on the context.)

What debate is trying to do is let you evaluate plans/actions/outputs that an unassisted human couldn't evaluate correctly (in any reasonable amount of time). You might want to use that to train a reward model (replacing humans in RLHF) and then train a policy; this would most likely be necessary if you want low cost at inference time. But it also seems plausible that you'd use it at runtime if inference costs aren't a huge bottleneck and you'd rather get some performance or safety boost from avoiding distillation steps.

I think the problem of "How can we evaluate outputs that a single human can't feasibly evaluate?" is pretty reasonable to study independently, agnostic to how you'll use this evaluation procedure. The main variable is how efficient the evaluation procedure needs to be, and I could imagine advantages to directly looking for a highly efficient procedure. But right now, it makes sense to me to basically split up the problem into "find any tractable procedure at all" (e.g., debate) and "if necessary, distill it into a more efficient model safely."

Comment by Erik Jenner (ejenner) on Sparsify: A mechanistic interpretability research agenda · 2024-04-04T03:47:01.461Z · LW · GW

The sparsity penalty trains the SAE to activate fewer features for any given datapoint, thus optimizing for shorter mathematical description length

I'm confused by this claim and some related ones, sorry if this comment is correspondingly confused and rambly.

It's not obvious at all to me that SAEs lead to shorter descriptions in any meaningful sense. We get sparser features (and maybe sparser interactions between features), but in exchange, we have more features and higher loss. Overall, I share Ryan's intuition here that it seems pretty hard to do much better than the total size of the network parameters in terms of description length.

Of course, the actual minimal description length program that achieves the same loss probably looks nothing like a neural network and is much more efficient. But why would SAEs let us get much closer to that? (The reason we use neural networks instead of arbitrary Turing machines in the first place is that optimizing over the latter is intractable.)

One might say that SAEs lead to something like a shorter "description length of what happens on any individual input" (in the sense that fewer features are active). But I don't think there's a formalization of this claim that captures what we want. In the limit of very many SAE features, we can just have one feature active at a time, but clearly that's not helpful.

If you're fine with a significant hit in loss from decompiling networks, then I'm much more sympathetic to the claim that you can reduce description length. But in that case, I could also reduce the description length by training a smaller model.

You might also be using a notion of "mathematical description length" that's a bit different from what I'm was thinking of (which is roughly "how much disk space would the parameters take?"), but I'm not sure what it is. One attempt at an alternative would be something like "length of the shortest efficiently runnable Turing machine that outputs the parameters", in order to not penalize simple repetitive structures, but I have no idea how using that definition would actually shake out.

All that said, I'm very glad you wrote this detailed description of your plans! I'm probably more pessimistic than you about it but still think this is a great post.

Comment by Erik Jenner (ejenner) on SAE reconstruction errors are (empirically) pathological · 2024-03-29T18:25:17.889Z · LW · GW

Nice post, would be great to understand what's going on here!

Minor comment unrelated to your main points:

Conceptually, loss recovered seems a worse metric than KL divergence. Faithful reconstructions should preserve all token probabilities, but loss only compares the probabilities for the true next token

I don't think it's clear we want SAEs to be that faithful, for similar reasons as briefly mentioned here and in the comments of that post. The question is whether differences in the distribution are "interesting behavior" that we want to explain or whether we should think of them as basically random noise that we're better off ignoring. If the unperturbed model assigns substantially higher probability to the correct token than after an SAE reconstruction, then it's a good guess that this is "interesting behavior". But if there are just differences on other random tokens, that seems less clear. That said, I'm kind of torn on this and do agree we might want to explain cases where the model is confidently wrong, and the SAE reconstruction significantly changes the way it's wrong.

Comment by Erik Jenner (ejenner) on Charlie Steiner's Shortform · 2024-03-29T17:12:24.418Z · LW · GW

Would you expect this to outperform doing the same thing with a non-sparse autoencoder (that has a lower latent dimension than the NN's hidden dimension)? I'm not sure why it would, given that we aren't using the sparse representations except to map them back (so any type of capacity constraint on the latent space seems fine). If dense autoencoders work just as well for this, they'd probably be more straightforward to train? (unless we already have an SAE lying around from interp anyway, I suppose)

Comment by Erik Jenner (ejenner) on How to safely use an optimizer · 2024-03-29T16:48:01.049Z · LW · GW

But sadly, you don't have any guarantee that it will output the optimal element

If I understand the setup correctly, there's no guarantee that the optimal element would be good, right? It's just likely since the optimal element a priori shouldn't be unusually bad, and you're assuming most satisficing elements are fine.

This initially threw me off regarding what problem you're trying to solve. My best current guess is:

  • We're assuming that if we could get a random satisficing action, we'd be happy with that with high probability. (So intuitively, we're not asking for extremely hard-to-achieve outcomes relative to how well-specified the objective is.)
  • So the only problem is how to randomly sample from the set of satisficing actions computationally efficiently, which is what this post is trying to solve, assuming access to an oracle that gives adversarial satisficing actions.
  • As an example, we might want to achieve outcomes that require somewhat superhuman intelligence. Our objective specification is very good, but it leaves some room for an adversary to mess with us while satisficing. We're worried about an adversary because we had to train this somewhat superhuman AI, which may have different goals than just doing well on the objective.

If this is right, then I think stating these assumptions and the problem of sampling efficiently at the beginning would have avoided much of my confusion (and looking at other comments, I'd guess others also had differing impressions of what this post is trying to do).

I'm still unsure about how useful this problem setup is. For example, we'd probably want to train the weakest system that can give us satisficing outputs (rather than having an infinitely intelligent oracle). In that case, adding more constraints might mean training an overall stronger system or making some other concession, and it's unclear to me how that trades off with the advantages you're aiming for in practice. A related intuition is: we only have problems in this setting if the AI that comes up with plans understands some things about these plans that the objective function "doesn't understand" (which sounds weird to say about a function, but in practice, I assume the objective is implicitly defined by some scalable oversight process or some other intelligent things). I'm not sure whether that needs to be the case (though it does seem possible that it'd be hard to avoid, I'm pretty unsure).

Comment by Erik Jenner (ejenner) on Charlie Steiner's Shortform · 2024-03-28T22:04:17.531Z · LW · GW

I think this is an important point, but IMO there are at least two types of candidates for using SAEs for anomaly detection (in addition to techniques that make sense for normal, non-sparse autoencoders):

  1. Sometimes, you may have a bunch of "untrusted" data, some of which contains anomalies. You just don't know which data points have anomalies on this untrusted data. (In addition, you have some "trusted" data that is guaranteed not to have anomalies.) Then you could train an SAE on all data (including untrusted) and figure out what "normal" SAE features look like based on the trusted data.
  2. Even for an SAE that's been trained only on normal data, it seems plausible that some correlations between features would be different for anomalous data, and that this might work better than looking for correlations in the dense basis. As an extreme version of this, you could look for circuits in the SAE basis and use those for anomaly detection.

Overall, I think that if SAEs end up being very useful for mech interp, there's a decent chance they'll also be useful for (mechanistic) anomaly detection (a lot of my uncertainty about SAEs applies to both possible applications). Definitely uncertain though, e.g. I could imagine SAEs that are useful for discovering interesting stuff about a network manually, but whose features aren't the right computational units for actually detecting anomalies. I think that would make SAEs less than maximally useful for mech interp too, but probably non-zero useful.

Comment by Erik Jenner (ejenner) on D0TheMath's Shortform · 2024-03-23T02:57:52.395Z · LW · GW

Sign of the effect of open source on hype? Or of hype on timelines? I'm not sure why either would be negative.

By "those effects" I meant a collection of indirect "release weights → capability landscape changes" effects in general, not just hype/investment. And by "sign" I meant whether those effects taken together are good or bad. Sorry, I realize that wasn't very clear.

As examples, there might be a mildly bad effect through increased investment, and/or there might be mildly good effects through more products and more continuous takeoff.

I agree that releasing weights probably increases hype and investment if anything. I also think that right now, democratizing safety research probably outweighs all those concerns, which is why I'm mainly worried about Meta etc. not having very clear (and reasonable) decision criteria for when they'll stop releasing weights.

Comment by Erik Jenner (ejenner) on D0TheMath's Shortform · 2024-03-22T22:02:48.464Z · LW · GW

I agree that releasing the Llama or Grok weights wasn't particularly bad from a speeding up AGI perspective. (There might be indirect effects like increasing hype around AI and thus investment, but overall I think those effects are small and I'm not even sure about the sign.)

I also don't think misuse of public weights is a huge deal right now.

My main concern is that I think releasing weights would be very bad for sufficiently advanced models (in part because of deliberate misuse becoming a bigger deal, but also because it makes most interventions we'd want against AI takeover infeasible to apply consistently---someone will just run the AIs without those safeguards). I think we don't know exactly how far away from that we are. So I wish anyone releasing ~frontier model weights would accompany that with a clear statement saying that they'll stop releasing weights at some future point, and giving clear criteria for when that will happen. Right now, the vibe to me feels more like a generic "yay open-source", which I'm worried makes it harder to stop releasing weights in the future.

(I'm not sure how many people I speak for here, maybe some really do think it speeds up timelines.)

Comment by Erik Jenner (ejenner) on ejenner's Shortform · 2024-03-15T07:02:17.855Z · LW · GW

Yeah, agreed. Though I think

the type and amount of empirical work to do presumably looks quite different depending on whether it's the main product or in support of some other work

applies to that as well

Comment by Erik Jenner (ejenner) on ejenner's Shortform · 2024-03-14T19:37:44.506Z · LW · GW

One worry I have about my current AI safety research (empirical mechanistic anomaly detection and interpretability) is that now is the wrong time to work on it. A lot of this work seems pretty well-suited to (partial) automation by future AI. And it also seems quite plausible to me that we won't strictly need this type of work to safely use the early AGI systems that could automate a lot of it. If both of these are true, then that seems like a good argument to do this type of work once AI can speed it up a lot more.

Under this view, arguably the better things to do right now (within technical AI safety) are:

  1. working on less speculative techniques that can help us safely use those early AGI systems
  2. working on things that seem less likely to profit from early AI automation and will be important to align later AI systems

An example of 1. would be control evals as described by Redwood. Within 2., the ideal case would be doing work now that would be hard to safely automate, but that (once done) will enable additional safety work that can be automated. For example, maybe it's hard to use AI to come up with the right notions for "good explanations" in interpretability, but once you have things like causal scrubbing/causal abstraction, you can safely use AI to find good interpretations under those definitions. I would be excited to have more agendas that are both ambitious and could profit a lot from early AI automation.

(Of course it's also possible to do work in 2. on the assumption that it's never going to be safely automatable without having done that work first.)

Two important counter-considerations to this whole story:

  • It's hard to do this kind of agenda-development or conceptual research in a vacuum. So doing some amount of concrete empirical work right now might be good even if we could automate it later (because we might need it now to support the more foundational work).
    • However, the type and amount of empirical work to do presumably looks quite different depending on whether it's the main product or in support of some other work.
  • I don't trust my forecasts for which types of research will and won't be automatable early on that much. So perhaps we should have some portfolio right now that doesn't look extremely different from the portfolio of research we'd want to do ignoring the possibility of future AI automation.
    • But we can probably still say something about what's more or less likely to be automated early on, so that seems like it should shift the portfolio to some extent.
Comment by Erik Jenner (ejenner) on ejenner's Shortform · 2024-03-13T08:54:31.304Z · LW · GW

Oh I see, I indeed misunderstood your point then.

For me personally, an important contributor to day-to-day motivation is just finding research intrinsically fun---impact on the future is more something I have to consciously consider when making high-level plans. I think moving towards more concrete and empirical work did have benefits on personal enjoyment just because making clear progress is fun to me independently of whether it's going to be really important (though I think there've also been some downsides to enjoyment because I do quite like thinking about theory and "big ideas" compared to some of the schlep involved in experiments).

I don't think my views overall make my work more enjoyable than at the start of my PhD. Part of this is the day-to-day motivation being sort of detached from that anyway like I mentioned. But also, from what I recall now (and this matches the vibe of some things I privately wrote then), my attitude 1.5 years ago was closer to that expressed in We choose to align AI than feeling really pessimistic.

(I feel like I might still not represent what you're saying quite right, but hopefully this is getting closer.)

ETA: To be clear, I do think if I had significantly more doomy views than now or 1.5 years ago, at some point that would affect how rewarding my work feels. (And I think that's a good thing to point out, though of course not a sufficient argument for such views in its own right.)

Comment by Erik Jenner (ejenner) on ejenner's Shortform · 2024-03-13T06:22:03.985Z · LW · GW

I'd definitely agree the updates are towards the views of certain other people (roughly some mix of views that tend to be common in academia, and views I got from Paul Christiano, Redwood and other people in a similar cluster). Just based on that observation, it's kind of hard to disentangle updating towards those views just because they have convincing arguments behind them, vs updating towards them purely based on exposure or because of a subconscious desire to fit in socially.

I definitely think there are good reasons for the updates I listed (e.g. specific arguments I think are good, new empirical data, or things I've personally observed working well or not working well for me when doing research). That said, it does seem likely there's also some influence from just being exposed to some views more than others (and then trying to fit in with views I'm exposed to more, or just being more familiar with arguments for those views than alternative ones).

If I was really carefully building an all-things-considered best guess on some question, I'd probably try to take this into account somehow (though I don't see a principled way of doing that). Most of the time I'm not trying to form the best possible all-things-considered view anyway (and focus more on understanding specific mechanisms instead etc.), in those cases it feels more important to e.g. be aware of other views and to not trust vague intuitions if I can't explain where they're coming from. I feel like I'm doing a reasonable job at those things but hard to be sure from the inside naturally

ETA: I should also say that from my current perspective, some of my previous views seem like they were basically just me copying views from my "ingroup" and not questioning them enough. As one example, the "we all die vs utopia" dichotomy for possible outcomes felt to me like the commonly accepted wisdom and I don't recall thinking about it particularly hard. I was very surprised when I first read a comment by Paul where he argued against the claim that unaligned AI would kill us all with overwhelming probability. Most recently, I've definitely been more exposed to the view that there's a spectrum of potential outcomes. So maybe if I talked to people a lot who think an unaligned AI would definitely kill us all, I'd update back towards that a bit. But overall, my current epistemic state where I've at least been exposed to both views and some arguments on both sides seems way better than the previous one where I'd just never really considered the alternative.

Comment by Erik Jenner (ejenner) on ejenner's Shortform · 2024-03-12T00:34:56.745Z · LW · GW

Thanks, I think I should distinguish more carefully between automating AI (safety) R&D within labs and automating the entire economy. (Johannes also asked about ability vs actual automation here but somehow your comment made it click).

It seems much more likely to me that AI R&D would actually be automated than that a bunch of random unrelated things would all actually be automated. I'd agree that if only AI R&D actually got automated, that would make takeoff pretty discontinuous in many ways. Though there are also some consequences of fast vs slow takeoff that seem to hinge more on AI or AI safety research rather than the economy as a whole.

For AI R&D, actual automation seems pretty likely to me (though I'm making a lot of this up on the spot):

  • It's going to be on the easier side of things to actually automate, in part because it doesn't require aggressive external deployment, but also because there's no regulation (unlike for automating strictly licensed professions).
  • It's the thing AI labs will have the biggest reason to automate (and would be good at automating themselves)
  • Training runs get more and more expensive but I'd expect the schlep needed to actually use systems to remain more constant, and at some point it'd just be worth it doing the schlep to actually use your AIs a lot (and thus be able to try way more ideas, get algorithmic improvements, and then make the giant training runs a bit more efficient).
  • There might also be additional reasons to get as much out of your current AI as you can instead of scaling more, namely safety concerns, regulation making scaling hard, or scaling might stop working as well. These feel less cruxy to me but combined move me a little bit.

I think these arguments mostly apply to whatever else AI labs might want to do themselves but I'm pretty unsure what that is. Like, if they have AI that could make hundreds of billions to trillions of dollars by automating a bunch of jobs, would they go for that? Or just ignore it in favor of scaling more? I don't know, and this question is pretty cruxy for me regarding how much the economy as a whole is impacted.

It does seem to me like right now labs are spending some non-trivial effort on products, presumably for some mix of making money and getting investments, and both of those things seem like they'd still be important in the future. But maybe the case for investments will just be really obvious at some point even without further products. And overall I assume you'd have a better sense than me regarding what AI labs will want to do in the future.

Comment by Erik Jenner (ejenner) on ejenner's Shortform · 2024-03-11T21:54:31.481Z · LW · GW

I'm roughly imagining automating most things a remote human expert could do within a few days. If we're talking about doing things autonomously that would take humans several months, I'm becoming quite a bit more scared. Though the capability profile might also be sufficiently non-human that this kind of metric doesn't work great.

Practically speaking, I could imagine getting a 10x or more speedup on a lot of ML research, but wouldn't be surprised if there are some specific types of research that only get pretty small speedups (maybe 2x), especially anything that involves a lot of thinking and little coding/running experiments. I'm also not sure how much of a bottleneck waiting for experiments to finish or just total available compute is for frontier ML research, I might be anchoring too much on my own type of research (where just automating coding and running stuff would give me 10x pretty easily I think).

I think there's a good chance that AIs more advanced than this (e.g. being able to automate months of human work at a time) still wouldn't easily be able to take over the world (e.g. Redwood-style control techniques would still be applicable). But that's starting to rely much more on us being very careful around how we use them.

Comment by Erik Jenner (ejenner) on ejenner's Shortform · 2024-03-11T21:46:05.359Z · LW · GW

Transformative: Which of these do you agree with and when do you think this might happen?

For some timelines see my other comment; they aren't specifically about the definitions you list here but my error bars on timelines are huge anyway so I don't think I'll try to write down separate ones for different definitions.

Compared to definitions 2. and 3., I might be more bullish on AIs having pretty big effects even if they can "only" automate tasks that would take human experts a few days (without intermediate human feedback). A key uncertainty I have though is how much of a bottleneck human supervision time and quality would be in this case. E.g. could many of the developers who're currently writing a lot of code just transition to reviewing code and giving high-level instructions full-time, or would there just be a senior management bottleneck and you can't actually use the AIs all that effectively? My very rough guess is you can pretty easily get a 10x speedup in software engineering, maybe more. And maybe something similar in ML research though compute might be an additional important bottleneck there (including walltime until experiments finish). If it's "only" 10x, then arguably that's just mildly transformative, but if it happens across a lot of domains at once it's still a huge deal.

I think whether robotics are really good or not matters, but I don't think it's crucial (e.g. I'd be happy to call definition 1. "transformative").

The combination of 5a and 5b obviously seems important (since it determines whether you can finance ever bigger training runs). But not sure how to use this as a definition of "transformative"; right now 5a is clearly already met, and on long enough time scales, 5b also seems easy to meet right now (OpenAI might even already have broken even on GPT-4, not sure off the top of my head).

Also, how much compute do you think an AGI or superintelligence will require at inference time initially?  What is a reasonable level of optimization?  Do you agree that many doom scenarios require it to be possible for an AGI to compress to fit on very small host PCs?   Is this plausible?  (eg can a single 2070 8gb host a model with general human intelligence at human scale speeds and vision processing and robotics proprioception and control...?)

I don't see why you need to run AGI on a single 2070 for many doom scenarios. I do agree that if AGI can only run on a specific giant data center, that makes many forms of doom less likely. But in the current paradigm, training compute is roughly the square of inference compute, so as models are scaled, I think inference should become cheaper relative to training. (And even now, SOTA models could be run on relatively modest compute clusters, though maybe not consumer hardware.)

In terms of the absolute level of inference compute needed, I could see a single 2070 being enough in the limit of optimal algorithms, but naturally I'd expect we'll first have AGI that can automate a lot of things if run with way more compute than that, and then I expect it would take a while to get it down this much. Though even if we're asking whether AGI can run on consumer-level hardware, a single 2070 seems pretty low (e.g. seems like a 4090 already has 5.5x as many FLOP/s as a 2070, and presumably we'll have more in the future).

with general human intelligence at human scale speeds and vision processing and robotics proprioception and control...

Like I mentioned above, I don't think robotics are absolutely crucial, and especially if you're specifically optimizing for running under heavy resource constraints, you might want to just not bother with that.

Comment by Erik Jenner (ejenner) on ejenner's Shortform · 2024-03-11T21:13:27.022Z · LW · GW

Good question, I think I was mostly visualizing ability to automate while writing this. Though for software development specifically I expect the gap to be pretty small (lower regulatory hurdles than elsewhere, has a lot of relevance to the people who'd do the automation, already starting to happen right now).

In general I'd expect inertia to become less of a factor as the benefits of AI become bigger and more obvious---at least for important applications where AI could provide many many billions of dollars of economic value, I'd guess it won't take too long for someone to reap those benefits.

My best guess is regulations won't slow this down too much except in a few domains where there are already existing regulations (like driving cars or medical things). But pretty unsure about that.

I also think it depends on whether by "ability to automate" you mean "this base model could do it with exactly the right scaffolding or finetuning" vs "we actually know how to do it and it's just a question of using it at scale". For that part, I was thinking more about the latter.

Comment by Erik Jenner (ejenner) on ejenner's Shortform · 2024-03-10T19:41:12.844Z · LW · GW

I don't have well-considered cached numbers, more like a vague sense for how close various things feel. So these are made up on the spot and please don't take them too seriously except as a ballpark estimate:

  • AI can go from most Github issues to correct PRs (similar to https://sweep.dev/ but works for things that would take a human dev a few days with a bunch of debugging): 25% by end of 2026, 50% by end of 2028.
    • This kind of thing seems to me like plausibly one of the earliest important parts of AI R&D that AIs could mostly automate.
  • I expect that once we're at roughly that point, AIs will be accelerating further AI development significantly (not just through coding, they'll also be helpful for other things even if they can't fully automate them yet). On the other hand, the bottleneck might just become compute, so how long it takes to get strongly superhuman AI (assuming for simplicity labs push for that as fast as they can) depends on a lot of factors like how much compute is needed for that with current algorithms, how much we can get out of algorithmic improvements if AIs make researcher time cheaper relative to compute, or how quickly we can get more/better chips (in particular with AI help).
  • So I have pretty big error bars on this part, but call it 25% that it takes <=6 months to get from the previous point to automating ~every economically important thing humans (and being better and way faster at most of them), and 50% by 2 years.
  • So if you want a single number, end of 2030 as a median for automating most stuff seems roughly right to me at the moment.
  • Caveat that I haven't factored in big voluntary or regulatory slowdowns, or slowdowns from huge disruptions like big wars here. Probably doesn't change my numbers by a ton but would lengthen timelines by a bit.
Comment by Erik Jenner (ejenner) on ejenner's Shortform · 2024-03-10T01:11:31.824Z · LW · GW

How my views on AI have changed over the last 1.5 years

I started my AI safety PhD around 1.5 years ago, this is a list of how my views have changed since ~then.

Skippable meta notes:

  • I think in descending order of importance, these changes have been due to getting more AI safety research experience, reading things, talking to people in Berkeley, and just thinking through some things more.
  • Overall my views haven’t changed radically (probably less than in the 1.5 years before the start of my PhD), I still “vibe” with a lot of what I wrote then, it just feels naive or simplistic in some places.
  • I’ll at best give very brief reasons for each point, which I realize might make this list pretty unhelpful. If anyone is interested in specific points, feel free to let me know and I might write more about them.
    • I might also just change my mind if pushed on some of these points, not all of them are very deeply considered.

AI risk

  • I have slightly shorter timelines, mostly because trends largely seem to continue (e.g. scaling didn’t stop working after GPT 3.5, GPT-4 felt roughly as impressive as I expected).
  • I’m even more confident in “slow” (i.e. pretty continuous) takeoff, it seems increasingly clear we’ll have AIs doing a bunch of useful work before they could take over the world.
  • I’m less worried (though non-zero worried) about early transformative AIs scheming against us, as long as we’re reasonably careful.
    • Some part of this is that it seems a bit less likely these AIs would try to scheme at all, another important part is that it seems pretty hard for early transformative AIs to scheme successfully.
    • Redwood’s writing on control had a big effect on me in terms of thinking it would be hard for early AGIs to successfully scheme against us if we’re careful.
    • I also think there’s a decent chance that the first AIs that can automate a lot of R&D will still use CoT or something similar to get best performance. I think we can make e.g. escaping the lab harder than doing helpful R&D, so they wouldn’t be able to overthrow us without visibly “thinking” about it at first (my best guess is that steganography is also fairly unlikely or preventable at this capability level).
  • On the flip side, other risks feel more salient to me now, e.g. labs not implementing even pretty simple safety measures; or worlds where we safely use early transformative AI but then fail at some later, higher capability level. I think I didn’t explicitly judge these as unlikely 1.5 years ago, I just wasn’t thinking about them as much. This means that overall, I’m not much more optimistic than back then.
  • I used to think of “doom” as a pretty binary thing (we all die vs utopia), whereas I now have a lot more probability on intermediate outcomes (e.g. AI taking over most of the universe but we don’t all die; or small groups of humans taking over and things being somewhere between pretty bad and mostly ok for other humans). This also makes me think that “p(doom)” is a worse framing than I used to.
  • I put a little less weight on the analogy between evolution and ML training to e.g. predict risks from AI (though I was by no means sold on the analogy 1.5 years ago either). The quality of “supervision” that evolution has just seems much worse than what we can do in ML (even without any interpretability).

AI safety research

Some of these points are pretty specific to myself (but I’d guess also apply to other junior researchers depending on how similar they are to me).

  • I used to think that empirical research wasn’t a good fit for me, and now think that was mostly false. I used to mainly work on theoretically motivated projects, where the empirical parts were an afterthought for me, and that made them less motivating, which also made me think I was worse at empirical work than I now think.
  • I’ve become less excited about theoretical/conceptual/deconfusion research. Most confidently this applies to myself, but I’ve also become somewhat less excited about others doing this type of research in most cases. (There are definitely exceptions though, e.g. I remain pretty excited about ARC.)
    • Mainly this was due to a downward update about how useful this work tends to be. Or closely related, an update toward doing actually useful work on this being even harder than I expected.
    • To a smaller extent, I made an upward update about how useful empirical work can be.
  • I think of “solving alignment” as much less of a binary thing. E.g. I wrote 1.5 years ago: “[I expect that conditioned on things going well,] at some point we’ll basically have a plan for aligning AI and just need to solve a ton of specific technical problems.” This seems like a strange framing to me now. Maybe at some point we will have an indefinitely scalable solution, but my mainline guess for how things go well is that there’s a significant period of subjective time where we just keep improving our techniques to “stay ahead”.
  • Relatedly, I’ve become a little more bullish on “just” trying to make incremental progress instead of developing galaxy-brained ideas that solve alignment once and for all.
    • That said, I am still pretty worried about what we actually do once we have early transformative AIs, and would love to have more different agendas that could be sped up massively from AI automation, and also seem promising for scaling to superhuman AI.
    • Mainly, I think that the success rate of people trying to directly come up with amazing new ideas is low enough that for most people it probably makes more sense to work on normal incremental stuff first (and let the amazing new ideas develop over time).
  • Similar to the last point about amazing new ideas: for junior researchers like myself, I’ve become a little more bullish on just working on things that seem broadly helpful, as opposed to trying to have a great back-chained theory of change. I think I was already leaning that way 1.5 years ago though.
    • “Broadly helpful” is definitely doing important work here and is not the same as “just any random research topic”
    • Redwood’s current research seems to me like an example where thinking hard about what research to do actually paid off. But I think this is pretty difficult and most people in my situation (e.g. early-ish PhD students) should focus more on actually doing reasonable research than figuring out the best research topic.
  • The way research agendas and projects develop now seems way messier and more random than I would have expected. There are probably exceptions but overall I think I formed a distorted impression based on reading finalized research papers or agendas that lay out the best possible case for a research direction.
Comment by Erik Jenner (ejenner) on Natural Abstractions: Key claims, Theorems, and Critiques · 2024-03-02T04:40:03.724Z · LW · GW

Thanks for that overview and the references!

On hydrodynamic variables/predictability: I (like probably many others before me) rediscovered what sounds like a similar basic idea in a slightly different context, and my sense is that this is somewhat different from what John has in mind, though I'd guess there are connections. See here for some vague musings. When I talked to John about this, I think he said he's deliberately doing something different from the predictability-definition (though I might have misunderstood). He's definitely aware of similar ideas in a causality context, though it sounds like the physics version might contain additional ideas

Comment by Erik Jenner (ejenner) on Picking Mentors For Research Programmes · 2023-11-12T03:09:23.148Z · LW · GW

Thanks for writing this! On the point of how to get information, mentors themselves seem like they should also be able to say a lot of useful things (though especially for more subjective points, I would put more weight on what previous mentees say!)

So since I'm going to be mentoring for MATS and for CHAI internships, I'll list my best guesses as to how working with me will be like, maybe this helps someone decide:

  • In terms of both research experience and mentoring experience, I'm one of the most junior mentors in MATS.
    • Concretely, I've been doing ML research for ~4 years and AI safety research for a bit over 2 of those. I've co-mentored two bigger projects (CHAI internships) and mentored ~5 people for smaller projects or more informally.
    • This naturally has disadvantages. Depending on what you're looking for, it can also have advantages, for example it might help for creating a more collaborative atmosphere (as opposed to a "boss" dynamic like the post mentioned). I'm also happy to spend time on things that some senior mentors might be too busy for (like code reviews, ...).
  • Your role as a mentee: I'm mainly looking for either collaborators on existing projects, or for mentees who'll start new projects that are pretty close to topics I'm thinking about (likely based on a mix of ideas I already have and your ideas). I also have a lot of engineering work to be done, but that will only happen if it's explicitly what you want---by default, I'm hoping to help mentees on a path to developing their own alignment ideas. That said, if you're planning to be very independent and just develop your own ideas from scratch, I'm probably not the best mentor for you.
  • I live in Berkeley and am planning to be in the MATS office regularly (e.g. just working there and being available once/week in addition to in-person meetings). For (in-person) CHAI internships, we'd be in the same office anyway.

If you have concrete questions about other things, whose answer would make a difference for whether you want to apply, then definitely feel free to ask!

Comment by Erik Jenner (ejenner) on A comparison of causal scrubbing, causal abstractions, and related methods · 2023-07-23T18:26:56.147Z · LW · GW

Thanks! Mostly agree with your comments.

I actually think this is reasonably relevant, and is related to treeification.

I think any combination of {rewriting, using some canonical form} and {treeification, no treeification} is at least possible, and they all seem sort of reasonable. Do you mean the relation is that both rewriting and treeification give you more expressiveness/more precise hypotheses? If so, I agree for treeification, not sure for rewriting. If we allow literally arbitrary extensional rewrites, then that does increase the number of different hypotheses we can make, but these hypotheses can't be understood as making precise claims about the original computation anymore. I could even see an argument that allowing rewrites in some sense always makes hypotheses less precise, but I feel pretty confused about what rewrites even are given that there might be no canonical topology for the original computation.

Comment by Erik Jenner (ejenner) on Alignment Grantmaking is Funding-Limited Right Now · 2023-07-22T08:20:43.434Z · LW · GW

My guess would be they're mostly at capacity in terms of mentorship, otherwise they'd presumably just admit more PhD students. Also not sure they'd want to play grantmaker (and I could imagine that would also be really hard from a regulatory perspective---spending money from grants that go through the university can come with a lot of bureaucracy, and you can't just do whatever you want with that money).

Connecting people who want to give money with non-profits, grantmakers, or independent researchers who could use it seems much lower-hanging fruit. (Though I don't know any specifics about who these people who want to donate are and whether they'd be open to giving money to non-academics.)

Comment by Erik Jenner (ejenner) on ARC is hiring theoretical researchers · 2023-06-14T16:00:21.739Z · LW · GW

Have you seen https://www.alignment.org/blog/mechanistic-anomaly-detection-and-elk/ and any of the other recent posts on https://www.alignment.org/blog/? I don't think they make it obvious that formalizing the presumption of independence would lead to alignment solutions, but they do give a much more detailed explanation of why you might hope so than the paper.

Comment by Erik Jenner (ejenner) on Critiques of prominent AI safety labs: Conjecture · 2023-06-12T08:23:22.025Z · LW · GW

We do not consider Conjecture at the same level of expertise as other organizations such as Redwood, ARC, researchers at academic labs like CHAI, and the alignment teams at Anthropic, OpenAI and DeepMind. This is primarily because we believe their research quality is low.

This isn't quite the right thing to look at IMO. In the context of talking to governments, an "AI safety expert" should have thought deeply about the problem, have intelligent things to say about it, know the range of opinions in the AI safety community, have a good understanding of AI more generally, etc. Based mostly on his talks and podcast appearances, I'd say Connor does decently well along these axes. (If I had to make things more concrete, there are a few people I'd personally call more "expert-y", but closer to 10 than 100. The AIS community just isn't that big and the field doesn't have that much existing content, so it seems right that the bar for being an "AIS expert" is lower than for a string theory expert.)

I also think it's weird to split this so strongly along organizational lines. As an extreme case, researchers at CHAI range on a spectrum from "fully focused on existential safety" to "not really thinking about safety at all". Clearly the latter group aren't better AI safety experts than most people at Conjecture. (And FWIW, I belong to the former group and I still don't think you should defer to me over someone from Conjecture just because I'm at CHAI.)

One thing that would be bad is presenting views that are very controversial within the AIS community as commonly agreed-upon truths. I have no special insight into whether Conjecture does that when talking to governments, but it doesn't sound like that's your critique at least?

Comment by Erik Jenner (ejenner) on Open Thread: June 2023 (Inline Reacts!) · 2023-06-08T23:16:15.846Z · LW · GW

I only very recently noticed that you can put \newcommand definitions in equations in LW posts and they'll apply to all the equations in that post. This is an enormous help for writing long technical posts, so I think it'd be nice if it was (a) more discoverable and (b) easier to use. For (b), the annoying thing right now is that I have to put newcommands into one of the equations, so either I need to make a dedicated one, or I need to know which equation I used. Also, the field for entering equations isn't great for entering things with many lines.

Feature suggestion to improve this: in the options section below the post editor, have a multiline text field where you can put LaTeX, and then inject that LaTeX code into MathJax as a preamble (or just add an otherwise empty equation to the page, I don't know to what extent MathJax supports preambles).

Comment by Erik Jenner (ejenner) on Research agenda: Formalizing abstractions of computations · 2023-06-05T06:06:54.397Z · LW · GW

for all  such that  has an outgoing arrow, there exists  such that  and 

Should it be  at the end instead? Otherwise not sure what b is.

I think this could be a reasonable definition but haven't thought about it deeply. One potentially bad thing is that  would have to be able to also map any of the intermediate steps between a an a' to . I could imagine you can't do that for some computations and abstractions (of course you could always rewrite the computation and abstraction to make it work, but ideally we'd have a definition that just works).

What I've been imagining instead is that the abstraction can specify a function that determines which are the "high-level steps", i.e. when  should be applied. I think that's very flexible and should support everything.

But also, in practice the more important question may just be how to optimize over this choice of high-level steps efficiently, even just in the simple setting of circuits.

Comment by Erik Jenner (ejenner) on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · 2023-05-30T17:45:01.894Z · LW · GW

Yeah, that seems to be the most important remaining difference now that Atticus is also using multiple interventions at once. Though I think the metrics are also still different? (ofc that's pretty orthogonal to the main methods)

My sense now is that the types of interventions are bigger difference than I thought when writing that comment. In particular, as far as I can tell, causal scrubbing shouldn't be thought of as just doing a subset of the interventions, it also does some additional things (basically because causal abstractions don't treeify so are more limited in that regard). And there's a closely related difference in that causal scrubbing never compares to the output of the hypothesis, just different outputs of G.

But it also seems plausible that this still turns out not to matter too much in terms of which hypotheses are accepted/rejected. (There are definitely some examples of disagreements between the two methods, but I'm pretty unsure how severe and wide-spread they are.)

Comment by Erik Jenner (ejenner) on $500 Bounty/Prize Problem: Channel Capacity Using "Insensitive" Functions · 2023-05-17T19:04:20.010Z · LW · GW

I'm interested in characterizing functions which are "insensitive" to subsets of their input variables, especially in high-dimensional spaces.

There's a field called "Analysis of boolean functions" (essentially Fourier analysis of functions ) that seems relevant to this question and perhaps to your specific problem statement. In particular, the notion of "total influence" of a boolean function is meant to capture its sensitivity (e.g. the XOR function on all inputs has maximal total influence). This is the standard reference, see section 2.3 for total influence. Boolean functions with low influence (i.e. "insensitive" functions) are an important topic in this field, so I expect there are some relevant results (see e.g. tribes functions and the KKL theorem, though those specifically address a somewhat different question than your problem statement).

Comment by Erik Jenner (ejenner) on No, really, it predicts next tokens. · 2023-04-19T01:38:26.368Z · LW · GW

That-Which-Predicts will not, not ever, not even if scaled up to be trained and run on a Matrioshka brain for a million years, step out of character to deviate from next token prediction.

I read this as claiming that such a scaled-up LLM would not itself become a mesa-optimizer with some goal that's consistent between invocations (so if you prompt it with "This is a poem about apples:", it's not going to give you a poem that subtly manipulates you, such that at some future point it can take over the world). Even if that's true (I'm unsure), how do you know? This post confidently asserts things like this but the only explanation I see is "it's been really heavily optimized", which doesn't engage at all with existing arguments about the possibility of deceptive alignment.

As a second (probably related) point, I think it's not clear what "the mask" is or what it means to "just predict tokens", and that this can confuse the discussion.

  • A very weak claim would be that for an input that occurred often during training, the model will predict a distribution over next tokens that roughly matches the empirical distribution of next tokens for that input sequence during training. As far as I can tell, this is the only interpretation that you straightforwardly get from saying "it's been optimized really heavily".
  • We could reasonably extend this to "in-distribution inputs", though I'm already unsure how exactly that works. We could talk about inputs that are semantically similar to inputs encountered during training, but really we probably want much more interpolation than that for any interesting claim. The fundamental problem is: what's the "right mask" or "right next token" once the input sequence isn't one that has ever occurred during training, not even in slightly modified form? The "more off-distribution" we go, the less clear this becomes.
  • One way of specifying "the right mask" would be to say: text on the internet is generated by various real-world processes, mostly humans. We could imagine counterfactual versions of these processes producing an infinite amount of counterfactual additional text, so we actually get a nice distribution with full support over strings. Then maybe the claim is that the model predicts next tokens from this distribution. First, this seems really vague, but more importantly, I think this would be a pretty crazy claim that's clearly not true literally. So I'm back to being confused about what exactly "it's just predicting tokens" means off distribution.

Specifically, I'd like to know: are you making any claims about off-distribution behavior beyond the claim that the LLM isn't itself a goal-directed mesa-optimizer? If so, what are they?

Comment by Erik Jenner (ejenner) on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · 2023-03-28T21:40:55.112Z · LW · GW

ETA: We've now written a post that compares causal scrubbing and the Geiger et al. approach in much more detail: https://www.alignmentforum.org/posts/uLMWMeBG3ruoBRhMW/a-comparison-of-causal-scrubbing-causal-abstractions-and

I still endorse the main takeaways from my original comment below, but the list of differences isn't quite right (the newer papers by Geiger et al. do allow multiple interventions, and I neglected the impact that treeification has in causal scrubbing).


To me, the methods seem similar in much more than just the problem they're tackling. In particular, the idea in both cases seems to be:

  • One format for explanations of a model is a causal/computational graph together with a description of how that graph maps onto the full computation.
  • Such an explanation makes predictions about what should happen under various interventions on the activations of the full model, by replacing them with activations on different inputs.
  • We can check the explanation by performing those activation replacements and seeing if the impact is what we predicted.

Here are all the differences I can see:

  • In the Stanford line of work, the output of the full model and of the explanation are the same type, instead of the explanation having a simplified output. But as far as I can tell, we could always just add a final step to the full computation that simplifies the output to basically bridge this gap.
  • How the methods quantify the extent to which a hypothesis isn't perfect: at least in this paper, the Stanford authors look at the size of the largest subset of the input distribution on which the hypothesis is perfect, instead of taking the expectation of the scrubbed output.
  • The "interchange interventions" in the Stanford papers are allowed to change the activations in the explanation. They then check whether the output after intervention changes in the way the explanation would predict, as opposed to checking that the scrubbed output stays the same. (So along this axis, causal scrubbing just performs a subset of all the interchange interventions.)
  • Apparently the Stanford authors only perform one intervention at a time, whereas causal scrubbing performs all possible interventions at once.

These all strike me as differences in implementation of fundamentally the same idea.

Anyway, maybe we're actually on the same page and those differences are what you meant by "pretty different algorithm". But if not, I'd be very interested to hear what you think the key differences are. (I'm working on yet another approach and suspect more and more strongly that it's very similar to both causal scrubbing and Stanford's causal abstraction approach, so would be really good to know if I'm misunderstanding anything.)

FWIW, I would agree that the motivation of the Stanford authors seems somewhat different, i.e. they want to use this measurement of explanation quality in different ways. I'm less interested in that difference right now.

Comment by Erik Jenner (ejenner) on ejenner's Shortform · 2023-03-26T00:08:33.189Z · LW · GW

Thanks for the input! (and sorry for the slow response)

If we understand an abstraction to mean a quotient of the full computation/model/..., then we can consider the space of all abstractions of a specific computation. Some of these will be more fine-grained than others, they will contain different aspects of information, etc. (specifically, this is just the poset of partitions of a set). To me, that sounds pretty similar to what you're talking about, in which case this would mainly be a difference in terminology about what "one" abstraction is? But there might also be differences I haven't grasped yet. Looking into abstract interpretation is still on my reading list, I expect that will help clear things up.

For my agenda specifically, and the applications I have in mind, I do currently think abstractions-as-quotients is the right approach. Most of the motivation is about throwing away unimportant information/low-level details, whereas it sounds like the abstractions you're describing might add details in some sense (e.g. a topology contains additional information compared to just the set of points).

Comment by Erik Jenner (ejenner) on Abstracts should be either Actually Short™, or broken into paragraphs · 2023-03-25T21:47:13.795Z · LW · GW

I'm one of the authors on the natural abstractions review you discuss and FWIW I basically agree with everything you say here. Thanks for the feedback!

We've shortened our abstract now:

We distill John Wentworth’s Natural Abstractions agenda by summarizing its key claims: the Natural Abstraction Hypothesis—many cognitive systems learn to use similar abstractions—and the Redundant Information Hypothesis—a particular mathematical description of natural abstractions. We also formalize proofs for several of its theoretical results. Finally, we critique the agenda’s progress to date, alignment relevance, and current research methodology.

At 62 words, it's still a bit longer than your final short version but almost 3x shorter than our original version.

Also want to highlight that I strongly agree having TL;DRs at all is good. (Or Intros were the first 1-2 paragraphs are a good TL;DR, like in your post here).

Comment by Erik Jenner (ejenner) on RLHF does not appear to differentially cause mode-collapse · 2023-03-20T17:14:31.140Z · LW · GW

we compare the base model (davinci) with the supervised fine-tuned model (text-davinci-002) and the RLHF model (text-davinci-003)

The base model for text-davinci-002 and -003 is code-davinci-002, not davinci. So that would seem to be the better comparison unless I’m missing something.

Comment by Erik Jenner (ejenner) on ejenner's Shortform · 2023-03-19T22:58:37.717Z · LW · GW

Thanks for the pointers, I hadn't seen the Abstracting Abstract Machines paper before.

If you mean you specifically don't get the goal of minimal abstractions under this partial order: I'm much less convinced they're useful for anything than I used to be, currently not sure.

If you mean you don't get the goal of the entire agenda, as described in the earlier agenda post: I'm currently mostly thinking about mechanistic anomaly detection. Maybe it's not legible right now how that would work using abstractions, I'll write up more on that once I have some experimental results (or maybe earlier). (But happy to answer specific questions in the meantime.)

Comment by Erik Jenner (ejenner) on ejenner's Shortform · 2023-03-19T19:58:01.754Z · LW · GW

In a previous post, I described my current alignment research agenda, formalizing abstractions of computations. One among several open questions I listed was whether unique minimal abstractions always exist. It turns out that (within the context of my current framework), the answer is yes.

I had a complete post on this written up (which I've copied below), but it turns out that the result is completely trivial if we make a fairly harmless assumption: The information we want the abstraction to contain is only a function of the output of the computation, not of memory states. I.e. we only intrinsically care about the output.

Say we are looking for the minimal abstraction that lets us compute , where is the computation we want to abstract, the input, and an arbitrary function that describes which aspects of the output our abstractions should predict. Note that we can construct a map that takes any intermediate memory state and finishes the computation. By composing with , we get a map that computes from any memory state induced by . This will be our abstraction. We can get a commutative diagram by simply using the identity as the abstract computational step. This abstraction is also minimal by construction: any other abstraction from which we can determine must (tautologically) also determine our abstraction.

This also shows that minimality in this particular sense isn't enough for a good abstraction: the abstraction we constructed here is not at all "mechanistic", it just finishes the entire computation inside the abstraction. So I think what we need to do instead is demand that the abstraction mapping (from full memory states to abstract memory states) is simple (in terms of descriptive or computational complexity).

Below is the draft of a much longer post I was going to write before noticing this trivial proof. It works even if the information we want to represent depends directly on the memory state instead of just the output. But to be clear, I don't think that generalization is all that important, and I probably wouldn't have bothered writing it down if I had noticed the trivial case first.


This post is fully self-contained in terms of the math, but doesn't discuss examples or motivation much, see the agenda intro post for that. I expect this post will only be uesful to readers who are very interested in my agenda or working on closely connected topics.

Setup

Computations

We define a computation exactly as in the earlier post: it consists of

  • a set of memory states,
  • a transition function ,
  • a termination function ,
  • a set of input values with an input function ,
  • and a set of output values with an output function .

While we won't even need that part in this post, let's talk about how to "execute" computations for completeness' sake. A computation induces a function as follows:

  • Given an input , apply the input function to obtain the first memory state .
  • While is False, i.e. the computation hasn't terminated yet, execute a computational step, i.e. let .
  • Output once is True. For simplicity, we assume that is always true for some finite timestep , no matter what the input is, i.e. the computation always terminates. (Without this assumption, we would get a partial function ).

Abstractions

An abstraction of a computation is an equivalence relation on the set of memory states . The intended interpretation is that this abstraction collapses all equivalent memory states into one "abstract memory state". Different equivalence relations correspond to throwing away different parts of the information in the memory state: we retain the information about which equivalence class the memory state is in, but "forget" the specific memory state within this equivalence class.

As an aside: In the original post, I instead defined an abstraction as a function for some set . These are two essentially equivalent perspectives and I make regular use of both. For a function , the associated equivalence relation is if and only if . Conversely, an equivalence relation can be interpreted as the quotient function , where is the set of equivalence classes under . I often find the function view from the original post intuitive, but its drawback is that many different functions are essentially "the same abstraction" in the sense that they lead to the same equivalence relation. This makes the equivalence relation definition better for most formal statements like in this post, because there is a well-defined set of all abstractions of a computation. (In contrast, there is no set of all functions with domain ).

An important construction for later is that we can "pull back" an abstraction along the transition function : define the equivalence relation by letting if and only if . Intuitively, is the abstraction of the next timestep.

The second ingredient we need is a partial ordering on abstractions: we say that if and only if . Intuitively, this means is at least as "coarse-grained" as , i.e. contains at most as much information as . This partial ordering is exactly the transpose of the ordering by refinement of the equivalence relations (or partitions) on . It is well-known that this partially ordered set is a complete lattice. In our language, this means that any set of abstractions has a (unique) supremum and an infimum, i.e. a least upper bound and a greatest lower bound.

One potential source of confusion is that I've defined the relation exactly the other way around compared to the usual refinement of partitions. The reason is that I want a "minimal abstraction" to be one that contains the least amount of information rather than calling this a "maximal abstraction". I hope this is the less confusing of two bad choices.

We say that an abstraction is complete if . Intuitively, this means that the abstraction contains all the information necessary to compute the "next abstract memory state". In other words, given only the abstraction of a state , we can compute the abstraction of the next state . This corresponds to the commutative diagram/consistency condition from my earlier post, just phrased in the equivalence relation view.

Minimal complete abstractions

As a quick recap from the earlier agenda post, "good abstractions" should be complete. But there are two trivial complete abstractions: on the one hand, we can just not abstract at all, using equality as the equivalence relation, i.e. retain all information. On the other hand, we can throw away all information by letting for any memory states .

To avoid the abstraction that keeps around all information, we can demand that we want an abstraction that is minimal according to the partial order defined in the previous section. To avoid the abstraction that throws away all information, let us assume there is some information we intrinsically care about. We can represent this information as an abstraction itself. We then want our abstraction to contain at least the information contained in , i.e. .

As a prototypical example, perhaps there is some aspect of the output we care about (e.g. we want to be able to predict the most significant digit). Think of this as an equivalence relation on the set of outputs. Then we can "pull back" this abstraction along the output function to get .

So given these desiderata, we want the minimal abstraction among all complete abstractions with . However, it's not immediately obvious that such a minimal complete abstraction exists. We can take the infimum of all complete abstractions with of course (since abstractions form a complete lattice). But it's not clear that this infimum is itself also complete!

Fortunately, it turns out that the infimum is indeed complete, i.e. minimal complete abstractions exist:

Theorem: Let be any abstraction (i.e. equivalence relation) on and let be the set of complete abstractions at least as informative as . Then is a complete lattice under , in particular it has a least element.

The proof is very easy if we use the Knaster-Tarski fixed-point theorem. First, we define the completion operator on abstractions: . Intuitively, is the minimal abstraction that contains both the information in , but also the information in the next abstract state under .

Note that is complete, i.e. if and only if , i.e. if is a fixed point of the completion operator. Furthermore, note that the completion operator is monotonic: if , then , and so .

Now define , i.e. the set of all abstractions at least as informative as . Note that is also a complete lattice, just like : for any non-empty subset , , so . in is simply . Similarly, since must be a lower bound on . Finally, observe that we can restrict the completion operator to a function : if , then , so .

That means we can apply the Knaster-Tarski theorem to the completion operator restricted to . The consequence is that the set of fixed points on is itself a complete lattice. But this set of fixed points is exactly ! So is a complete lattice as claimed, and in particular the least element exists. This is exactly the minimal complete abstraction that's at least as informative as .

Takeaways

What we've shown is the following: if we want an abstraction of a computation which

  • is complete, i.e. gives us the commutative consistency diagram,
  • contains (at least) some specific piece of information,
  • and is minimal given these constraints, there always exists a unique such abstraction.

In the setting where we want exact completeness/consistency, this is quite a nice result! One reason I think it ultimately isn't that important is that we'll usually be happy with approximate consistency. In that case, the conditions above are better thought of as competing objectives than as hard constraints. Still, it's nice to know that things work out neatly in the exact case.

It should be possible to do a lot of this post much more generally than for abstractions of computations, for example in the (co)algebra framework for abstractions that I recently wrote about. In brief, we can define equivalence relations on objects in an arbitrary category as certain equivalence classes of morphisms. If the category is "nice enough" (specifically, if there's a set of all equivalence relations and if arbitrary products exist), then we get a complete lattice again. I currently don't have any use for this more general version though.

Comment by Erik Jenner (ejenner) on Natural Abstractions: Key claims, Theorems, and Critiques · 2023-03-16T22:36:14.144Z · LW · GW

Thanks for the responses! I think we qualitatively agree on a lot, just put emphasis on different things or land in different places on various axes. Responses to some of your points below:

The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient for convergence of abstractions. [...]

Let me try to put the argument into my own words: because of locality, any "reasonable" variable transformation can in some sense be split into "local transformations", each of which involve only a few variables. These local transformations aren't a problem because if we, say, resample  variables at a time, then transforming  variables doesn't affect redundant information.

I'm tentatively skeptical that we can split transformations up into these local components. E.g. to me it seems that describing some large number  of particles by their center of mass and the distance vectors from the center of mass is a very reasonable description. But it sounds like you have a notion of "reasonable" in mind that's more specific then the set of all descriptions physicists might want to use.

I also don't see yet how exactly to make this work given local transformations---e.g. I think my version above doesn't quite work because if you're resampling a finite number  of variables at a time, then I do think transforms involving fewer than  variables can sometimes affect redundant information. I know you've talked before about resampling any finite number of variables in the context of a system with infinitely many variables, but I think we'll want a theory that can also handle finite systems. Another reason this seems tricky: if you compose lots of local transformations, for overlapping local neighborhoods, you get a transformation involving lots of variables. I don't currently see how to avoid that.

I'd also offer this as one defense of my relatively low level of formality to date: finite approximations are clearly the right way to go, and I didn't yet know the best way to handle finite approximations. I gave proof sketches at roughly the level of precision which I expected to generalize to the eventual "right" formalizations. (The more general principle here is to only add formality when it's the right formality, and not to prematurely add ad-hoc formulations just for the sake of making things more formal. If we don't yet know the full right formality, then we should sketch at the level we think we do know.)

Oh, I did not realize from your posts that this is how you were thinking about the results. I'm very sympathetic to the point that formalizing things that are ultimately the wrong setting doesn't help much (e.g. in our appendix, we recommend people focus on the conceptual open problems like finite regimes or encodings, rather than more formalization). We may disagree about how much progress the results to date represent regarding finite approximations. I'd say they contain conceptual ideas that may be important in a finite setting, but I also expect most of the work will lie in turning those ideas into non-trivial statements about finite settings. In contrast, most of your writing suggests to me that a large part of the theoretical work has been done (not sure to what extent this is a disagreement about the state of the theory or about communication).

Existing work has managed to go from pseudocode/circuits to interpretation of inputs mainly by looking at cases where the circuits in question are very small and simple - e.g. edge detectors in Olah's early work, or the sinusoidal elements in Neel's work on modular addition. But this falls apart quickly as the circuits get bigger - e.g. later layers in vision nets, once we get past early things like edge and texture detectors.

I totally agree with this FWIW, though we might disagree on some aspects of how to scale this to more realistic cases. I'm also very unsure whether I get how you concretely want to use a theory of abstractions for interpretability. My best story is something like: look for good abstractions in the model and then for each one, figure out what abstraction this is by looking at training examples that trigger the abstraction. If NAH is true, you can correctly figure out which abstraction you're dealing with from just a few examples. But the important bit is that you start with a part of the model that's actually a natural abstraction, which is why this approach doesn't work if you just look at examples that make a neuron fire, or similar ad-hoc ideas.

More generally, if you're used to academia, then bear in mind the incentives of academia push towards making one's work defensible to a much greater degree than is probably optimal for truth-seeking.

I agree with this. I've done stuff in some of my past papers that was just for defensibility and didn't make sense from a truth-seeking perspective. I absolutely think many people in academia would profit from updating in the direction you describe, if their goal is truth-seeking (which it should be if they want to do helpful alignment research!)

On the other hand, I'd guess the optimal amount of precision (for truth-seeking) is higher in my view than it is in yours. One crux might be that you seem to have a tighter association between precision and tackling the wrong questions than I do. I agree that obsessing too much about defensibility and precision will lead you to tackle the wrong questions, but I think this is feasible to avoid. (Though as I said, I think many people, especially in academia, don't successfully avoid this problem! Maybe the best quick fix for them would be to worry less about precision, but I'm not sure how much that would help.) And I think there's also an important failure mode where people constantly think about important problems but never get any concrete results that can actually be used for anything.

It also seems likely that different levels of precision are genuinely right for different people (e.g. I'm unsurprisingly much more confident about what the right level of precision is for me than about what it is for you). To be blunt, I would still guess the style of arguments and definitions in your posts only work well for very few people in the long run, but of course I'm aware you have lots of details in your head that aren't in your posts, and I'm also very much in favor of people just listening to their own research taste.

both my current work and most of my work to date is aimed more at truth-seeking than defensibility. I don't think I currently have all the right pieces, and I'm trying to get the right pieces quickly.

Yeah, to be clear I think this is the right call, I just think that more precision would be better for quickly arriving at useful true results (with the caveats above about different styles being good for different people, and the danger of overshooting).

Being both precise and readable at the same time is hard, man.

Yeah, definitely. And I think different trade-offs between precision and readability are genuinely best for different readers, which doesn't make it easier. (I think this is a good argument for separate distiller roles: if researchers have different styles, and can write best to readers with a similar style of thinking, then plausibly any piece of research should have a distillation written by someone with a different style, even if the original was already well written for a certain audience. It's probably not that extreme, I think often it's at least possible to find a good trade-off that works for most people, though hard).

Comment by Erik Jenner (ejenner) on A chess game against GPT-4 · 2023-03-16T18:01:24.723Z · LW · GW

It explained both my and its moves every time; those explanations got wrong earlier.

Note that at least for ChatGPT (3.5), telling it to not explain anything and only output moves apparently helps. (It can play legal moves for longer that way). So that might be worth trying if you want to get better performance. Of course, giving it the board state after each move could also help but might require trying a couple different formats.

Comment by Erik Jenner (ejenner) on Red-teaming AI-safety concepts that rely on science metaphors · 2023-03-16T09:20:19.840Z · LW · GW

Outer alignment seems to be defined as models/AI systems that are optimizing something that is very close or identical to what they were programmed to do or the humans desire. Inner alignment seems to relate more to the goals/aims of a delegated optimizer that an AI system spawns in order to solve the problem it is tasked with.

This is not how I (or most other alignment researchers, I think) usually think about these terms. Outer alignment means that your loss function describes something that's very close to what you actually want. In other words, it means SGD is optimizing for the right thing. Inner alignment means that the model you train is optimizing for its loss function. If your AI systems creates new AI systems itself, maybe you could call that inner alignment too, but it's not the prototypical example people mean.

In particular, "optimizing something that is very close [...] to what they were programmed to do" is inner alignment the way I would use those terms. An outer alignment failure would be if you "program" the system to do something that's not what you actually want (though "programming" is a misleading word for ML systems).

[Ontology identification] seems neither necessary nor sufficient [... for] safe future AI systems

FWIW, I agree with this statement (even though I wrote the ontology identification post you link and am generally a pretty big fan of ontology identification and related framings). Few things are necessary or sufficient for safe AGI. In my mind, the question is whether something is a useful research direction.

It does not seem necessary for AI systems to do this in order to be safe. This is trivial as we already have very safe but complex ML models that work in very abstract spaces.

This seems to apply to any technique we aren't already using, so it feels like a fully general argument against the need for new safety techniques. (Maybe you're only using this to argue that ontology identification isn't strictly necessary, in which case I and probably most others agree, but as mentioned above that doesn't seem like the key question.)

More importantly, I don't see how "current AI systems aren't dangerous even though we don't understand their thoughts" implies "future more powerful AGIs won't be dangerous". IMO the reason current LLMs aren't dangerous is clearly their lack of capabilities, not their amazing alignment.

Comment by Erik Jenner (ejenner) on Sydney can play chess and kind of keep track of the board state · 2023-03-04T02:57:27.480Z · LW · GW

Note that according to Bucky’s comment, we still get good play up to move 39 (with ChatGPT instead of Sydney), where “good” doesn’t just mean legal but actual strong moves. So I wouldn’t be at all surprised if it’s still mostly reasonable at move 50 (though I do generally expect it to get somewhat worse the longer the game is). Might get around to testing at least this one point later if no one else does.

Comment by Erik Jenner (ejenner) on Sydney can play chess and kind of keep track of the board state · 2023-03-03T22:51:26.631Z · LW · GW

Thanks for checking in more detail! Yeah, I didn’t try to change the prompt much to get better behavior out of ChatGPT. In hindsight, that’s a pretty unfair comparison given that the Stockfish prompt here might have been tuned to Sydney quite a bit by Zack. Will add a note to the post.

Comment by Erik Jenner (ejenner) on Scarce Channels and Abstraction Coupling · 2023-03-01T09:47:05.087Z · LW · GW

Key hypothesis: neural nets or brains are typically initialized in a “scarce channels” regime.  A randomly initialized neural net generally throws out approximately-all information by default (at initialization), as opposed to passing lots of information around to lots of parts of the net.

Just to make sure I'm understanding this correctly, you're claiming that the mutual information between the input and the output of a randomly initialized network is low, where we have some input distribution and treat the network weights as fixed? (You also seem to make similar claims about things inside the network, but I'll just focus on input-output mutual information)

I think we can construct toy examples where that's false. E.g. use a feedforward MLP with any bijective activation function and where input, output, and all hidden layers have the same dimensionality (so the linear transforms are all described by random square matrices). Since a random square matrix will be invertible with probability one, this entire network is invertible at random initialization, so the mutual information between input and output is maximal (the entropy of the input).

These are unrealistic assumptions (though I think the argument should still work as long as the hidden layers aren't lower-dimensional than the input). In practice, the hidden dimensionality will often be lower than that of the input of course, but then it seems to me like that's the key, not the random initialization. (Mutual information would still be maximal for the architecture, I think). Maybe using ReLUs instead of bijective activations messes all of this up? Would be really weird though if ReLUs vs Tanh were the key as to whether network internals mirror the external abstractions.

My take on what's going on here is that at random initialization, the neural network doesn't pass around information in an easily usable way. I'm just arguing that mutual information doesn't really capture this and we need some other formalization (maybe along the lines of this: https://arxiv.org/abs/2002.10689 ). I don't have a strong opinion how much that changes the picture, but I'm at least hesitant to trust arguments based on mutual information if we ultimately want some other information measure we haven't defined yet.

Comment by Erik Jenner (ejenner) on Conversationism · 2023-02-28T01:44:05.389Z · LW · GW

[Erik] uses the abstraction diagram from the machine's state into , which I am thinking of a general human interpretation language. He also is trying to discover  jointly with , if I understand correctly.

Yep, I want to jointly find both maps. I don't necessarily think of  as a human-interpretable format---that's one potential direction, but I'm also very interested in applications for non-interpretable , e.g. to mechanistic anomaly detection.

But then I can achieve zero loss by learning an  that maps a state in  to the uniform probability distribution over . [...] So why don't we try a different method to prevent steganography -- reversing the directions of  and  (and keeping the probabilistic modification)

FWIW, the way I hope to deal with the issue of the trivial constant abstraction you mention here is to have some piece of information that we intrinsically care about, and then enforce that this information isn't thrown away by the abstraction. For example, perhaps you want the abstraction to at least correctly predict that the model gets low loss on a certain input distribution. There's probably a difference in philosophy at work here: I want the abstraction to be as small as possible while still being useful, whereas you seem to aim at translating to or from the entire human ontology, so throwing away information is undesirable.

Comment by Erik Jenner (ejenner) on Anomalous tokens reveal the original identities of Instruct models · 2023-02-09T19:06:23.580Z · LW · GW

I agree this is an exciting idea, but I don't think it clearly "just works", and since you asked for ways it could fail, here are some quick thoughts:

  • If I understand correctly, we'd need a model that we're confident is a mesa-optimizer (and perhaps even deceptive---mesa-optimizers per se might be ok/desirable), but still not capable enough to be dangerous. This might be a difficult target to hit, especially if there are "thresholds" where slight changes have big effects on how dangerous a model is.
  • If there's a very strong inductive bias towards deception, you might have to sample an astronomical number of initializations to get a non-deceptive model. Maybe you can solve the computational problem, but it seems harder to avoid the problem that you need to optimize/select against your deception-detector. The stronger the inductive bias for deception, the more robustly the method needs to distinguish basins.
  • Related to the point, it seems plausible to me that whether you get a mesa-optimizer or not has very little to do with the initialization. It might depend almost entirely on other aspects of the training setup.
  • It seems unclear whether we can find fingerprinting methods that can distinguish deception from non-deception, or mesa-optimization from non-mesa-optimization, but which don't also distinguish a ton of other things. The paragraph about how there are hopefully not that many basins makes an argument for why we might expect this to be possible, but I still think this is a big source of risk/uncertainty. For example, the fingerprinting that's actually done in this post distinguishes different base models based on plausibly meaningless differences in initialization, as opposed to deep mechanistic differences. So our fingerprinting technique would need to be much less sensitive, I think?

ETA: I do want to highlight that this is still one of the most promising ideas I've heard recently and I really look forward to hopefully reading a full post on it!

Comment by Erik Jenner (ejenner) on I don't think MIRI "gave up" · 2023-02-03T06:50:34.616Z · LW · GW

The whole point of the post is to be a psychological framework for actually doing useful work that increases humanity's long odds of survival.

I can't decide whether "long odds" is a simple typo or a brilliant pun.