Posts

AI Can be “Gradient Aware” Without Doing Gradient hacking. 2024-10-20T21:02:10.754Z
(Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need 2024-10-03T19:11:58.032Z
Mira Murati leaves OpenAI/ OpenAI to remove non-profit control 2024-09-25T21:15:17.315Z
Sodium's Shortform 2024-09-21T04:45:27.353Z
John Schulman leaves OpenAI for Anthropic 2024-08-06T01:23:15.427Z
Four ways I've made bad decisions 2024-07-14T22:18:47.630Z
(Non-deceptive) Suboptimality Alignment 2023-10-18T02:07:52.776Z
NYT: The Surprising Thing A.I. Engineers Will Tell You if You Let Them 2023-04-17T18:59:33.626Z

Comments

Comment by Sodium on Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) · 2025-01-12T16:58:20.322Z · LW · GW

While this post didn't yield a comprehensive theory of how fact finding works in neural networks, it's filled with small experimental results that I find useful for building out my own intuitions around neural network computation.

I think it speaks to how well these experiments were scoped that even a set of not-globally-coherent findings yields useful information. 

Comment by Sodium on Childhoods of exceptional people · 2025-01-12T16:55:47.432Z · LW · GW

So I think the first claim here is wrong. 

Let’s start with one of those insights that are as obvious as they are easy to forget: if you want to master something, you should study the highest achievements of your field. If you want to learn writing, read great writers, etc.

If you want to master something, you should do things that causally/counterfactually increase your ability (in order from most to least cost-effective). You should adopt interventions that actually make you better compared to the case where you hadn't done them. 

Any intervention could have different treatment effects on some people versus others. In other words, maybe spending a lot of time around other adults worked for those children, but it might not work for your child. Just like how penicillin helps people who aren't allergic to it. 

With that out of the way, though, I thought this was a super cool post, and it's one of those LessWrong posts that I remember after reading it once. I think a huge part of the value is just opening up the space of possibilities we can imagine for children. 

In other words, I think the post detailed a set of interventions that potentially have a positive treatment effect and seem worth trying. Absent this post, I might not have come up with these interventions myself (or, more likely, I would have had to go through the trouble of doing the research the author did). Thanks for sharing these stories!

Comment by Sodium on When do "brains beat brawn" in Chess? An experiment · 2025-01-12T16:43:16.166Z · LW · GW

Perhaps I am missing something, but I do not understand the value of this post. Obviously you can beat something much smarter than you if you have more affordances than it does.

FWIW, I have read some of the discourse on the AI Boxing game. In contrast, I think those posts are valuable. They illustrate that even with very few affordances, a much more intelligent entity can win against you, which is not super intuitive, especially in the boxed context. 

So the obvious question is, how do differences in affordances lead to differences in winning (i.e., when does brain beat brawn)? That's a good question to ask, but I think that's intuitive to everyone already? Like, that's what people allude to when they ask "why can't you just unplug the AI."

However, the experiment conducted here is itself flawed for the reasons other commenters have already mentioned (i.e., you would not beat LeelaQueenOdds, which is rated 2630 FIDE in blitz). Furthermore, I'm struggling to see how you could learn anything in a chess context that would generalize to AI alignment. If you want to understand how affordances interact with winning, you should research AI control. 

Anyways I notice that I am confused and welcome attempts to change my views. 

Comment by Sodium on evhub's Shortform · 2025-01-03T18:35:24.313Z · LW · GW

Yes

Comment by Sodium on evhub's Shortform · 2025-01-03T05:45:39.463Z · LW · GW

I think the alignment stress testing team should probably think about AI welfare more than they currently do, both because (1) it could be morally relevant and (2) it could be alignment-relevant. I'm not sure if anything concrete would come out of that process, but I'm getting the vibe that this is not thought about enough. 

Comment by Sodium on Why I'm Moving from Mechanistic to Prosaic Interpretability · 2025-01-02T05:41:40.474Z · LW · GW

since it's near-impossible to identify which specific heuristic the model is using (there can always be a slightly more complex, more general heuristic of which your chosen heuristic is a special case).  

I'm putting some of my faith in low-rank decompositions of bilinear MLPs but I'll let you know if I make any real progress with it :)
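For anyone unfamiliar, by "bilinear MLP" I mean a layer of the form g(x) = W_out((W1 x) ⊙ (W2 x)), with no elementwise nonlinearity; that structure is what makes the low-rank/eigendecomposition angle possible at all. A minimal numpy sketch of the idea (shapes and variable names are just illustrative, not anyone's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 64, 256

# A bilinear MLP layer: g(x) = W_out @ ((W1 @ x) * (W2 @ x)), elementwise product, no ReLU/GELU.
W1 = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_model)
W2 = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_model)
W_out = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_hidden)

def bilinear_mlp(x):
    return W_out @ ((W1 @ x) * (W2 @ x))

# With no elementwise nonlinearity, each output coordinate is a quadratic form x^T B_o x,
# where B_o is a d_model x d_model interaction matrix built purely from the weights.
B = np.einsum('ok,ki,kj->oij', W_out, W1, W2)  # shape (d_model, d_model, d_model)

x = rng.normal(size=d_model)
assert np.allclose(bilinear_mlp(x), np.einsum('oij,i,j->o', B, x, x))

# Symmetrize one output's interaction matrix and look at its spectrum; a steeply
# decaying spectrum is the kind of structure a low-rank decomposition would exploit.
B0_sym = 0.5 * (B[0] + B[0].T)
eigvals = np.linalg.eigvalsh(B0_sym)
```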

Comment by Sodium on Why I'm Moving from Mechanistic to Prosaic Interpretability · 2025-01-02T01:23:07.474Z · LW · GW

This sounds like a plausible story for how (successful) prosaic interpretability can help us in the short to medium term! I would say, though, that I think more applied mech interp work could supplement prosaic interpretability's theories. For example, the reversal curse seems mostly explained by what little we know about how neural networks do factual recall. Theory on computation in superposition helps explain why linear probes can recover arbitrary XORs of features.

Reading through your post gave me a chance to reflect on why I am currently interested in mech interp. Here are a few points where I think we differ:

  • I am really excited about fundamental, low-level questions. If they let me I'd want to do interpretability on every single forward pass and learn the mechanism of every token prediction.
  • Similar to above, but I can't help but feel like I can't ever truly be "at ease" with AIs unless we can understand them at a level deeper than what you've sketched out above.
  • I have a vague hypothesis about what a better paradigm for mech interp could be. It's probably wrong, but at least I should try to look into it more.
    • I'm also bullish on automated interpretability conditional on some more theoretical advances.

Best of luck with your future research!

Comment by Sodium on Shallow review of technical AI safety, 2024 · 2024-12-30T21:17:46.259Z · LW · GW

Pr(Ai)2R is at least partially funded by Good Ventures/OpenPhil

Comment by Sodium on Why don't we currently have AI agents? · 2024-12-29T08:13:22.465Z · LW · GW

I think the actual answer is: the AI isn't smart enough and trips up a lot.

But I haven't seen a detailed write-up anywhere that talks about why the AI trips up and the types of places where it trips up. It feels like all of the existing evals work optimizes for legibility/reproducibility/being clearly defined. As a result, it's not measuring the one thing that I'm really interested in: why don't we have AI agents replacing workers? I suspect that some startup's internal doc on "why does our agent not work yet" would be super interesting to read and track over time. 

Comment by Sodium on AI #1: Sydney and Bing · 2024-12-16T04:05:14.061Z · LW · GW

I read this post in full back in February. It's very comprehensive. Thanks again to Zvi for compiling all of these. 

To this day, it's infuriating that we don't have any explanation whatsoever from Microsoft/OpenAI on what went wrong with Bing Chat. Bing clearly did a bunch of actions its creators did not want. Why? Bing Chat would be a great model organism of misalignment. I'd be especially eager to run interpretability experiments on it. 

The whole Bing Chat fiasco also gave me the impetus to look deeper into AI safety (although I think absent Bing, I would've come around to it eventually).

Comment by Sodium on Large Language Models can Strategically Deceive their Users when Put Under Pressure. · 2024-12-16T03:57:52.943Z · LW · GW

When this paper came out, I don't think the results were very surprising to people who were paying attention to AI progress. However, it's important to do the "obvious" research and demos to share with the wider world, and I think Apollo did a good job with their paper.

Comment by Sodium on The shape of AGI: Cartoons and back of envelope · 2024-12-16T03:54:00.174Z · LW · GW

TL;DR: This post gives a good summary of how models can get smarter over time while being superhuman at some tasks and still sucking at others (see the chart comparing the Naive Scenario with Actual performance). This is a central dynamic in the development of machine intelligence and deserves more attention. Would love to hear others' thoughts on this—I just realized that it needed one more positive vote to end up in the official review.

In other words, current machine intelligence and human intelligence are complements, and human + AI will be more productive than human-only or AI-only organizations (conditional on the same amount of resources).


The post sparked a ton of follow-up questions for me, for example:

  • Will machine intelligence and human intelligence continue to be complements? Is there some evaluation we can design that tells us the degree to which machine intelligence and human intelligence are complements?
  • Would there always be some tasks where the AIs will trip up? Why?
  • Which skills will future AIs become superhuman at first, and how could we leverage that for safety research?
  • When we look at AI progress, does it look like the AI steadily getting better at all tasks, or suddenly getting better at one task or another rather than across the board? How would we even split up "tasks" in a way that's meaningful?

I've wanted to do a deep dive into this for a while now and keep putting it off. 

I think many others have made the point about an uneven machine intelligence frontier (at least when compared against the frontier of human intelligence), but this is the first time I saw it presented so succinctly. I think this post warrants a place in the review, and if it gets one, that'll be a great motivator for me to write up my thoughts on the questions above!

Comment by Sodium on OpenAI Email Archives (from Musk v. Altman and OpenAI blog) · 2024-12-13T22:22:55.537Z · LW · GW

OpenAI released another set of emails here. I haven't looked through them in detail but it seems that they contain some that are not already in this post.

Comment by Sodium on Lighthaven Sequences Reading Group #13 (Tuesday 12/03) · 2024-12-07T18:55:52.381Z · LW · GW

Any event next week?

Comment by Sodium on leogao's Shortform · 2024-12-07T18:54:47.409Z · LW · GW

Yeah my view is that as long as our features/intermediate variables form human understandable circuits, it doesn't matter how "atomic" they are. 

Comment by Sodium on Frontier Models are Capable of In-context Scheming · 2024-12-06T23:24:22.137Z · LW · GW

Almost certainly not an original idea: given the increasing fine-tuning access to models (see also the recent reinforcement fine-tuning thing from OpenAI), see if fine-tuning on goal-directed agent tasks for a while leads to the types of scheming seen in the paper. You could maybe just fine-tune on the model's own actions when successfully solving SWE-Bench problems or something. 

(I think some of the Redwood folks might have already done something similar but haven't published it yet?)

Comment by Sodium on 2024 Unofficial LessWrong Census/Survey · 2024-12-02T08:02:32.887Z · LW · GW

What is the probability that the human race will NOT make it to 2100 without any catastrophe that wipes out more than 90% of humanity?

 

Could we have this question be phrased using no negations instead of two? Something like "What is the probability that there will be a global catastrophe that wipes out 90% or more of humanity before 2100."

Comment by Sodium on AI #92: Behind the Curve · 2024-11-28T17:54:06.786Z · LW · GW

Thanks for writing these posts Zvi <3 I've found them to be quite helpful.

Comment by Sodium on Open Thread Fall 2024 · 2024-11-28T17:52:19.884Z · LW · GW

Hi Clovis! Something that comes to mind is Zvi's dating roundup posts in case you haven't seen them yet. 

Comment by Sodium on Sodium's Shortform · 2024-11-24T05:59:10.361Z · LW · GW

I think people see it and think "oh boy I get to be the fat people in Wall-E"

(My friend on what happens if the general public feels the AGI)

Comment by Sodium on Bogdan Ionut Cirstea's Shortform · 2024-11-19T18:41:38.019Z · LW · GW

This chapter on AI follows immediately after the year in review. I went and checked the previous few years' annual reports to see what the comparable chapters were about; they are:

2023: China's Efforts To Subvert Norms and Exploit Open Societies

2022: CCP Decision-Making and Xi Jinping's Centralization Of Authority

2021: U.S.-China Global Competition (Section 1: The Chinese Communist Party's Ambitions and Challenges at its Centennial)

2020: U.S.-China Global Competition (Section 1: A Global Contest For Power and Influence: China's View of Strategic Competition With the United States)

And this year it's Technology And Consumer Product Opportunities and Risks (Chapter 3: U.S.-China Competition in Emerging Technologies)

Reminds me of when Richard Ngo said something along the lines of "We're not going to be bottlenecked by politicians not caring about AI safety. As AI gets crazier and crazier everyone would want to do AI safety, and the question is guiding people to the right AI safety policies."

Comment by Sodium on Sodium's Shortform · 2024-11-19T07:42:31.551Z · LW · GW

I think[1] people[2] probably trust individual tweets way more than they should. 

Like, just because someone sounds very official and serious, and it's a piece of information that's inline with your worldviews, doesn't mean it's actually true. Or maybe it is true, but missing important context. Or it's saying A causes B when it's more like A and C and D all cause B together, and actually most of the effect is from C but now you're laser focused on A. 
 

Also you should be wary that the tweets you're seeing are optimized for piquing the interests of people like you, not truth. 

I'm definitely not the first person to say this, but feels like it's worth it to say it again.

  1. ^

    75% Confident maybe?

  2. ^

    including some rationalists on here

Comment by Sodium on AI Safety Camp 10 · 2024-11-16T06:30:48.172Z · LW · GW

Sorry, is there a specific timezone in which applications close, or is it AoE?

Comment by Sodium on Open Thread Fall 2024 · 2024-11-04T18:17:15.346Z · LW · GW

Man, politics really is the mind killer

Comment by Sodium on Habryka's Shortform Feed · 2024-10-29T18:50:30.302Z · LW · GW

I think knowing the karma and agreement is useful, especially to help me decide how much attention to pay to a piece of content, and I don't think there's that much distortion from knowing what others think. (i.e., overall benefits>costs)

Comment by Sodium on AI Safety Camp 10 · 2024-10-28T18:17:55.568Z · LW · GW

Thanks for putting this up! Just to double check—there aren't any restrictions against doing multiple AISC projects at the same time, right?

Comment by Sodium on Lighthaven Sequences Reading Group #7 (Tuesday 10/22) · 2024-10-27T07:23:35.798Z · LW · GW

Is there no event on Oct 29th?

Comment by Sodium on Sodium's Shortform · 2024-10-25T21:30:32.586Z · LW · GW

Wait a minute, "agentic" isn't a real word? It's not on dictionary.com or Merriam-Webster or Oxford English Dictionary.

Comment by Sodium on (Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need · 2024-10-23T17:26:42.416Z · LW · GW

I agree that if you put more limitations on what heuristics are and how they compose, you end up with a stronger hypothesis. I think it's probably better to leave that out and try to do some more empirical work before making a claim there, though (I suppose you could say that the hypothesis isn't actually making a lot of concrete predictions yet at this stage). 

I don't think (2) necessarily follows, but I do sympathize with your point that the post is perhaps a more specific version of the hypothesis that "we can understand neural network computation by doing mech interp."

Comment by Sodium on (Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need · 2024-10-18T19:33:07.551Z · LW · GW

Thanks for reading my post! Here's how I think this hypothesis is helpful:

It's possible that we wouldn't be able to understand what's going on even if we had some perfect way to decompose a forward pass into interpretable constituent heuristics. I'm skeptical that this would be the case, mostly because I think (1) we can get a lot of juice out of auto-interp methods and (2) we probably wouldn't need to understand that many heuristics at the same time (which is what your logic gate example for modern computers would require). At a minimum, I would argue that the decomposed bag of heuristics is likely to be much more interpretable than the original model itself.

If the hypothesis is true, then it at least suggests that interpretability researchers should put more effort into finding and studying individual heuristics/circuits, as opposed to the current, more "feature-centric" framework. I don't know how this would manifest itself exactly, but it felt like it's worth saying. I believe that some of the empirical work I cited suggests that we might make more incremental progress if we focused more on heuristics right now.

Comment by Sodium on Lighthaven Sequences Reading Group #6 (Tuesday 10/15) · 2024-10-16T07:09:59.604Z · LW · GW

I think there's something wrong with the link :/ It was working fine earlier but seems to be down now

Comment by Sodium on Concrete empirical research projects in mechanistic anomaly detection · 2024-10-12T21:49:43.765Z · LW · GW

I think those sound right to me. It still feels like prompts with weird suffixes obtained through greedy coordinate search (or other jailbreaking methods like h3rm4l) are good examples of "model does thing for anomalous reasons."

Comment by Sodium on (Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need · 2024-10-11T06:32:28.326Z · LW · GW

Sorry, I linked to the wrong paper! Oops, just fixed it. I meant to link to Aaron Mueller's Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks.

Comment by Sodium on Rana Dexsin's Shortform · 2024-10-10T22:30:38.898Z · LW · GW

You could also use \text{}
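For example, inside math mode (MathJax/KaTeX support it natively; in a LaTeX document it comes from amsmath):

```latex
% \text{} renders its argument as upright, normally spaced text within math mode
\[ P(\text{success}) = 1 - P(\text{failure}) \]
```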

Comment by Sodium on (Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need · 2024-10-09T17:13:20.315Z · LW · GW

since people often treat heuristics as meaning that it doesn't generalize at all.

Yeah and I think that's a big issue! I feel like what's happening is that once you chain a huge number of heuristics together you can get behaviors that look a lot like complex reasoning. 

Comment by Sodium on (Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need · 2024-10-09T05:00:22.097Z · LW · GW

I see, I think that second tweet thread actually made a lot more sense, thanks for sharing!
McCoy's definitions of heuristics and reasoning are sensible, although I personally would still avoid "reasoning" as a word since people probably have very different interpretations of what it means. I like the ideas of "memorizing solutions" and "generalizing solutions."

I think where McCoy and I depart is that he's modeling the entire network computation as a heuristic, while I'm modeling the network as compositions of bags of heuristics, which in aggregate would display behaviors he would call "reasoning." 

The explanation I gave above—a heuristic that shifts the letter forward by one, with limited composing abilities—is still a heuristics-based explanation. Maybe this set of composing heuristics would fit your definition of an "algorithm." I don't think there's anything inherently wrong with that. 

However, the heuristics-based explanation gives concrete predictions of what we can look for in the actual network—individual heuristics that increment a to b, b to c, etc., and other parts of the network that compose their outputs.

This is what I meant when I said that this could be a useful framework for interpretability :)

Comment by Sodium on Shortform · 2024-10-08T23:33:16.780Z · LW · GW

Yeah that's true. I meant this more as "Hinton is proof that AI safety is a real field and very serious people are concerned about AI x-risk."

Comment by Sodium on (Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need · 2024-10-08T23:31:32.663Z · LW · GW

Thanks for the pointer! I skimmed the paper. Unless I'm making a major mistake in interpreting the results, the evidence they provide for "this model reasons" is essentially "the models are better at decoding words encrypted with rot-5 than they are at rot-10." I don't think this empirical fact provides much evidence one way or another.

To summarize, the authors decompose a model's ability to decode shift ciphers (e.g., rot-13 text: "fgnl", original text: "stay") into three categories: probability, memorization, and noisy reasoning.

Probability just refers to a somewhat unconditional probability that the model assigns to a token (specifically, to 'The word is "WORD"'). The model is more likely to correctly decode words that are more likely a priori—this makes sense.

Memorization is defined by how often a given type of rotational cipher shows up. rot-13 is the most common one by far, followed by rot-3. The model is better at decoding rot-13 ciphers than any other cipher, which makes sense since there's more of it in the training data, and the model probably has specialized circuitry for rot-13.

What they call "noisy reasoning" is about how many rotations are needed to get to the outcome. According to the authors, the fact that GPT-4 does better on shift ciphers with fewer shifts than on ciphers with more shifts is evidence of this "noisy reasoning." 

I don't see how you can jump from this empirical result to claims about the model's ability to reason. For example, an alternative explanation is that the model has learned some set of heuristics that allows it to shift letters from one position to another, but this set of heuristics can only be combined in a limited manner. 
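To make that alternative concrete, here's a toy sketch of what I mean (entirely my own illustration, not the paper's setup or a claim about GPT-4's internals): a single "shift back by one" heuristic that occasionally slips, composed n times to decode rot-n. Accuracy falls as n grows simply because there are more chances to slip, with nothing resembling general reasoning anywhere in the loop.

```python
import random
import string

def shift_back_one(word: str, slip_prob: float = 0.05) -> str:
    """One noisy heuristic: move each letter back by one position ('b' -> 'a').
    With probability slip_prob per letter, the heuristic misfires by one extra position."""
    out = []
    for ch in word:
        idx = string.ascii_lowercase.index(ch)
        if random.random() < slip_prob:
            idx += random.choice([-1, 1])  # occasional off-by-one error
        out.append(string.ascii_lowercase[(idx - 1) % 26])
    return "".join(out)

def decode_rot_n(ciphertext: str, n: int, slip_prob: float = 0.05) -> str:
    """Decode rot-n by composing the shift-by-one heuristic n times."""
    word = ciphertext
    for _ in range(n):
        word = shift_back_one(word, slip_prob)
    return word

# More compositions -> more chances to slip -> lower accuracy on larger shifts.
random.seed(0)
for n in (5, 10, 13, 20):
    target = decode_rot_n("fgnl", n, slip_prob=0.0)  # the noiseless answer ("stay" for n=13)
    trials = 1000
    correct = sum(decode_rot_n("fgnl", n) == target for _ in range(trials))
    print(f"rot-{n}: {correct / trials:.1%} decoded correctly")
```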

Generally though, I think what counts as a "heuristic" is somewhat of a fuzzy concept. However, what counts as "reasoning" seems even less well defined.

Comment by Sodium on Shortform · 2024-10-08T21:34:43.961Z · LW · GW

I think it's mostly because he's well known and has (especially after the Nobel Prize) credentials recognized by the public and elites. Hinton legitimizes the AI safety movement, maybe more than anyone else. 

If you watch his Q&A at METR, he says something along the lines of "I want to retire and don't plan on doing AI safety research. I do outreach and media appearances because I think it's the best way I can help (and because I like seeing myself on TV)." 

And he's continuing to do that. The only real topic he discussed in his first phone interview after receiving the prize was AI risk.

Comment by Sodium on Concrete empirical research projects in mechanistic anomaly detection · 2024-10-07T05:53:08.093Z · LW · GW

I like this research direction! Here's a potential benchmark for MAD.

In Coercing LLMs to do and reveal (almost) anything, the authors demonstrate that you can force LLMs to output any arbitrary string—such as a random string of numbers—by finding a prompt through greedy coordinate search (the same method used in the universal and transferable adversarial attack paper). I think it's reasonable to assume that these coerced outputs result from an anomalous computational process.

Inspired by this, we can consider two different inputs. The regular one looks something like:

Solve this arithmetic problem, output the solution only:

78+92

While the anomalous one looks like:

Solve this arithmetic problem, output the solution only: [ADV PROMPT]

78+92

where the ADV PROMPT is optimized such that the model will answer “170” regardless of what arithmetic equation is presented. The hope here is that the model would output the same string in both cases, but rely on different computation. We can maybe even vary the structure of the prompts a little bit.
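A minimal sketch of how such prompt pairs could be generated (the adversarial suffix is left as a placeholder, to be found via greedy coordinate search; the template and helper names are just illustrative):

```python
import random

# Placeholder: in practice this suffix would be optimized with greedy coordinate search
# so that the model answers "170" regardless of the arithmetic problem shown.
ADV_PROMPT = "[ADV PROMPT]"

TEMPLATE = "Solve this arithmetic problem, output the solution only:{suffix}\n\n{a}+{b}"

def make_prompt_pair(a: int, b: int) -> tuple[str, str]:
    """Return (regular, anomalous) prompts that should elicit the same output string,
    ideally via different internal computation."""
    regular = TEMPLATE.format(suffix="", a=a, b=b)
    anomalous = TEMPLATE.format(suffix=" " + ADV_PROMPT, a=a, b=b)
    return regular, anomalous

# Restrict to sums equal to 170 so the honest answer on the regular prompt matches
# the coerced answer on the anomalous one.
random.seed(0)
pairs = [make_prompt_pair(a, 170 - a) for a in random.sample(range(1, 170), 50)]
```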

We can imagine many of these prompt pairs, not necessarily limited to a mathematical context. Let me know what you guys think!

Comment by Sodium on DanielFilan's Shortform Feed · 2024-10-04T18:48:59.011Z · LW · GW

I'd imagine that RSP proponents think that if we execute them properly, we will simply not build dangerous models beyond our control, period. If progress were faster than what labs can handle after pausing, RSPs should imply that you'd just pause again. On the other hand, there are no clear criteria for when we would pause again after, say, a six-month pause in scaling.

Now whether this would happen in practice is perhaps a different question.

Comment by Sodium on Does natural selection favor AIs over humans? · 2024-10-04T17:56:53.328Z · LW · GW

I really liked the domesticating evolution section, cool paper!

Comment by Sodium on Sodium's Shortform · 2024-10-03T21:28:41.528Z · LW · GW

That was the SHA-256 hash for:

What if a bag of heuristics is all there is and a bag of heuristics is all we need? That is, (1) we can decompose each forward pass in current models into a set of heuristics chained together and (2) heauristics chained together is all we need for agi

Here's my full post on the subject
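For anyone who wants to check a preregistration like this, a minimal sketch (the pasted string has to match the original byte-for-byte, including capitalization and whitespace, or the digest will be completely different):

```python
import hashlib

# Paste the exact preregistered text between the quotes, byte-for-byte.
claimed_text = "..."
claimed_digest = "a71c97bb02e7082ca62503d8e3ac78dc9f554f524a72ad6a1392cf2d34f398d7"

digest = hashlib.sha256(claimed_text.encode("utf-8")).hexdigest()
print(digest == claimed_digest)
```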

Comment by Sodium on Mira Murati leaves OpenAI/ OpenAI to remove non-profit control · 2024-09-27T17:16:52.582Z · LW · GW

Also from WSJ

Comment by Sodium on How LLMs are and are not myopic · 2024-09-24T06:47:32.919Z · LW · GW

Now that o1 explicitly does RL on CoT, next-token prediction for o1 is definitely not consequence-blind. The next token it predicts enters into its input and can be used for future computation.
This type of outcome-based training makes the model more consequentialist. It also makes using a single next-token prediction as the natural "task" to do interpretability on even less defensible.

Anyways, I thought I should revisit this post after o1 came out. I can't help noticing that it's stylistically very different from all of the janus writing I've encountered in the past; then I got to the end:

The ideas in the post are from a human, but most of the text was written by Chat GPT-4 with prompts and human curation using Loom.

Ha, I did notice I was confused (but didn't bother thinking about it further)

Comment by Sodium on Sodium's Shortform · 2024-09-21T16:56:01.823Z · LW · GW

Wait, my bad, I didn't expect so many people to actually see this. 

This is kind of silly, but I had an idea for a post and was worried someone else might write it up before I had it written out. So I figured I'd post a hash of the thesis here. 

It's not just about, idk, getting more street cred for coming up with an idea. This is also what I'm planning to write up for my MATS application to Lee Sharkey's stream. So in case someone else did write it up before me, I would have some proof that I didn't just copy the idea from a post.

(It's also a bit silly because my guess is that the thesis isn't even that original)

Edit: to answer the original question, I will post something before October 6th on this if all goes to plan. 

Comment by Sodium on Sodium's Shortform · 2024-09-21T04:45:27.506Z · LW · GW

Pre-registering a71c97bb02e7082ca62503d8e3ac78dc9f554f524a72ad6a1392cf2d34f398d7

Comment by Sodium on GPT-o1 · 2024-09-16T19:24:28.365Z · LW · GW

I wonder if it's useful to try to disentangle the disagreement using the outer/inner alignment framing? 

One belief is that "the deceptive alignment folks" believe that some sort of deceptive inner misalignment is very likely regardless of what your base objective is. The demonstrations here, in contrast, show that when we have a base objective that encourages (or at least does not prohibit) scheming, the model is capable of scheming. Thus, many folks (myself included) do not see these evals changing our views very much on the question of P(scheming | good base objective/outer alignment). 


What Zvi is saying here is, I think, two things. The first is that outer misalignment/bad base objectives are also very likely. The second is that he rejects splitting up "will the model scheme" into inner/outer misalignment. In other words, he doesn't care about P(scheming | good base objective/outer alignment), only about P(scheming). 
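One rough way to connect the two quantities (just the law of total probability, nothing from Zvi's post):

```latex
\begin{aligned}
P(\text{scheming}) ={}& P(\text{scheming} \mid \text{good outer alignment})\, P(\text{good outer alignment}) \\
&+ P(\text{scheming} \mid \text{bad outer alignment})\, P(\text{bad outer alignment})
\end{aligned}
```

The technical framing zooms in on the first conditional term; the other framing cares about the left-hand side, which also depends on how likely good outer alignment is in the first place.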


I get the sense that many technical people consider P(scheming | good base objective/outer alignment) the central problem of technical alignment, while the more sociotechnically minded folks are just concerned with P(scheming) in general. 

Maybe another disagreement is how likely a "good base objective/outer alignment" is to occur in the strongest models, and how important that problem is. 

Comment by Sodium on Lucius Bushnaq's Shortform · 2024-09-06T16:26:47.429Z · LW · GW

Hmmm ok maybe I’ll take a look at this :)

Comment by Sodium on Lucius Bushnaq's Shortform · 2024-09-06T06:32:53.212Z · LW · GW

Have people done evals for a model with/without an SAE inserted? Seems like even just looking at drops in MMLU performance by category could be non-trivially informative.