Posts

Veo-2 Can Produce Realistic Ads 2025-01-21T19:13:32.884Z
[Exercise] Four Examples of Noticing Confusion 2025-01-18T15:29:23.293Z
How do you deal w/ Super Stimuli? 2025-01-14T15:14:51.552Z
When AI 10x's AI R&D, What Do We Do? 2024-12-21T23:56:11.069Z
Logan Riggs's Shortform 2024-12-04T14:52:55.820Z
Book a Time to Chat about Interp Research 2024-12-03T17:27:46.808Z
Evaluating Sparse Autoencoders with Board Game Models 2024-08-02T19:50:21.525Z
Interpreting Preference Models w/ Sparse Autoencoders 2024-07-01T21:35:40.603Z
Was Releasing Claude-3 Net-Negative? 2024-03-27T17:41:56.245Z
Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features 2024-03-15T16:30:00.744Z
Finding Sparse Linear Connections between Features in LLMs 2023-12-09T02:27:42.456Z
Sparse Autoencoders: Future Work 2023-09-21T15:30:47.198Z
Sparse Autoencoders Find Highly Interpretable Directions in Language Models 2023-09-21T15:30:24.432Z
Really Strong Features Found in Residual Stream 2023-07-08T19:40:14.601Z
(tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders 2023-07-05T16:49:43.822Z
[Replication] Conjecture's Sparse Coding in Small Transformers 2023-06-16T18:02:34.874Z
[Replication] Conjecture's Sparse Coding in Toy Models 2023-06-02T17:34:24.928Z
[Simulators seminar sequence] #2 Semiotic physics - revamped 2023-02-27T00:25:52.635Z
Making Implied Standards Explicit 2023-02-25T20:02:50.617Z
Proposal for Inducing Steganography in LMs 2023-01-12T22:15:43.865Z
[Simulators seminar sequence] #1 Background & shared assumptions 2023-01-02T23:48:50.298Z
Results from a survey on tool use and workflows in alignment research 2022-12-19T15:19:52.560Z
A descriptive, not prescriptive, overview of current AI Alignment Research 2022-06-06T21:59:22.344Z
Frame for Take-Off Speeds to inform compute governance & scaling alignment 2022-05-13T22:23:12.143Z
Alignment as Constraints 2022-05-13T22:07:49.890Z
Make a Movie Showing Alignment Failures 2022-04-13T21:54:50.764Z
Convincing People of Alignment with Street Epistemology 2022-04-12T23:43:57.873Z
Roam Research Mobile is Out! 2022-04-08T19:05:40.211Z
Convincing All Capability Researchers 2022-04-08T17:40:25.488Z
Language Model Tools for Alignment Research 2022-04-08T17:32:33.230Z
5-Minute Advice for EA Global 2022-04-05T22:33:04.087Z
A survey of tool use and workflows in alignment research 2022-03-23T23:44:30.058Z
Some (potentially) fundable AI Safety Ideas 2022-03-16T12:48:04.397Z
Solving Interpretability Week 2021-12-13T17:09:12.822Z
Solve Corrigibility Week 2021-11-28T17:00:29.986Z
What Heuristics Do You Use to Think About Alignment Topics? 2021-09-29T02:31:16.034Z
Wanting to Succeed on Every Metric Presented 2021-04-12T20:43:01.240Z
Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda 2020-09-03T18:27:05.860Z
What's a Decomposable Alignment Topic? 2020-08-21T22:57:00.642Z
Mapping Out Alignment 2020-08-15T01:02:31.489Z
Writing Piano Songs: A Journey 2020-08-10T21:50:25.099Z
Solving Key Alignment Problems Group 2020-08-03T19:30:45.916Z
No Ultimate Goal and a Small Existential Crisis 2020-07-24T18:39:40.398Z
Seeking Power is Often Convergently Instrumental in MDPs 2019-12-05T02:33:34.321Z
"Mild Hallucination" Test 2019-10-10T17:57:42.471Z
Finding Cruxes 2019-09-20T23:54:47.532Z
False Dilemmas w/ exercises 2019-09-17T22:35:33.882Z
Category Qualifications (w/ exercises) 2019-09-15T16:28:53.149Z
Proving Too Much (w/ exercises) 2019-09-15T02:28:51.812Z
Arguing Well Sequence 2019-09-15T02:01:30.976Z

Comments

Comment by Logan Riggs (elriggs) on [Exercise] Four Examples of Noticing Confusion · 2025-01-21T17:52:03.418Z · LW · GW

I didn't either, but on reflection it is! 

I did change the post based off your comment, so thanks!

Comment by Logan Riggs (elriggs) on Logan Riggs's Shortform · 2025-01-19T15:22:43.155Z · LW · GW

I think the fuller context,

Anthropic has put WAY more effort into safety, way way more effort into making sure there are really high standards for safety and that there isn't going to be danger what these AIs are doing

implies it's just the amount of effort is larger than other companies (which I agree with), and not the Youtuber believing they've solved alignment or are doing enough, see: 

but he's also a realist and is like "AI is going to really potentially fuck up our world"

and

But he's very realistic. There is a lot of bad shit that is going to happen with AI. I'm not denying that at all.

So I'm not confident that it's "giving people a false impression of how good we are doing on actually making things safe." in this case.

I do know DougDoug has recommended Anthropic's Alignment Faking paper to another youtuber, which is more of a "stating a problem" paper than saying they've solved it.

Comment by Logan Riggs (elriggs) on [Exercise] Four Examples of Noticing Confusion · 2025-01-19T15:11:26.789Z · LW · GW

Did you listen to the song first?

Comment by Logan Riggs (elriggs) on Thane Ruthenis's Shortform · 2025-01-18T21:17:51.176Z · LW · GW

Thinking through it more, Sox2-17 (they changed 17 amino acids from Sox2 gene) was your linked paper's result, and Retro's was a modified version of factors Sox AND KLF. Would be cool if these two results are complementary.

Comment by Logan Riggs (elriggs) on Thane Ruthenis's Shortform · 2025-01-18T21:00:43.551Z · LW · GW

You're right! Thanks
For Mice, up to 77% 

Sox2-17 enhanced episomal OKS MEF reprogramming by a striking 150 times, giving rise to high-quality miPSCs that could generate all-iPSC mice with up to 77% efficiency

For human cells, up to 9%  (if I'm understanding this part correctly).
 

SOX2-17 gave rise to 56 times more TRA1-60+ colonies compared with WT-SOX2: 8.9% versus 0.16% overall reprogramming efficiency.

So seems like you can do wildly different depending on the setting (mice, humans, bovine, etc), and I don't know what the Retro folks were doing, but does make their result less impressive. 

Comment by Logan Riggs (elriggs) on [Exercise] Four Examples of Noticing Confusion · 2025-01-18T20:39:09.324Z · LW · GW

You're actually right that this is due to meditation for me. AFAIK, it's not a synesthesia-esque though (ie I'm not causing there to be two qualia now), more like the distinction between mental-qualia and bodily-qualia doesn't seem meaningful upon inspection. 

So I believe it's a semantic issue, and I really mean "confusion is qualia you can notice and act on" (though I agree I'm using "bodily" in non-standard ways and should stop when communicating w/ non-meditators).

Comment by Logan Riggs (elriggs) on [Exercise] Four Examples of Noticing Confusion · 2025-01-18T20:29:06.567Z · LW · GW

This is great feedback, thanks! I added another example based off what you said.

For how obvious the first one is, at least two folks I asked (not from this community) didn't think it was a baby initially (though one is a non-native English speaker who didn't know "2 birds of a feather" and assumed "our company" meant "the singers and their partner"). Neither is a parent. 

I did select these because they caused confusion in myself when I heard/saw them years ago, but they were "in the wild" instead of in a post on noticing confusion.

I did want a post I could link [non rationalist friends] to that's a more fun intro to noticing confusion, so more regular members might not benefit!

Comment by Logan Riggs (elriggs) on Thane Ruthenis's Shortform · 2025-01-18T02:33:29.401Z · LW · GW

For those also curious, Yamanaka factors are specific genes that turn specialized cells (e.g. skin, hair) into induced pluripotent stem cells (iPSCs) which can turn into any other type of cell.

This is a big deal because you can generate lots of stem cells to make full organs[1] or reverse aging (maybe? they say you just turn the cell back younger, not all the way to stem cells).

 You can also do better disease modeling/drug testing: if you get skin cells from someone w/ a genetic kidney disease, you can turn those cells into the iPSCs, then into kidney cells which will exhibit the same kidney disease because it's genetic. You can then better understand how the [kidney disease] develops and how various drugs affect it. 

So, it's good to have ways to produce lots of these iPSCs. According to the article, SOTA was <1% of cells converted into iPSCs, whereas the GPT suggestions caused a 50x improvement to 33% of cells converted. That's quite huge, so hopefully this result gets verified. I would guess this is true and still a big deal, but concurrent work got similar results.

Too bad about the tumors. Turns out iPSCs are so good at turning into other cells, that they can turn into infinite cells (ie cancer).  iPSCs were used to fix spinal cord injuries (in mice) which looked successful for 112 days, but then a follow up study said [a different set of mice also w/ spinal iPSCs] resulted in tumors.

My current understanding is this is caused by the method of delivering these genes (ie the Yamanaka factors) through retrovirus which 

is a virus that uses RNA as its genomic material. Upon infection with a retrovirus, a cell converts the retroviral RNA into DNA, which in turn is inserted into the DNA of the host cell. 

which I'd guess is the method Retro Biosciences uses. 

I also really loved the story of how Yamanaka discovered iPSCs:

Induced pluripotent stem cells were first generated by Shinya Yamanaka and Kazutoshi Takahashi at Kyoto University, Japan, in 2006.[1] They hypothesized that genes important to embryonic stem cell (ESC) function might be able to induce an embryonic state in adult cells. They chose twenty-four genes previously identified as important in ESCs and used retroviruses to deliver these genes to mouse fibroblasts. The fibroblasts were engineered so that any cells reactivating the ESC-specific gene, Fbx15, could be isolated using antibiotic selection.

Upon delivery of all twenty-four factors, ESC-like colonies emerged that reactivated the Fbx15 reporter and could propagate indefinitely. To identify the genes necessary for reprogramming, the researchers removed one factor at a time from the pool of twenty-four. By this process, they identified four factors, Oct4, Sox2, cMyc, and Klf4, which were each necessary and together sufficient to generate ESC-like colonies under selection for reactivation of Fbx15.

  1. ^

    These organs would have the same genetics as the person who supplied the [skin/hair cells] so risk of rejection would be lower (I think)

Comment by Logan Riggs (elriggs) on Logan Riggs's Shortform · 2025-01-16T12:49:51.191Z · LW · GW

A trending youtube video w/ 500k views in a day brings up Dario Amodei's Machines of Loving Grace (Timestamp for the quote):
[Note: I had Claude help format, but personally verified the text's faithfulness]

I am an AI optimist. I think our world will be better because of AI. One of the best expressions of that I've seen is this blog post by Dario Amodei, who is the CEO of Anthropic, one of the biggest AI companies. I would really recommend reading this - it's one of the more interesting articles and arguments I have read. He's basically saying AI is going to have an incredibly positive impact, but he's also a realist and is like "AI is going to really potentially fuck up our world"

He's notable and more trustworthy because his company Anthropic has put WAY more effort into safety, way way more effort into making sure there are really high standards for safety and that there isn't going to be danger what these AIs are doing. So I really really like Dario and I've listened to a lot of what he's said. Whereas with some other AI leaders like Sam Altman who runs OpenAI, you don't know what the fuck he's thinking. I really like [Dario] - he also has an interesting background in biological work and biotech, so he's not just some tech-bro; he's a bio-tech-bro. But his background is very interesting.

But he's very realistic. There is a lot of bad shit that is going to happen with AI. I'm not denying that at all. It's about how we maximize the positive while reducing the negatives. I really want AI to solve all of our diseases. I would really like AI to fix cancer - I think that will happen in our lifetimes. To me, I'd rather we fight towards that future rather than say 'there will be problems, let's abandon the whole thing.'

Other notes: This is youtuber/Streamer DougDoug (2.8M subscribers), with this video posted on his other channel DougDougDoug ("DougDoug content that's too rotten for the main channel") who often streams/posts coding/AI integrated content.

The full video is also an entertaining summary of case law on AI generated art/text copyright.

Comment by Logan Riggs (elriggs) on Speedrunning Rationality: Day II · 2025-01-06T14:38:37.180Z · LW · GW

Hey Midius!

My recommended rationality habit is noticing confusion, by which I mean a specific mental feeling that's usually quick & subtle & easy to ignore.

David Chapman has a more wooey version called Eating Your Shadow, which was very helpful for me since it pointed me towards acknowledging parts of my experience that I was denying due to identity & social reasons (hence the easy to ignore part).

Comment by Logan Riggs (elriggs) on When AI 10x's AI R&D, What Do We Do? · 2024-12-28T02:07:19.920Z · LW · GW

Could you go into more details into what skills these advisers would have or what situations to navigate? 

Because I'm baking in the "superhuman in coding/maths" due to the structure of those tasks, and other tasks can either be improved through:
1. General capabilities
2. Specific task 

And there might be ways to differentially accelerate that capability.

Comment by Logan Riggs (elriggs) on What o3 Becomes by 2028 · 2024-12-22T15:19:27.882Z · LW · GW

I really appreciate your post and all the links! This and your other recent posts/comments have really helped make a clearer model of timelines. 

Comment by Logan Riggs (elriggs) on keltan's Shortform · 2024-12-22T12:28:49.188Z · LW · GW

In my experience, most of the general public will verbally agree that AI X-risk is a big deal, but then go about their day (cause reasonably, they have no power). There's no obvious social role/action to do in response to that.

For climate, people understand that they should recycle, not keep the water running, and if there's a way to donate to clean the ocean on a Mr. Beast video, then some will even donate (sadly, none of these are very effective for solving the climate problem though! Gotta avoid that for our case).

Having a clear call-to-action seems relevant. For example, educating the public about AI taking jobs for the purpose of building support for UBI. It's then clear what to communicate and the call-to-action.

I'd be curious to hear what you think an ask should be? 

Alternatively, you could argue that generally informing folks on a wide scale about the risks involved will then allow general public to do what they believe is locally best. This could involve a documentary or realistic movie.

Comment by Logan Riggs (elriggs) on Karl Krueger's Shortform · 2024-12-22T12:08:53.835Z · LW · GW

Claude 3.5 seems to understand the spirit of the law when pursuing a goal X. 

A concern I have is that future training procedures will incentivize more consequential reasoning (because those get higher reward). This might be obvious or foreseeable, but could be missed/ignored under racing pressure or when lab's LLMs are implementing all the details of research.

Comment by Logan Riggs (elriggs) on When AI 10x's AI R&D, What Do We Do? · 2024-12-22T11:48:45.369Z · LW · GW

Thanks! 

I forgot about faithful CoT and definitely think that should be a "Step 0". I'm also concerned here that AGI labs just don't do the reasonable things (ie training for briefness making the CoT more steganographic). 

For Mech-interp, ya, we're currently bottlenecked by:

  1. Finding a good enough unit-of-computation (which would enable most of the higher-guarantee research)
  2. Computing Attention_in --> Attention_out (Keith got the QK-circuit -> Attention pattern working a while ago, but hasn't hooked it up w/ the OV-circuit)

Comment by Logan Riggs (elriggs) on When AI 10x's AI R&D, What Do We Do? · 2024-12-21T23:58:19.661Z · LW · GW

This is mostly a "reeling from o3"-post. If anyone is doom/anxiety-reading these posts, well, I've been doing that too! At least, we're in this together:)

Comment by Logan Riggs (elriggs) on o3 · 2024-12-20T21:39:09.887Z · LW · GW

From an apparent author on reddit:

[Frontier Math is composed of] 25% T1 = IMO/undergrad style problems, 50% T2 = grad/qualifying exam style problems, 25% T3 = early researcher problems

The comment was responding to a claim that Terence Tao said he could only solve a small percentage of questions, but Terence was only sent the T3 questions. 

Comment by Logan Riggs (elriggs) on Careless thinking: A theory of bad thinking · 2024-12-18T14:58:30.326Z · LW · GW

I also have a couple friends that require serious thinking (or being on my toes).  I think it's because they have some model of how something works, and I say something, showing my lack of this model.

Additionally, programming causes this as well (in response to compilation errors, nonsense outputs, or runs that take too long). 

Comment by Logan Riggs (elriggs) on Logan Riggs's Shortform · 2024-12-04T14:52:56.254Z · LW · GW

Was looking up Google Trend lines for chatGPT and noticed a cyclical pattern:

Where the dips are weekends, meaning it's mostly used by people in the workweek. I mostly expect this is students using it for homework. This is substantiated by two other trends:
1. Dips in interest over winter and summer breaks (And Thanksgiving break in above chart)

2. "Humanize AI" which is 

Humanize AI™ is your go-to platform for seamlessly converting AI-generated text into authentic, undetectable, human-like content 

[Although note that overall interest in ChatGPT is WAY higher than Humanize AI]

Comment by Logan Riggs (elriggs) on MIRI’s 2024 End-of-Year Update · 2024-12-03T21:21:05.265Z · LW · GW

I was expecting this to include the output of MIRI for this year. Digging into your links we have:

Two Technical Governance Papers:
1. Mechanisms to verify international agreements about AI development
2. What AI evals for preventing catastrophic risks can and cannot do

Four Media pieces of Eliezer regarding AI risk:
1. Semafor piece
2. 1 hr talk w/ Panel 
3. PBS news hour
4. 4 hr video w/ Stephen Wolfram

Is this the full output for the year, or are there less linkable outputs such as engaging w/ policymakers on AI risks?

Comment by Logan Riggs (elriggs) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-11-30T19:03:10.670Z · LW · GW

Donated $100. 

It was mostly due to LW2 that I decided to work on AI safety, actually, so thanks!

I've had the pleasure of interacting w/ the LW team quite a bit and they definitely embody the spirit of actually trying. Best of luck to y'all's endeavors!

Comment by Logan Riggs (elriggs) on LLM chatbots have ~half of the kinds of "consciousness" that humans believe in. Humans should avoid going crazy about that. · 2024-11-22T21:10:30.192Z · LW · GW

I tried a similar experiment w/ Claude 3.5 Sonnet, where I asked it to come up w/ a secret word and in branching paths:
1. Asked directly for the word
2. Played 20 questions, and then guessed the word

In order to see if it does have a consistent word it can refer back to.

Branch 1: 

Branch 2:

Which I just thought was funny.

Asking again, telling it about the experiment and how it's important for it to try to give consistent answers, it initially said "telescope" and then gave hints towards a paperclip.

Interesting to see when it flips its answers, though it's a simple setup to repeatedly ask for its answer every time. 

Also could be confounded by temperature. 

Comment by Logan Riggs (elriggs) on [Completed] The 2024 Petrov Day Scenario · 2024-09-26T17:41:13.427Z · LW · GW

It'd be important to cache the karma of all users > 1000 atm, in order to credibly signal you know which generals were part of the nuking/nuked side. Would anyone be willing to do that in the next 2 & 1/2 hours? (ie the earliest we could be nuked)

Comment by Logan Riggs (elriggs) on [Completed] The 2024 Petrov Day Scenario · 2024-09-26T17:30:11.335Z · LW · GW

We could instead  pre-commit to not engage with any nuker's future posts/comments (and at worse comment to encourage others to not engage) until end-of-year.

Or only include nit-picking comments.

Comment by Logan Riggs (elriggs) on Physics of Language models (part 2.1) · 2024-09-23T11:57:34.765Z · LW · GW

Could you dig into why you think it's great interp work?

Comment by Logan Riggs (elriggs) on Why I'm bearish on mechanistic interpretability: the shards are not in the network · 2024-09-13T20:34:21.603Z · LW · GW

But through gradient descent, shards act upon the neural networks by leaving imprints of themselves, and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space). So studying weights and activations is pretty doomed.

This paragraph sounded like you're claiming LLMs do have concepts, but they're not in specific activations or weights, but distributed across them instead. 

But from your comment, you mean that LLMs themselves don't learn the true simple-compressed features of reality, but a mere shadow of them.

This interpretation also matches the title better!

But are you saying the "true features" in the dataset + network? Because SAEs are trained on a dataset! (ignoring the problem pointed out in footnote 1).

Possibly clustering the data points by their network gradients would be a way to put some order into this mess? 

Eric Michaud did cluster datapoints by their gradients here. From the abstract:

...Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta). 

Comment by Logan Riggs (elriggs) on Why I'm bearish on mechanistic interpretability: the shards are not in the network · 2024-09-13T20:18:36.166Z · LW · GW

The one we checked last year was just Pythia-70M, which I don't expect the LLM itself to have a gender feature that generalizes to both pronouns and anisogamy.

But again, the task is next-token prediction. Do you expect e.g. GPT 4 to have learned a gender concept that affects both knowledge about anisogamy and pronouns while trained on next-token prediction?

Comment by Logan Riggs (elriggs) on Why I'm bearish on mechanistic interpretability: the shards are not in the network · 2024-09-13T19:05:24.862Z · LW · GW

Sparse autoencoders finds features that correspond to abstract features of words and text. That's not the same as finding features that correspond to reality.

(Base-model) LLMs are trained to minimize prediction error, and SAEs do seem to find features that sparsely predict error, such as a gender feature that, when removed, affects the probability of pronouns. So pragmatically, for the goal of "finding features that explain next-word-prediction", which LLMs are directly trained for, SAEs find good examples![1]

I'm unsure what goal you have in mind for "features that correspond to reality", or what that'd mean.

  1. ^

    Not claiming that all SAE latents are good in this way though.

Comment by Logan Riggs (elriggs) on Decomposing the QK circuit with Bilinear Sparse Dictionary Learning · 2024-09-10T19:14:31.721Z · LW · GW

Is there code available for this?

I'm mainly interested in the loss function. Specifically from footnote 4:

We also need to add a term to capture the interaction effect between the key-features and the query-transcoder bias, but we omit this for simplicity

I'm unsure how this is implemented or the motivation. 

Comment by Logan Riggs (elriggs) on A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team · 2024-09-05T17:23:51.532Z · LW · GW

Some MLPs or attention layers may implement a simple linear transformation in addition to actual computation.

@Lucius Bushnaq , why would MLPs compute linear transformations? 

Because two linear transformations can be combined into one linear transformation, why wouldn't downstream MLPs/Attns that rely on this linearly transformed vector just learn the combined function? 
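The premise of the question can be checked numerically. This is a minimal NumPy sketch with made-up sizes and random matrices (`W1`, `b1`, `W2`, `b2` and the width `d` are hypothetical, not taken from the post), showing that two stacked affine maps are the same function as one combined affine map:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical residual-stream width

# Two hypothetical affine maps applied one after another.
W1, b1 = rng.normal(size=(d, d)), rng.normal(size=d)
W2, b2 = rng.normal(size=(d, d)), rng.normal(size=d)

def two_step(x):
    """Apply the first affine map, then the second."""
    return W2 @ (W1 @ x + b1) + b2

# The same function written as a single affine map.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2

def one_step(x):
    """Apply the combined affine map in one go."""
    return W_combined @ x + b_combined

x = rng.normal(size=d)
max_gap = np.max(np.abs(two_step(x) - one_step(x)))
print(f"largest difference between the two versions: {max_gap:.2e}")
```

Any nonlinearity between the two maps breaks this identity, which is why the question only bites for the purely linear part of a layer.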

Comment by Logan Riggs (elriggs) on Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning · 2024-08-27T16:44:24.068Z · LW · GW

What is the activation name for the resid SAEs? hook_resid_post or hook_resid_pre?

I found https://github.com/ApolloResearch/e2e_sae/blob/main/e2e_sae/scripts/train_tlens_saes/run_train_tlens_saes.py#L220
to suggest _post
but downloading the SAETransformer from wandb shows:
(saes): 
    ModuleDict( (blocks-6-hook_resid_pre): 
        SAE( (encoder): Sequential( (0):...

which suggests _pre. 
 

Comment by Logan Riggs (elriggs) on Milan W's Shortform · 2024-08-27T15:13:44.595Z · LW · GW

Here’s a link: https://archive.nytimes.com/thelede.blogs.nytimes.com/2008/07/02/a-window-into-waterboarding/

Comment by Logan Riggs (elriggs) on Turning 22 in the Pre-Apocalypse · 2024-08-23T00:09:04.737Z · LW · GW

3. Those who are more able to comprehend and use these models are therefore of a higher agency/utility and higher moral priority than those who cannot. [emphasis mine]

This (along with saying "dignity" implies "moral worth" in Death w/ Dignity post), is confusing to me. Could you give a specific example of how you'd treat differently someone who has more or less moral worth (e.g. give them more money, attention, life-saving help, etc)? 

One thing I could understand from your Death w/ Dignity excerpt is he's definitely implying a metric that scores everyone, and some people will score higher on this metric than others. It's also common to want to score high on these metrics or feel emotionally bad if you don't score high on these metrics (see my post for more). This could even have utility, like having more "dignity" gets you a thumbs up from Yudkowsky or have your words listened to more in this community. Is this close to what you mean at all?

Rationalism is path-dependent

I was a little confused on this section. Is this saying that human's goals and options (including options that come to mind) change depending on the environment, so rational choice theory doesn't apply?

Games and Game Theory

I believe the thesis here is that game theory doesn't really apply in real life, that there are usually extra constraints or freedoms in real situations that change the payoffs. 

I do think this criticism is already handled by trying to "actually win" and "trying to try"; though I've personally benefitted specifically from trying to try and David Chapman's meta-rationality post. 

Probability and His Problems

The idea of deference (and when to defer) isn't novel (which is fine! Novelty is another metric I'm bringing up, but not important for everything one writes to be). It's still useful to apply Bayes theorem to deference. Specifically evidence that convinces you to trust someone should imply that there's evidence that convinces you to not trust them. 
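The last sentence is an instance of what the Sequences call conservation of expected evidence. As a sketch, write $T$ for "this person is trustworthy" and $E$ for the evidence you might observe (both symbols are introduced here for illustration), and assume $0 < P(E) < 1$:

```latex
% The prior is the probability-weighted average of the two possible posteriors.
P(T) = P(T \mid E)\,P(E) + P(T \mid \neg E)\,P(\neg E)

% Hence, if observing E would raise your trust, failing to observe it must lower it:
P(T \mid E) > P(T) \;\Longrightarrow\; P(T \mid \neg E) < P(T)
```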

This is currently all I have time for; however, my current understanding is that there is a common interpretation of Yudkowsky's writings/The sequences/LW/etc that leads to an over-reliance on formal systems that will inevitably fail people. I think you had this interpretation (do correct me if I'm wrong!), and this is your "attempt to renegotiate rationalism". 

There is the common response of "if you re-read the sequences, you'll see how it actually handles all the flaws you mentioned"; however, it's still true that it's at least a failure in communication that many people consistently mis-interpret it. 

Glad to hear you're synthesizing and doing pretty good now:)

Comment by Logan Riggs (elriggs) on Turning 22 in the Pre-Apocalypse · 2024-08-22T23:41:15.488Z · LW · GW

I think copy-pasting the whole thing will make it more likely to be read! I enjoyed it and will hopefully leave a more substantial comment later.

Comment by Logan Riggs (elriggs) on what becoming more secure did for me · 2024-08-22T21:30:46.709Z · LW · GW

I've really enjoyed these posts; thanks for cross posting!

Comment by Logan Riggs (elriggs) on Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning · 2024-08-22T17:27:22.637Z · LW · GW

Kind of confused on why the KL-only e2e SAEs have worse CE than e2e+downstream across dictionary sizes:
 

This is true for layers 2 & 6. I'm unsure if this means that training for KL directly is harder/unstable, and the intermediate MSE is a useful prior, or if this is a difference in KL vs CE (ie the e2e does in fact do better on KL but worse on CE than e2e+downstream).

Comment by Logan Riggs (elriggs) on Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning · 2024-08-22T17:22:14.624Z · LW · GW

I finally checked!

Here is the Jaccard similarity (ie similarity of input-token activations) across seeds

The e2e ones do indeed have a much lower jaccard sim (there normally is a spike at 1.0, but this is removed when you remove features that only activate <10 times). 
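For readers who want to try this kind of comparison, here is one way it could be computed. It is a sketch under assumptions, not the analysis code behind the chart: it assumes activations have already been binarized into "fired / didn't fire" per token, and the function name, toy data, and `min_count` default are hypothetical (the default of 10 mirrors the threshold mentioned above).

```python
import numpy as np

def jaccard_matrix(acts_a, acts_b, min_count=10):
    """Pairwise Jaccard similarity between the features of two SAEs.

    acts_a, acts_b: boolean arrays of shape (n_tokens, n_features), True where
    a feature fired on a token. Features that fire fewer than `min_count`
    times are dropped first (min_count must be >= 1 to avoid empty unions).
    Returns (similarity matrix, kept feature indices for a, for b).
    """
    keep_a = np.flatnonzero(acts_a.sum(axis=0) >= min_count)
    keep_b = np.flatnonzero(acts_b.sum(axis=0) >= min_count)
    a = acts_a[:, keep_a].astype(np.float64)
    b = acts_b[:, keep_b].astype(np.float64)

    intersection = a.T @ b  # number of tokens where both features fire
    union = a.sum(axis=0)[:, None] + b.sum(axis=0)[None, :] - intersection
    return intersection / union, keep_a, keep_b

# Toy data: 3 features per "seed" over 40 tokens (purely illustrative).
tokens = np.arange(40)
seed_a = np.stack([tokens < 20, tokens % 2 == 0, tokens < 3], axis=1)
seed_b = np.stack([tokens < 20, tokens < 10, tokens >= 38], axis=1)

sim, keep_a, keep_b = jaccard_matrix(seed_a, seed_b)
best_match = sim.max(axis=1)  # best seed-B match for each kept seed-A feature
print(best_match)
```

A histogram of `best_match` over all kept features is one way to get a chart of the kind described here.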

I also (mostly) replicated the decoder similarity chart:

And calculated the encoder sim:

[I, again, needed to remove dead features (< 10 activations) to get the graphs here.] 
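The decoder and encoder comparisons can be sketched the same way. The snippet below is an assumption about how such a chart could be produced, not the code used here: for each live feature in one seed it takes the highest cosine similarity to any live feature in the other seed. The arrays `dec_a`, `dec_b`, `counts_a`, and `counts_b` are hypothetical toy values.

```python
import numpy as np

def max_cosine_sim(dirs_a, dirs_b):
    """For each row of dirs_a, the highest cosine similarity to any row of dirs_b.

    dirs_a, dirs_b: arrays of shape (n_features, d_model), one direction per row.
    """
    a = dirs_a / np.linalg.norm(dirs_a, axis=1, keepdims=True)
    b = dirs_b / np.linalg.norm(dirs_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

# Hypothetical decoder directions for two seeds (rows are features).
dec_a = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0]])
dec_b = np.array([[0.0, 3.0, 0.0], [1.0, 0.0, 1.0]])

# Hypothetical activation counts, used to drop dead features (< 10 activations).
counts_a = np.array([50, 120, 4])
counts_b = np.array([30, 15])
alive_a = counts_a >= 10
alive_b = counts_b >= 10

scores = max_cosine_sim(dec_a[alive_a], dec_b[alive_b])
print(scores)
```

Passing encoder directions instead of decoder directions (one feature per row) gives the encoder version.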

So yes, I believe the original paper's claim that e2e features learn quite different features across seeds is substantiated.

Comment by Logan Riggs (elriggs) on You can remove GPT2’s LayerNorm by fine-tuning for an hour · 2024-08-18T04:22:21.853Z · LW · GW

And here's the code to convert it to NNsight (Thanks Caden for writing this awhile ago!)

import torch
from transformers import GPT2LMHeadModel
from transformer_lens import HookedTransformer
from nnsight.models.UnifiedTransformer import UnifiedTransformer


model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu")

# Undo my hacky LayerNorm removal
for block in model.transformer.h:
    block.ln_1.weight.data = block.ln_1.weight.data / 1e6
    block.ln_1.eps = 1e-5
    block.ln_2.weight.data = block.ln_2.weight.data / 1e6
    block.ln_2.eps = 1e-5
model.transformer.ln_f.weight.data = model.transformer.ln_f.weight.data / 1e6
model.transformer.ln_f.eps = 1e-5

# Properly replace LayerNorms by Identities
def removeLN(transformer_lens_model):
    for i in range(len(transformer_lens_model.blocks)):
        transformer_lens_model.blocks[i].ln1 = torch.nn.Identity()
        transformer_lens_model.blocks[i].ln2 = torch.nn.Identity()
    transformer_lens_model.ln_final = torch.nn.Identity()

hooked_model = HookedTransformer.from_pretrained("gpt2", hf_model=model, fold_ln=True, center_unembed=False).to("cpu")
removeLN(hooked_model)

model_nnsight = UnifiedTransformer(model="gpt2", hf_model=model, fold_ln=True, center_unembed=False).to("cpu")
removeLN(model_nnsight)

# Both models were moved to the CPU above, so keep the prompt on the CPU too
prompt = torch.tensor([1,2,3,4])

with torch.no_grad(), model_nnsight.trace(prompt) as runner:
    logits2 = model_nnsight.unembed.output.save()
logits, cache = hooked_model.run_with_cache(prompt)
# Check that the TransformerLens and NNsight versions agree
print(torch.allclose(logits, logits2))

Comment by Logan Riggs (elriggs) on Tokenized SAEs: Infusing per-token biases. · 2024-08-12T15:15:35.227Z · LW · GW

Maybe this should be like Anthropic's shared decoder bias? Essentially subtract off the per-token bias at the beginning, let the SAE reconstruct this "residual", then add the per-token bias back to the reconstructed x. 

The motivation is that the SAE has a weird job in this case. It sees x, but needs to reconstruct x - per-token-bias, which means it needs to somehow learn what that per-token-bias is during training. 

However, if you just subtract it first, then the SAE sees x', and just needs to reconstruct x'. 
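A toy sketch of the suggestion, assuming a plain ReLU SAE and a lookup table of per-token biases. All names and sizes (`token_bias`, `W_enc`, `W_dec`, `forward`, the toy dimensions) are hypothetical and not taken from the post's code:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_feat = 5, 4, 6  # hypothetical toy sizes

token_bias = rng.normal(size=(vocab, d_model))  # per-token bias lookup table
W_enc = rng.normal(size=(d_model, n_feat))
W_dec = rng.normal(size=(n_feat, d_model))
b_enc = np.zeros(n_feat)

def sae_reconstruct(x_prime):
    """Plain ReLU SAE: it only ever sees, and reconstructs, x_prime."""
    feats = np.maximum(x_prime @ W_enc + b_enc, 0.0)
    return feats @ W_dec

def forward(x, token_ids):
    """Subtract the per-token bias, reconstruct the residual, add the bias back."""
    bias = token_bias[token_ids]
    x_prime = x - bias
    return sae_reconstruct(x_prime) + bias

token_ids = np.array([0, 3, 3])
x = rng.normal(size=(3, d_model))
x_hat = forward(x, token_ids)

# Sanity check: if x is exactly the token's bias, the SAE sees zeros,
# so the bias alone reproduces x.
exact = forward(token_bias[token_ids], token_ids)
print(np.allclose(exact, token_bias[token_ids]))
```

With this arrangement the SAE's input and its reconstruction target are the same vector `x_prime`, which is the point of the suggestion.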

So I'm just suggesting changing  here: 

w/  remaining the same:
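
A minimal sketch of the subtract-first variant described above, in NumPy rather than the post's training code. The names (`W_enc`, `b_enc`, `W_dec`, `token_bias`) and the toy shapes are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, vocab_size = 8, 32, 10  # toy sizes (assumed)

# Hypothetical SAE parameters and per-token bias table.
W_enc = rng.normal(size=(n_features, d_model)) * 0.1
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(d_model, n_features)) * 0.1
token_bias = rng.normal(size=(vocab_size, d_model))  # one bias vector per token


def sae_forward_subtract_first(x, token_ids):
    """Subtract the per-token bias, reconstruct the residual, add the bias back."""
    bias = token_bias[token_ids]                         # (batch, d_model) lookup
    residual = x - bias                                  # what the SAE actually sees
    feats = np.maximum(residual @ W_enc.T + b_enc, 0.0)  # ReLU encoder
    x_hat = feats @ W_dec.T + bias                       # decode, then add bias back
    return x_hat, feats, residual


x = rng.normal(size=(2, d_model))
token_ids = np.array([3, 7])
x_hat, feats, residual = sae_forward_subtract_first(x, token_ids)
print(x_hat.shape, feats.shape)
```

With this layout the encoder input and the reconstruction target are the same quantity (the residual), so the SAE never has to infer the per-token bias from x.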

Comment by Logan Riggs (elriggs) on Tokenized SAEs: Infusing per-token biases. · 2024-08-12T15:04:45.722Z · LW · GW

That's great thanks! 

My suggested experiment to really get at this question (which, if I were in your shoes, I wouldn't want to run since you've already done quite a bit of work on this project, lol):

Compare 
1. Baseline 80x expansion (56k features) at k=30
2. Tokenized-learned 8x expansion (50k vocab + 6k features) at k=29 (since the token adds 1 extra feature)

for 300M tokens (I usually don't see improvements past this amount) showing NMSE and CE.
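
The feature counts in this comparison can be checked with a little arithmetic. The sketch below assumes GPT-2 small sizes (`d_model = 768`, `vocab_size = 50257`); neither number is stated in the comment, so treat both as assumptions.

```python
# Assumed GPT-2 small sizes; neither number is stated in the comment.
d_model = 768
vocab_size = 50257

learned_8x = 8 * d_model                   # learned features in the tokenized SAE
tokenized_total = vocab_size + learned_8x  # per-token lookup vectors + learned features
matched_expansion = tokenized_total / d_model  # baseline expansion with the same total

print(learned_8x)                   # 6144
print(tokenized_total)              # 56401
print(round(matched_expansion, 1))  # 73.4
```

Under these assumptions a baseline with the same ~56k dictionary entries works out to roughly a 73x expansion, so the "80x" above reads as a rounded figure. The active-feature budgets also match: k=30 for the baseline versus k=29 learned features plus the one always-on token feature.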

If tokenized-SAEs are still better in this experiment, then that's a pretty solid argument to use these!

If they're equivalent, then tokenized-SAEs are still way faster to train in this lower expansion range, while having 50k "features" already interpreted.

If tokenized-SAEs are worse, then these tokenized features aren't a good prior to use. Although both sets of features are learned, the difference would be that the tokenized SAE always has the same feature per token (duh), while baseline SAEs allow whatever combination of features (e.g. features shared across different tokens).

Comment by Logan Riggs (elriggs) on Tokenized SAEs: Infusing per-token biases. · 2024-08-12T14:38:50.431Z · LW · GW

About similar tokenized features, maybe I'm misunderstanding, but this seems like a problem for any decoder-like structure.

I didn't mean to imply it's a problem, but the interpretation should be different. For example, if at layer N all the number tokens have cos-sim=1 in the tokenized-feature set, and we find a downstream feature reading from the " 9" token on a specific task, then we should conclude it's reading from a general number direction rather than a specific number direction.

I agree this argument also applies to the normal SAE decoder (if the cos-sim=1).

Comment by Logan Riggs (elriggs) on Tokenized SAEs: Infusing per-token biases. · 2024-08-10T15:18:30.240Z · LW · GW

Although, tokenized features are dissimilar to normal features in that they don't vary in activation strength. Tokenized features are either 0 or 1 (or norm of the vector). So it's not exactly an apples-to-apples comparison w/ a similar sized dictionary of normal SAE features, although that plot would be nice!

Comment by Logan Riggs (elriggs) on Tokenized SAEs: Infusing per-token biases. · 2024-08-10T15:15:33.338Z · LW · GW

I do really like this work. This is useful for circuit-style work because the tokenized-features are already interpreted. If a downstream encoded feature reads from the tokenized-feature direction, then we know the only info being transmitted is info on the current token.

However, if multiple tokenized-features have similar directions (e.g. multiple tokenizations of the word "the"), then a circuit reading from this direction is just using information about this set of tokens.

Comment by Logan Riggs (elriggs) on Tokenized SAEs: Infusing per-token biases. · 2024-08-10T15:07:39.407Z · LW · GW

Do you have a dictionary-size to CE-added plot? (fixing L0)

So, we did not use the lookup biases as additional features (only for decoder reconstruction)

I agree it's not like the other features in that the encoder isn't used, but it is used for reconstruction which affects CE. It'd be good to show the pareto improvement of CE/L0 is not caused by just having an additional vocab_size number of features (although that might mean having to use auxk to have a similar number of alive features).

Comment by Logan Riggs (elriggs) on Tokenized SAEs: Infusing per-token biases. · 2024-08-09T19:55:01.689Z · LW · GW

Did you vary expansion size? The tokenized SAE will have 50k more features in its dictionary (compared to the 16x expansion of ~12k features from the paper version). 

Did you ever train a baseline SAE w/ similar number of features as the tokenized?

Comment by Logan Riggs (elriggs) on You can remove GPT2’s LayerNorm by fine-tuning for an hour · 2024-08-08T18:49:28.046Z · LW · GW

This is extremely useful for SAE circuit work. Now the connections between features are at most ReLU(Wx + b) which is quite interpretable! (Excluding attn_in->attn_out)

Thanks for doing this!
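
One way to read the ReLU(Wx + b) remark, sketched under assumptions: take two hypothetical SAEs on the residual stream and look only at the direct (skip-connection) path between them. Without LayerNorm, the upstream decoder and downstream encoder compose into a single affine map followed by the encoder's ReLU. All names and shapes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_up, n_down = 8, 20, 24  # toy sizes (assumed)

# Hypothetical upstream SAE decoder and downstream SAE encoder.
W_dec_up = rng.normal(size=(d_model, n_up))
b_dec_up = rng.normal(size=d_model)
W_enc_down = rng.normal(size=(n_down, d_model))
b_enc_down = rng.normal(size=n_down)

# With no LayerNorm in between, the direct path folds into one affine map.
W_eff = W_enc_down @ W_dec_up                  # (n_down, n_up) feature-to-feature weights
b_eff = W_enc_down @ b_dec_up + b_enc_down

f_up = np.maximum(rng.normal(size=n_up), 0.0)  # some upstream feature activations

# Decode upstream features, then encode downstream (two steps) ...
two_step = np.maximum(W_enc_down @ (W_dec_up @ f_up + b_dec_up) + b_enc_down, 0.0)
# ... versus applying the folded weights directly (one step).
one_step = np.maximum(W_eff @ f_up + b_eff, 0.0)

print(np.allclose(two_step, one_step))
```

With a LayerNorm in the way, the downstream encoder's input would be rescaled by a data-dependent factor, so no fixed `W_eff` would describe the connection.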

Comment by Logan Riggs (elriggs) on Tokenized SAEs: Infusing per-token biases. · 2024-08-08T18:41:36.511Z · LW · GW

Did you ever do a max-cos-sim between the vectors in the token-biases? I'm wondering how many biases are exactly the same (e.g. " The" vs "The" vs "the" vs " the", etc) which would allow a shrinkage of the number of features (although your point is good that an extra vocab_size num of features isn't large in the scale of millions of features). 
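
A minimal sketch of the max-cos-sim check, assuming the token biases are available as a `(vocab_size, d_model)` array. The `token_bias` array here is random toy data with one planted duplicate direction, not the actual learned biases.

```python
import numpy as np


def max_cos_sim(vectors, eps=1e-8):
    """For each row, return the highest cosine similarity to any other row and its index."""
    unit = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + eps)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # ignore self-similarity
    return sims.max(axis=1), sims.argmax(axis=1)


rng = np.random.default_rng(0)
token_bias = rng.normal(size=(6, 16))  # toy stand-in for the learned biases
token_bias[3] = 2.0 * token_bias[0]    # same direction, different norm

best_sim, best_idx = max_cos_sim(token_bias)
duplicates = np.flatnonzero(best_sim > 0.99)  # rows with a near-identical partner
print(duplicates, best_idx[duplicates])
```

For a full ~50k vocabulary the dense similarity matrix gets large, so computing it in row chunks would likely be needed.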

Comment by Logan Riggs (elriggs) on Tokenized SAEs: Infusing per-token biases. · 2024-08-08T18:38:38.886Z · LW · GW

Do you have any tokenized SAEs uploaded (and a simple download script)? I could only find model definitions in the repos.

If you just have it saved locally, here's my usual script for uploading to huggingface (just need your key).

Comment by Logan Riggs (elriggs) on JumpReLU SAEs + Early Access to Gemma 2 SAEs · 2024-07-23T18:00:32.623Z · LW · GW

Did y'all do any ablations on your loss terms? For example:
1. JumpReLU() -> ReLU
2. L0 (w/ STE) -> L1

I'd be curious to see if the Pareto improvements and high-frequency features are due to one, the other, or both.
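
For concreteness, the two swaps in the ablation can be sketched on the forward pass alone. This is a toy illustration with a hypothetical fixed threshold `theta`; it does not implement the straight-through estimator, which only matters for gradients.

```python
def relu(pre_acts):
    """Standard ReLU: keep positive pre-activations."""
    return [max(z, 0.0) for z in pre_acts]


def jump_relu(pre_acts, theta):
    """Pass a pre-activation through unchanged only if it exceeds the threshold theta."""
    return [z if z > theta else 0.0 for z in pre_acts]


def l0(acts):
    """Sparsity penalty that counts active features."""
    return sum(1 for a in acts if a != 0.0)


def l1(acts):
    """Sparsity penalty that sums activation magnitudes."""
    return sum(abs(a) for a in acts)


pre_acts = [-0.5, 0.05, 0.2, 1.5]
theta = 0.1  # hypothetical threshold; in JumpReLU SAEs it is learned per feature

relu_acts = relu(pre_acts)
jump_acts = jump_relu(pre_acts, theta)

# The four ablation cells: {ReLU, JumpReLU} x {L1, L0}
print("ReLU     L1:", l1(relu_acts), " L0:", l0(relu_acts))
print("JumpReLU L1:", l1(jump_acts), " L0:", l0(jump_acts))
```

The small activation at 0.05 survives ReLU but not JumpReLU, and the L1 penalty shrinks with activation magnitude while L0 only counts, which is why the two changes could plausibly have separate effects.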

Comment by Logan Riggs (elriggs) on An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs · 2024-07-14T19:39:14.291Z · LW · GW

On activation patching: 

The most salient difference is that RE doesn’t completely change the activations, but merely adds to them. Thus, RE aims to find how a certain concept is represented, while activation patching serves as an ablation of the function of a specific layer or neuron. Thus activation patching doesn’t directly tell you where to find the representation of a concept.

I'm pretty sure both methods give you some approximate location of the representation. RE is typically done on many layers & then picks the best layer. Activation patching ablates each layer & shows you which one is most important for the counterfactual. I would trust the result of patching more than RE for the location of a representation (maybe grounded out as what an SAE would find?), since patched activations are more in-distribution.
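
The mechanical difference between the two interventions can be shown on a toy network. This is a sketch under stated assumptions: a three-layer tanh MLP stands in for the model, and the steering direction is a simple activation difference rather than anything from the RE literature.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 6, 3  # toy sizes (assumed)
weights = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]


def run(x, edit_layer=None, edit_fn=None):
    """Run the toy network, optionally editing the activation after one layer."""
    acts = []
    h = x
    for i, W in enumerate(weights):
        h = np.tanh(W @ h)
        if i == edit_layer:
            h = edit_fn(h)
        acts.append(h)
    return h, acts


x_clean = rng.normal(size=d)
x_corrupt = rng.normal(size=d)
out_clean, acts_clean = run(x_clean)
out_corrupt, acts_corrupt = run(x_corrupt)

layer = 1
# Activation patching: overwrite the activation with the one from the clean run.
out_patched, _ = run(x_corrupt, layer, lambda h: acts_clean[layer])

# RE-style steering: keep the activation and add a direction to it.
steer = acts_clean[layer] - acts_corrupt[layer]  # hypothetical "concept" direction
out_steered, _ = run(x_corrupt, layer, lambda h: h + 0.5 * steer)

print(np.allclose(out_patched, out_clean))  # True: the whole layer state was replaced
print(np.abs(out_steered - out_corrupt).max())
```

Patching the entire layer state trivially recovers the clean output in a sequential toy like this. In a real transformer one would patch a single component or position per run and compare effects across layers.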