LessWrong 2.0 Reader

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
Kola Ayonrinde (kola-ayonrinde) · 2024-08-23T18:52:31.019Z · comments (5)

Anthropic rewrote its RSP
Zach Stein-Perlman · 2024-10-15T14:25:12.518Z · comments (19)

[link] AISafety.info: What is the "natural abstractions hypothesis"?
Algon · 2024-10-05T12:31:14.195Z · comments (2)

Compelling Villains and Coherent Values
Cole Wyeth (Amyr) · 2024-10-06T19:53:47.891Z · comments (4)

[Intuitive self-models] 7. Hearing Voices, and Other Hallucinations
Steven Byrnes (steve2152) · 2024-10-29T13:36:16.325Z · comments (2)

Book Review: On the Edge: The Business
Zvi · 2024-09-25T12:20:06.230Z · comments (0)

Australian AI Safety Forum 2024
Liam Carroll (liam-carroll) · 2024-09-27T00:40:11.451Z · comments (0)

You're a Space Wizard, Luke
lsusr · 2024-08-18T05:35:39.238Z · comments (6)

[link] [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF
Leon Lang (leon-lang) · 2024-10-22T13:57:41.125Z · comments (0)

[link] Generative ML in chemistry is bottlenecked by synthesis
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-16T16:31:34.801Z · comments (2)

[link] Characterizing stable regions in the residual stream of LLMs
Jett Janiak (jett) · 2024-09-26T13:44:58.792Z · comments (4)

Toy Models of Feature Absorption in SAEs
chanind · 2024-10-07T09:56:53.609Z · comments (7)

0.202 Bits of Evidence In Favor of Futarchy
niplav · 2024-09-29T21:57:59.896Z · comments (0)

OODA your OODA Loop
Raemon · 2024-10-11T00:50:48.119Z · comments (3)

[link] Turning 22 in the Pre-Apocalypse
testingthewaters · 2024-08-22T20:28:25.794Z · comments (14)

Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 2024-10-04T22:15:28.108Z · comments (5)

[link] A Percentage Model of a Person
Sable · 2024-10-12T17:55:07.560Z · comments (3)

The murderous shortcut: a toy model of instrumental convergence
Thomas Kwa (thomas-kwa) · 2024-10-02T06:48:06.787Z · comments (0)

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane (ckkissane) · 2024-10-27T18:46:21.316Z · comments (1)

A New Class of Glitch Tokens - BPE Subtoken Artifacts (BSA)
Lao Mein (derpherpize) · 2024-09-20T13:13:26.181Z · comments (7)

[link] I didn't have to avoid you; I was just insecure
Chipmonk · 2024-08-17T16:41:50.237Z · comments (7)

Distinguish worst-case analysis from instrumental training-gaming
Olli Järviniemi (jarviniemi) · 2024-09-05T19:13:34.443Z · comments (0)

COT Scaling implies slower takeoff speeds
Logan Zoellner (logan-zoellner) · 2024-09-28T16:20:00.320Z · comments (56)

LASR Labs Spring 2025 applications are open!
Erin Robertson · 2024-10-04T13:44:20.524Z · comments (0)

Glitch Token Catalog - (Almost) a Full Clear
Lao Mein (derpherpize) · 2024-09-21T12:22:16.403Z · comments (3)

[link] An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation
hugofry · 2024-10-07T08:53:14.658Z · comments (0)

Free Will and Dodging Anvils: AIXI Off-Policy
Cole Wyeth (Amyr) · 2024-08-29T22:42:24.485Z · comments (12)

An anti-inductive sequence
Viliam · 2024-08-14T12:28:54.226Z · comments (10)

Debate: Is it ethical to work at AI capabilities companies?
Ben Pace (Benito) · 2024-08-14T00:18:38.846Z · comments (21)

Winners of the Essay competition on the Automation of Wisdom and Philosophy
AI Impacts (AI Imacts) · 2024-10-28T17:10:04.272Z · comments (3)

I'm creating a deep dive podcast episode about the original Leverage Research - would you like to take part?
spencerg · 2024-09-22T14:03:22.164Z · comments (2)

[link] Shifting Headspaces - Transitional Beast-Mode
Jonathan Moregård (JonathanMoregard) · 2024-08-12T13:02:06.120Z · comments (9)

But Where do the Variables of my Causal Model come from?
Dalcy (Darcy) · 2024-08-09T22:07:57.395Z · comments (1)

[link] Big tech transitions are slow (with implications for AI)
jasoncrawford · 2024-10-24T14:25:06.873Z · comments (16)

We’re not as 3-Dimensional as We Think
silentbob · 2024-08-04T14:39:16.799Z · comments (16)

Eye contact is effortless when you’re no longer emotionally blocked on it
Chipmonk · 2024-09-27T21:47:01.970Z · comments (24)

[link] On Fables and Nuanced Charts
Niko_McCarty (niko-2) · 2024-09-08T17:09:07.503Z · comments (2)

Monthly Roundup #22: September 2024
Zvi · 2024-09-17T12:20:08.297Z · comments (10)

Video and transcript of presentation on Otherness and control in the age of AGI
Joe Carlsmith (joekc) · 2024-10-08T22:30:38.054Z · comments (1)

[link] My Model of Epistemology
adamShimi · 2024-08-31T17:01:45.472Z · comments (0)

AI #88: Thanks for the Memos
Zvi · 2024-10-31T15:00:07.412Z · comments (4)

Open Problems in AIXI Agent Foundations
Cole Wyeth (Amyr) · 2024-09-12T15:38:59.007Z · comments (2)

Book Review: On the Edge: The Gamblers
Zvi · 2024-09-24T11:50:06.065Z · comments (1)

[link] My article in The Nation — California’s AI Safety Bill Is a Mask-Off Moment for the Industry
garrison · 2024-08-15T19:25:59.592Z · comments (0)

[question] If I have some money, whom should I donate it to in order to reduce expected P(doom) the most?
KvmanThinking (avery-liu) · 2024-10-03T11:31:19.974Z · answers+comments (36)

The Cognitive Bootcamp Agreement
Raemon · 2024-10-16T23:24:05.509Z · comments (0)

ARENA4.0 Capstone: Hyperparameter tuning for MELBO + replication on Llama-3.2-1b-Instruct
25Hour (aaron-kaufman) · 2024-10-05T11:30:11.953Z · comments (2)

(Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need
Sodium · 2024-10-03T19:11:58.032Z · comments (17)

[link] Book review: On the Edge
PeterMcCluskey · 2024-08-30T22:18:39.581Z · comments (0)

Augmenting Statistical Models with Natural Language Parameters
jsteinhardt · 2024-09-20T18:30:10.816Z · comments (0)


Recent comments

sable on What TMS is like

I do think there's something to that idea - physical injury and pain is a very universal and visible experience, whereas mental illness is difficult to parse for those who've never experienced it. I also think there's some sense in which 'treatment' and 'cure' are treated differently for mental and physical illness.

A doctor wouldn't just prescribe painkillers for a broken arm and call it a day because your symptoms have been dealt with; they'd want to actually fix the problem. Depression, on the other hand, doctors seem perfectly fine with merely mitigating the symptoms. Perhaps because that's all they're confident they can do?

avturchin on What TMS is like

BTW, memantine is a weak (but legal) analog of ketamine, and it helped me cure my depression.

raemon on Fluent, Cruxy Predictions

Update:

At the time I posted this, it was at a stage where I had just unlocked the skill and felt its promise, but it hadn't really paid off yet. I feel like it has paid off in some important ways since then.

Some of my first round of "Important, Cruxy Predictions" (from 6 months ago, 3 months before I made this post), resolved recently. Many of them were of the form "6 months from now will I have recently used [various rationality techniques I was inventing]", of which "Fluent, Cruxy Predictions" was one.

I had given relatively low chances of most of them panning out. But, it turns out I've used basically my entire toolkit in the past ~week.

I'd also made an overall prediction of "a year from now (6 months ago) will the Lightcone team be using various tools." This had essentially been the motivating example behind the "Murphijitsu, and refusing to be satisfied with 'maybe'" section. I noticed I didn't feel optimistic about Lightcone adopting any of the stuff I was generating. Then I made a set of predictions where "in worlds where it was overwhelmingly obvious that these tools were going to end up helping the rest of Lightcone, what other observations would I make along the way?". This was helpful in clarifying further actions I needed to take for it to work (such as "find ways to mention it more often that felt helpful instead of overbearing").

I've more recently started workshopping a "group decisionmaking" UI that helps turn vague strategy disagreements into concrete disagreements. You can see a prototype spreadsheet here.

sable on What TMS is like

I didn't, but I'm not surprised your sister had that experience. It's a loud, repetitive noise going off next to your ear. My clinic offered me earplugs, which I didn't need, but perhaps your sister could have used?

akash-wasil on The Compendium, A full argument about extinction risk from AGI

I like the section where you list out specific things you think people should do. (One objection I sometimes hear is something like "I know that [evals/RSPs/if-then plans/misc] are not sufficient, but I just don't really know what else there is to do. It feels like you either have to commit to something tangible that doesn't solve the whole problem or you just get lost in a depressed doom spiral.")

I think your section on suggestions could be stronger by presenting more ambitious/impactful stories of comms/advocacy. I think there's something tricky about a document that has the vibe "this is the most important issue in the world and pretty much everyone else is approaching it the wrong way" and then pivots to "and the right way to approach it is to post on Twitter and talk to your friends."

My guess is that you prioritized listing things that were relatively low friction and accessible. (And tbc I do think that the world would be in better shape if more people were sharing their views and contributing to the broad discourse.)

But I think when you're talking to high-context AIS people who are willing to devote their entire career to work on AI Safety, they'll be interested in more ambitious/sexy ways of contributing.

Put differently: Should I really quit my job at [fancy company or high-status technical safety group] to Tweet about my takes, talk to my family/friends, and maybe make some website? Or are there other paths I could pursue?

As I wrote here, I think we have some of those ambitious/sexy/high-impact role models that could be used to make this pitch stronger, more ambitious, and more inspiring. EG:

One possible critique is that their suggestions are not particularly ambitious. This is likely because they're writing for a broader audience (people who haven't been deeply engaged in AI safety).
For people who have been deeply engaged in AI safety, I think the natural steelman here is "focus on helping the public/government better understand the AI risk situation."
There are at least some impactful and high-status examples of this (e.g., Hinton, Bengio, Hendrycks). I think in the last few years, for instance, most people would agree that Hinton/Bengio/Hendrycks have had far more impact in their communications/outreach/policy work than their technical research work.
And it's not just the famous people– I can think of ~10 junior or mid-career people who left technical research in the last year to help policymakers better understand AI progress and AI risk, and I think their work is likely far more impactful than if they had stayed in technical research. (And I'm even excluding people who are working on evals/if-then plans: like, I'm focusing on people who see their primary purpose as helping the public or policymakers develop "situational awareness", develop stronger models of AI progress and AI risk, understand the conceptual arguments for misalignment risk, etc.)

I'd also be curious to hear what your thoughts are on people joining government organizations (like the US AI Safety Institute, UK AI Safety Institute, Horizon Fellowship, etc.) Most of your suggestions seem to involve contributing from outside government, and I'd be curious to hear more about your suggestions for people who are either working in government or open to working in government.

tamsin-leake on (draft) Cyborg software should be open (?)

I think (not sure!) the damage from people/orgs/states going "wow, AI is powerful, I will try to build some" is larger than the upside of people/orgs/states going "wow, AI is powerful, I should be scared of it". It only takes one sufficiently strong instance of the former to kill everyone, and the latter is gonna have a very hard time stopping all of them.

By not informing the public that AI is indeed powerful, awareness of that fact is disproportionately allocated to people who will choose to think hard about it on their own, and thus that knowledge is more likely to be in reasonabler hands (for example they'd also be more likely to think "hmm maybe I shouldn't build unaligned powerful AI").

The same goes for cyborg tools, as well as general insights about AI: we should want them to be more accessible to alignment people than to the general public.

In fact, my biggest criticism of OpenAI is not that they built GPTs, but that they productized them, made them widely available, and created a giant public frenzy about LLMs. I think we'd have more time to solve alignment if they had kept them internal and the public wasn't thinking about AI nearly as much.

notfnofn on I turned decision theory problems into memes about trolleys

Oh I missed the quotations; you're right

raemon on JargonBot Beta Test

Yeah, seems good for us to build that today.

akash-wasil on johnswentworth's Shortform

One of the common arguments in favor of investing more resources into current governance approaches (e.g., evals, if-then plans, RSPs) is that there's nothing else we can do. There's not a better alternative– these are the only things that labs and governments are currently willing to support.

The Compendium argues that there are other (valuable) things that people can do, with most of these actions focusing on communicating about AGI risks. Examples:

Share a link to this Compendium online or with friends, and provide your feedback on which ideas are correct and which are unconvincing. This is a living document, and your suggestions will shape our arguments.
Post your views on AGI risk to social media, explaining why you believe it to be a legitimate problem (or not).
Red-team companies’ plans to deal with AI risk, and call them out publicly if they do not have a legible plan.

One possible critique is that their suggestions are not particularly ambitious. This is likely because they're writing for a broader audience (people who haven't been deeply engaged in AI safety).

For people who have been deeply engaged in AI safety, I think the natural steelman here is "focus on helping the public/government better understand the AI risk situation."

There are at least some impactful and high-status examples of this (e.g., Hinton, Bengio, Hendrycks). I think in the last few years, for instance, most people would agree that Hinton/Bengio/Hendrycks have had far more impact in their communications/outreach/policy work than their technical research work.

And it's not just the famous people– I can think of ~10 junior or mid-career people who left technical research in the last year to help policymakers better understand AI progress and AI risk, and I think their work is likely far more impactful than if they had stayed in technical research. (And I'm even excluding people who are working on evals/if-then plans: like, I'm focusing on people who see their primary purpose as helping the public or policymakers develop "situational awareness", develop stronger models of AI progress and AI risk, understand the conceptual arguments for misalignment risk, etc.)

tapatakt on I turned decision theory problems into memes about trolleys

The point is that Omega would not send it to you if it was false, and Omega would always send it to you if it was true.