LessWrong 2.0 Reader

[link] World Models I'm Currently Building
temporary · 2024-12-15T16:29:08.287Z · comments (1)
Why empiricists should believe in AI risk
Knight Lee (Max Lee) · 2024-12-11T03:51:17.979Z · comments (0)
Logic vs intuition <=> algorithm vs ML
pchvykov · 2025-01-04T09:06:51.822Z · comments (0)
Project Adequate: Seeking Cofounders/Funders
Lorec · 2024-11-17T03:12:12.995Z · comments (7)
Theories With Mentalistic Atoms Are As Validly Called Theories As Theories With Only Non-Mentalistic Atoms
Lorec · 2024-11-12T06:45:26.039Z · comments (5)
Printable book of some rationalist creative writing (from Scott A. & Eliezer)
CounterBlunder · 2024-12-23T15:44:31.437Z · comments (0)
"Alignment at Large": Bending the Arc of History Towards Life-Affirming Futures
welfvh · 2024-12-03T21:17:56.466Z · comments (0)
Towards mutually assured cooperation
mikko (morrel) · 2024-12-22T20:46:21.965Z · comments (0)
Reducing x-risk might be actively harmful
MountainPath · 2024-11-18T14:25:07.127Z · comments (5)
Linkpost: Look at the Water
J Bostock (Jemist) · 2024-12-30T19:49:04.107Z · comments (3)
Morality as Cooperation Part III: Failure Modes
DeLesley Hutchins (delesley-hutchins) · 2024-12-05T09:39:27.816Z · comments (0)
Visualizing small Attention-only Transformers
WCargo (Wcargo) · 2024-11-19T09:37:42.213Z · comments (0)
Levels of Thought: from Points to Fields
HNX · 2024-12-02T20:25:02.802Z · comments (2)
Fred the Heretic, a GPT for poetry
Bill Benzon (bill-benzon) · 2024-12-08T16:52:07.660Z · comments (0)
[question] Has Anthropic checked if Claude fakes alignment for intended values too?
Maloew (maloew-valenar) · 2024-12-23T00:43:07.490Z · answers+comments (1)
Vision of a positive Singularity
RussellThor · 2024-12-23T02:19:35.050Z · comments (0)
Investing in Robust Safety Mechanisms is critical for reducing Systemic Risks
Tom DAVID (tom-david) · 2024-12-11T13:37:24.177Z · comments (3)
On AI Detectors Regarding College Applications
Kaustubh Kislay (kaustubh-kislay) · 2024-11-27T20:25:48.151Z · comments (2)
Effects of Non-Uniform Sparsity on Superposition in Toy Models
Shreyans Jain (shreyans-jain) · 2024-11-14T16:59:43.234Z · comments (3)
Model Integrity
ryan.lowe · 2024-12-06T21:28:20.775Z · comments (1)
[link] Can AI improve the current state of molecular simulation?
Abhishaike Mahajan (abhishaike-mahajan) · 2024-12-06T20:22:31.685Z · comments (0)
What are Emotions?
Myles H (zarsou9) · 2024-11-15T04:20:27.388Z · comments (13)
ARC-AGI is a genuine AGI test but o3 cheated :(
Knight Lee (Max Lee) · 2024-12-22T00:58:05.447Z · comments (6)
A better “Statement on AI Risk?”
Knight Lee (Max Lee) · 2024-11-25T04:50:29.399Z · comments (6)
More Growth, Melancholy, and MindCraft @3QD [revised and updated]
Bill Benzon (bill-benzon) · 2024-12-05T19:36:02.289Z · comments (0)
[question] What (if anything) made your p(doom) go down in 2024?
Satron · 2024-11-16T16:46:43.865Z · answers+comments (6)
Good Fortune and Many Worlds
Jonah Wilberg (jrwilb@googlemail.com) · 2024-12-27T13:21:43.142Z · comments (0)
[link] Expevolu, a laissez-faire approach to country creation
Fernando · 2024-12-05T19:29:24.011Z · comments (4)
Are SAE features from the Base Model still meaningful to LLaVA?
Shan23Chen (shan-chen) · 2024-12-05T19:24:34.727Z · comments (0)
Grokking revisited: reverse engineering grokking modulo addition in LSTM
Nikita Khomich (nikitoskh) · 2024-12-16T18:48:43.533Z · comments (0)
Germany-wide ACX Meetup
Fernand0 · 2024-11-17T10:08:54.584Z · comments (0)
Dishbrain and implications.
RussellThor · 2024-12-29T10:42:43.912Z · comments (0)
[question] Are there ways to artificially fix laziness?
Aidar (aidar-toktargazin) · 2024-12-08T18:26:26.433Z · answers+comments (2)
[link] Entropic strategy in Two Truths and a Lie
dkl9 · 2024-11-21T22:03:28.986Z · comments (2)
Activation Magnitudes Matter On Their Own: Insights from Language Model Distributional Analysis
Matt Levinson · 2025-01-10T06:53:02.228Z · comments (0)
[link] When the Scientific Method Doesn't Really Help...
casualphysicsenjoyer (hatta_afiq) · 2024-11-27T19:52:30.023Z · comments (1)
You are too dumb to understand insurance
Lorec · 2025-01-09T23:33:53.778Z · comments (6)
Hope to live or fear to die?
Knight Lee (Max Lee) · 2024-11-27T10:42:37.070Z · comments (0)
[question] How do you decide to phrase predictions you ask of others? (and how do you make your own?)
CstineSublime · 2025-01-10T02:44:26.737Z · answers+comments (0)
Workshop Report: Why current benchmark approaches are not sufficient for safety?
Tom DAVID (tom-david) · 2024-11-26T17:20:47.453Z · comments (1)
[question] How should I optimize my decision making model for 'ideas'?
CstineSublime · 2024-12-18T04:09:58.025Z · answers+comments (0)
Should you increase AI alignment funding, or increase AI regulation?
Knight Lee (Max Lee) · 2024-11-26T09:17:01.809Z · comments (1)
notes on prioritizing tasks & cognition-threads
Emrik (Emrik North) · 2024-11-26T00:28:03.400Z · comments (1)
[question] Are Sparse Autoencoders a good idea for AI control?
Gerard Boxo (gerard-boxo) · 2024-12-26T17:34:55.617Z · answers+comments (2)
[link] The Polite Coup
Charlie Sanders (charlie-sanders) · 2024-12-04T14:03:36.663Z · comments (0)
The boat
RomanS · 2024-11-22T12:56:45.050Z · comments (0)
Don't want Goodhart? — Specify the variables more
YanLyutnev (YanLutnev) · 2024-11-21T22:43:48.362Z · comments (2)
[link] What is Confidence—in Game Theory and Life?
James Stephen Brown (james-brown) · 2024-12-10T23:06:24.072Z · comments (0)
[question] 2025 Alignment Predictions
anaguma · 2025-01-02T05:37:36.912Z · answers+comments (3)
3. Improve Cooperation: Better Technologies
Allison Duettmann (allison-duettmann) · 2025-01-02T19:03:16.588Z · comments (2)