LessWrong 2.0 Reader


← previous page (newer posts) · next page (older posts) →

Transfer learning and generalization-qua-capability in Babbage and Davinci (or, why division is better than Spanish)
RP (Complex Bubble Tea) · 2024-02-09T07:00:45.825Z · comments (6)

[link] Announcing Human-aligned AI Summer School
Jan_Kulveit · 2024-05-22T08:55:10.839Z · comments (0)

n of m ring signatures
DanielFilan · 2023-12-04T20:00:06.580Z · comments (7)

AI #52: Oops
Zvi · 2024-02-22T21:50:07.393Z · comments (9)

On Complexity Science
Garrett Baker (D0TheMath) · 2024-04-05T02:24:32.039Z · comments (19)

Vipassana Meditation and Active Inference: A Framework for Understanding Suffering and its Cessation
Benjamin Sturgeon (benjamin-sturgeon) · 2024-03-21T12:32:22.475Z · comments (8)

Toy models of AI control for concentrated catastrophe prevention
Fabien Roger (Fabien) · 2024-02-06T01:38:19.865Z · comments (2)

Altman firing retaliation incoming?
trevor (TrevorWiesinger) · 2023-11-19T00:10:15.645Z · comments (23)

Scenario Forecasting Workshop: Materials and Learnings
elifland · 2024-03-08T02:30:46.517Z · comments (3)

[link] on the dollar-yen exchange rate
bhauth · 2024-04-07T04:49:53.920Z · comments (21)

Observations on Teaching for Four Weeks
ClareChiaraVincent · 2024-05-06T16:55:59.315Z · comments (14)

GPT-2030 and Catastrophic Drives: Four Vignettes
jsteinhardt · 2023-11-10T07:30:06.480Z · comments (5)

Changes in College Admissions
Zvi · 2024-04-24T13:50:03.487Z · comments (11)

[link] Finding Backward Chaining Circuits in Transformers Trained on Tree Search
abhayesian · 2024-05-28T05:29:46.777Z · comments (1)

[link] A starter guide for evals
Marius Hobbhahn (marius-hobbhahn) · 2024-01-08T18:24:23.913Z · comments (2)

Apply to the Conceptual Boundaries Workshop for AI Safety
Chipmonk · 2023-11-27T21:04:59.037Z · comments (0)

On Overhangs and Technological Change
Roko · 2023-11-05T22:58:51.306Z · comments (19)

The Shortest Path Between Scylla and Charybdis
Thane Ruthenis · 2023-12-18T20:08:34.995Z · comments (8)

Gemini 1.0
Zvi · 2023-12-07T14:40:05.243Z · comments (7)

AI #82: The Governor Ponders
Zvi · 2024-09-19T13:30:04.863Z · comments (8)

Applications of Chaos: Saying No (with Hastings Greer)
Elizabeth (pktechgirl) · 2024-09-21T16:30:07.415Z · comments (16)

[link] Peak Human Capital
PeterMcCluskey · 2024-09-30T21:13:30.421Z · comments (2)

Low Probability Estimation in Language Models
Gabriel Wu (gabriel-wu) · 2024-10-18T15:50:05.947Z · comments (0)

Bounty: Diverse hard tasks for LLM agents
Beth Barnes (beth-barnes) · 2023-12-17T01:04:05.460Z · comments (31)

An issue with training schemers with supervised fine-tuning
Fabien Roger (Fabien) · 2024-06-27T15:37:56.020Z · comments (12)

They are made of repeating patterns
quetzal_rainbow · 2023-11-13T18:17:43.189Z · comments (4)

Public Weights?
jefftk (jkaufman) · 2023-11-02T02:50:18.095Z · comments (19)

AI #67: Brief Strange Trip
Zvi · 2024-06-06T18:50:03.514Z · comments (6)

Should rationalists be spiritual / Spirituality as overcoming delusion
Kaj_Sotala · 2024-03-25T16:48:08.397Z · comments (57)

Job listing: Communications Generalist / Project Manager
Gretta Duleba (gretta-duleba) · 2023-11-06T20:21:03.721Z · comments (7)

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models
Felix Hofstätter · 2023-11-08T11:37:43.997Z · comments (0)

[question] why did OpenAI employees sign
bhauth · 2023-11-27T05:21:28.612Z · answers+comments (23)

Consent across power differentials
Ramana Kumar (ramana-kumar) · 2024-07-09T11:42:03.177Z · comments (12)

The Broken Screwdriver and other parables
bhauth · 2024-03-04T03:34:38.807Z · comments (1)

[link] in defense of Linus Pauling
bhauth · 2024-06-03T21:27:43.962Z · comments (8)

[link] Anthropic announces interpretability advances. How much does this advance alignment?
Seth Herd · 2024-05-21T22:30:52.638Z · comments (4)

AI #58: Stargate AGI
Zvi · 2024-04-04T13:10:06.342Z · comments (9)

[link] DM Parenting
Shoshannah Tekofsky (DarkSym) · 2024-07-16T08:50:08.144Z · comments (4)

[LDSL#0] Some epistemological conundrums
tailcalled · 2024-08-07T19:52:55.688Z · comments (10)

Please do not use AI to write for you
Richard_Kennaway · 2024-08-21T09:53:34.425Z · comments (34)

Wrong answer bias
lukehmiles (lcmgcd) · 2024-02-01T20:05:38.573Z · comments (24)

Notes on control evaluations for safety cases
ryan_greenblatt · 2024-02-28T16:15:17.799Z · comments (0)

Book Review: Righteous Victims - A History of the Zionist-Arab Conflict
Yair Halberstadt (yair-halberstadt) · 2024-06-24T11:02:03.490Z · comments (8)

[link] Chapter 1 of How to Win Friends and Influence People
gull · 2024-01-28T00:32:52.865Z · comments (5)

So you want to work on technical AI safety
gw · 2024-06-24T14:29:57.481Z · comments (3)

Interoperable High Level Structures: Early Thoughts on Adjectives
johnswentworth · 2024-08-22T21:12:38.223Z · comments (1)

How to do conceptual research: Case study interview with Caspar Oesterheld
Chi Nguyen · 2024-05-14T15:09:30.390Z · comments (5)

Highlights from Lex Fridman’s interview of Yann LeCun
Joel Burget (joel-burget) · 2024-03-13T20:58:13.052Z · comments (15)

[link] JumpReLU SAEs + Early Access to Gemma 2 SAEs
Senthooran Rajamanoharan (SenR) · 2024-07-19T16:10:54.664Z · comments (10)

SRE's review of Democracy
Martin Sustrik (sustrik) · 2024-08-03T07:20:01.483Z · comments (2)

Recent comments

jacob_hilton on A bird's eye view of ARC's research

Thank you – this is probably the best critique of ARC's research agenda that I have read since we started working on heuristic explanations. This level of thoughtfulness in external feedback is very rare and I'm grateful for the detail and clarity you put into it. I don't think my response fully rebuts your central concern, but hopefully it gives a sense of my current thinking about it.

It sounds like we are in agreement that something very loosely heuristic explanation-flavored (interpreted so broadly as to include mechanistic interpretability, for example) can reasonably be placed at the root of the diagram, by which I mean that it's productive to try to explain neural network behaviors in this very loose sense and to attempt to apply such explanations to downstream applications such as MAD/LPE/ELK etc. We begin to diverge, I think, about the extent to which ARC should focus on a more narrow conception of heuristic explanations. From least to most specific:

(1) Any version that is primarily mathematical rather than "story-centric"
(2) Some (mathematical) version that is consistent with our information-theoretic intuitions about what constitutes a valid explanation (i.e., in the sense of something like surprise accounting)
(3) Some such version that is loosely based on independence assumptions
(4) Some version that satisfies more specific desiderata for heuristic estimators (such as the ones discussed in our papers)

Opinions at ARC will differ, but (1) I feel pretty comfortable defending, (2) I think is quite a promising option to be considering, (3) seems like a reasonable best guess but I don't think we should be that wedded to it, and (4) I think is probably too specific (and with the benefit of hindsight I think we have focused too much on this in the past). ARC's research has actually been trending in the "less specific" direction over time, as should hopefully be evident from our most recent write-ups (with the exception of our recent paper on specific desiderata, which mostly covers work done in 2023), and I am quite unsure exactly where we should settle on this axis.

By contrast, my impression is that you would not really defend even (1) (although I am curious exactly where you come down on this axis, if you want to clarify). So I'll give what I see as the basic case for searching for a mathematical rather than a "story-centric" approach:

Mechanistic interpretability has so far yielded very little in the way of beating baselines at downstream tasks (this has been discussed at length elsewhere, see for example here, here and here), so I think it should still be considered a largely unproven approach (to be clear, this is roughly my view of all alignment approaches that aren't already in active use at labs, including ARC's, and I remain excited to see people's continued valiant attempts; my point is that the bar is low and a portfolio approach is appropriate).
Relying purely on stories clearly doesn't work at sufficient scale under worst-case assumptions (because the AI will have concepts you don't have words for), and there isn't a lot of evidence that this isn't indeed already a bottleneck in practice (i.e., current AIs may well already have concepts you don't have words for).
I think that ARC's worst-case, theoretical approach (described at zoom level 1) is an especially promising alternative to iterative, empirically-driven work. I think empirical approaches are more promising overall, but have correlated failure modes (namely, they could end up relying on correlated empirical contingencies that later turn out to be false), and have far more total effort going into them (arguably disproportionately so). Conditional on taking such an approach, story-centric methods don't seem super viable (how should one analyze stories theoretically?).
I don't really buy the argument that because a system has a lot of complexity, it can only be analyzed in ad-hoc ways. It seems to me that an analogous argument would have failed to make good predictions about the bitter lesson (i.e., by arguing that a simple algorithm like SGD should not be capable of producing great complexity in a targeted way). Instead, because neural nets are trained in an incremental, automated way based on mathematical principles, it seems quite possible to me that we can find explanations for them in a similar way (which is not an argument that can be applied to biological brains).

This doesn't of course defend (2)–(4) (which I would only want to do more weakly in any case). We've tried to get our intuitions for those across in our write-ups (as linked in (2) and (4) above), but I'm not sure there's anything succinct I can add here if those were unconvincing. I agree that puts us in the rather unfortunate position of sharing a reference class with Stephen Wolfram to many external observers (although hopefully our claims are not quite so overstated).

I think it's important for ARC to recognize this tension, and to strike the right balance between making our work persuasive to external skeptics on the one hand, and having courage in our convictions on the other hand (I think both have been important virtues in scientific development historically). Concretely, my current best guess is that ARC should:

(a) Avoid being too wedded to intuitive desiderata for heuristic explanations that we can't directly tie back to specific applications
(b) Search for concrete cases that put our intuitions to the test, so that we can quickly reach a point where either we no longer believe in them, or they are more convincing to others
(c) Also pursue research that is more agnostic to the specific form of explanation, such as work on low probability estimation or other applications
(d) Stay on the lookout for ideas from alternative theoretical approaches (including singular learning theory, sparsity-based approaches, computational mechanics, causal abstractions, and neural net-oriented varieties of agent foundations), although my sense is that object-level intuitions here just differ enough that it's difficult to collaborate productively. (Separately, I'd argue that proponents of all these alternatives are in a similar predicament, and could generally be doing a better job on analogous versions of (a)–(c).)

I think we have been doing all of (a)–(d) to some extent already, although I imagine you would argue that we have not been going far enough. I'd be interested in more thoughts on how to strike the right balance here.

charbel-raphael on Are we dropping the ball on Recommendation AIs?

I'm a bit surprised this post has so little karma and engagement. I would be really interested to hear from people who think this is a complete distraction.

raemon on Drake Thomas's Shortform

(I also failed to interpret the OP correctly, although I might have been primed by Ryan's comment. Whoops)

deepthoughtlife on LLM Generality is a Timeline Crux

No problem with the failure to respond. I appreciate that this way of communicating is asynchronous (and I don't necessarily reply to things promptly either). And I think it would be reasonable to drop it at any point if it didn't seem valuable.

Also, you're welcome.

romeostevensit on What is malevolence? On the nature, measurement, and distribution of dark traits

I think this is an important topic and am glad to see substantial scholarship efforts on it.

Wrt AI relevance: I think the meme that it matters a lot who builds the self-activating doomsday device has done potentially quite a bit of harm and may be a main contributor to what kills us.

Wrt people detecting these traits: I personally feel that the self-domestication of humans has made us easier targets for such people, and undermined our ability to even think of doing anything about them. I don't think this is entirely random.

deepthoughtlife on LLM Generality is a Timeline Crux

Sorry, I don't have a link for using actual compression algorithms, it was a while ago. I didn't think it would come up so I didn't note anything down. My recent spate of commenting is unusual for me (and I don't actually keep many notes on AI related subjects).

I definitely agree that it is 'hard to judge' 'more novel and more requiring of intelligence'. It is, after all, a major thing we don't even know how to clearly solve for evaluating other humans (so we use tricks that often rely on other things, and these tricks likely do not generalize to other possible intelligences and thus couldn't be used here). Intelligence has not been solved.

Still, there is a big difference between the level of intelligence required when discussing how great your favorite popstar is vs what in particular they are good at vs why they are good at it (and within each category there are much more or less intellectual ways to write about it, though intellectual should not be confused with intelligent). It would have been nice if I could think up good examples, but I couldn't. You could possibly check things like how well it completes things like parts of this conversation (which is somewhere in the middle).

I wasn't really able to properly evaluate your links. There's just too much they assume that I don't know.

I found your first link, 'Transformers Learn In-Context by Gradient Descent' a bit hard to follow (though I don't particularly think it is a fault of the paper itself). Once they get down to the details, they lose me. It is interesting that it would come up with similar changes based on training and just 'reading' the context, but if both mechanisms are simple, I suppose that makes sense.

Their claim about how in-context learning can 'curve' better also reminds me of the ODEs used for samplers in diffusion models (I've written a number of samplers for diffusion models as a hobby and to work on my programming). Higher degree ODEs curve more too (though they have their own drawbacks, and particularly high degree is generally a bad idea) by using extra samples, just like this can use extra layers. Gradient descent is effectively first degree by default, right? So it wouldn't be a surprise if you can curve more than it. You would expect sufficiently general things to resemble each other of course. I do find it a bit strange just how similar the loss for steps of gradient descent and transformer layers is. (Random point: I find that loss is not a very good metric for how good the actual results are, at least in image generation/reconstruction. Not that I know of a good replacement. People do often come up with various different ways of measuring it though.)
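To make the "first degree" point concrete: plain gradient descent is one explicit Euler step on the gradient-flow ODE dx/dt = -∇f(x), and a second-order method such as Heun's can follow the curved trajectory more closely at the same step size. The sketch below is only an illustration; the toy loss f(x) = x²/2, the step size, and the function names are assumptions, not anything taken from the linked papers.

```python
import math

def grad(x):
    # Gradient of the toy loss f(x) = x**2 / 2 (assumed for illustration).
    return x

def euler_step(x, h):
    # Plain gradient descent: one explicit Euler step on dx/dt = -grad(x).
    return x - h * grad(x)

def heun_step(x, h):
    # Heun's method (second order): average the slope at the start point
    # and at the Euler prediction, at the cost of one extra gradient call.
    k1 = -grad(x)
    k2 = -grad(x + h * k1)
    return x + 0.5 * h * (k1 + k2)

x0, h = 1.0, 0.5
exact = x0 * math.exp(-h)  # closed-form gradient-flow solution for this loss

euler_err = abs(euler_step(x0, h) - exact)
heun_err = abs(heun_step(x0, h) - exact)
print(f"Euler error: {euler_err:.4f}, Heun error: {heun_err:.4f}")
```

The extra gradient evaluation in `heun_step` plays the role of the "extra samples" mentioned above.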

Even though I can't critique the details, I do think it is important to note that I often find claims of similarity like this in areas I understand better to not be very illuminating because people want to find similarities/analogies to understand it more easily.

The graphs really are shockingly similar in the single-layer case, though, which raises the likelihood that there's something to it. And the multi-layer ones really do seem like simply a higher-degree polynomial ODE.

The second link, 'In-context Learning and Gradient Descent Revisited', which was equally difficult, has this line: "Surprisingly, we find that untrained models achieve similarity scores at least as good as trained ones. This result provides strong evidence against the strong ICL-GD correspondence." That sounds pretty damning to me, assuming they are correct (which I also can't evaluate).

I could probably figure them out, but I expect it would take me a lot of time.

andy-arditi on Refusal in LLMs is mediated by a single direction

We ablate the direction everywhere for simplicity - intuitively this prevents the model from ever representing the direction in its computation, and so a behavioral change that results from the ablation can be attributed to mediation through this direction.

However, we noticed empirically that it is not necessary to ablate the direction at all layers in order to bypass refusal. Ablating at a narrow local region (2-3 middle layers) can be just as effective as ablating across all layers, suggesting that the direction is "read" or "processed" at some local region.
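For readers who want the operation spelled out, ablating a direction amounts to projecting it out of the activations: x' = x - (x · r̂) r̂. The snippet below is a minimal sketch of that projection on random data, not the authors' code; the function name `ablate_direction`, the array shapes, and the random "refusal direction" are all assumptions for illustration.

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove the component of each activation row along `direction`.

    acts: (n_tokens, d_model) array of activations (hypothetical shape).
    direction: (d_model,) vector; it is normalized here before projecting.
    """
    r = direction / np.linalg.norm(direction)
    # Subtract each row's projection onto the unit vector r.
    return acts - np.outer(acts @ r, r)

rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))      # stand-in for residual-stream activations
refusal_dir = rng.normal(size=8)    # stand-in for the extracted direction

ablated = ablate_direction(acts, refusal_dir)
# After ablation, no activation has any component along the direction.
print(np.abs(ablated @ refusal_dir).max())
```

Ablating "everywhere" would mean applying this at every layer and position, while the local variant described above would apply it only at a few chosen middle layers.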

cameron-berg on Self-prediction acts as an emergent regularizer

Thanks for this! Consider the self-modeling loss gradient: $\partial L_{\text{self}} / \partial W = 2 E A A^{\top}$, where $E = W - I$. While the identity function would globally minimize the self-modeling loss with zero loss for all inputs (effectively eliminating the task's influence by zeroing out its gradients), SGD learns local optima rather than global optima, and the gradients don't point directly toward the identity solution. The gradient depends on both the deviation from identity ($E$) and the activation covariance ($A A^{\top}$), with the network balancing this against the primary task loss. Since the self-modeling prediction isn't just a separate output block (it's predicting the full activation pattern), the interaction between the primary task loss, the activation covariance structure ($A A^{\top}$), and the need to maintain useful representations creates a complex optimization landscape where local optima dominate. We see this empirically in the consistent non-zero $W - I$ difference during training.

cameron-berg on Self-prediction acts as an emergent regularizer

The comparison to activation regularization is quite interesting. When we write down the self-modeling loss in terms of the self-modeling layer, we get $\| W a + b - a \|^{2} = \| (W - I) a + b \|^{2} = \| E a + b \|^{2}$.

This does resemble activation regularization, with the strength of regularization attenuated by how far the weight matrix is from identity (the magnitude of $E$). However, due to the recurrent nature of this loss (where updates to the weight matrix depend on activations that are themselves being updated by the loss), the resulting dynamics are quite complex. Looking at the gradient $\partial L_{\text{self}} / \partial W = 2 E A A^{\top}$, we see that self-modeling depends on the full covariance structure of activations, not just pushing them toward zero or any fixed vector. The network must learn to actively predict its own evolving activation patterns rather than simply constraining their magnitude.
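As a quick sanity check of the gradient expression, the sketch below compares $2 E A A^{\top}$ against a finite-difference gradient. It rests on two assumptions that are not stated in the comment: the bias $b$ is set to zero, and the loss is summed (not averaged) over the columns of $A$, i.e. $L_{\text{self}} = \| E A \|_F^{2}$. The dimensions and random data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5                      # activation width and batch size (arbitrary)
W = rng.normal(size=(d, d))      # self-modeling layer weights
A = rng.normal(size=(d, n))      # activations, one column per example

def self_loss(W):
    # Assumed form: b = 0, loss summed over the batch.
    E = W - np.eye(d)
    return np.sum((E @ A) ** 2)

# Analytic gradient from the comment: dL/dW = 2 E A A^T.
analytic = 2 * (W - np.eye(d)) @ A @ A.T

# Central finite differences, one weight entry at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(d):
    for j in range(d):
        dW = np.zeros_like(W)
        dW[i, j] = eps
        numeric[i, j] = (self_loss(W + dW) - self_loss(W - dW)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

With a non-zero bias the gradient would pick up an extra term involving $b$, so the match here holds only under the stated assumptions.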

Comparing the complexity measures (SD & RLCT) between self-modeling and activation regularization is a great idea and we will definitely add this to the roadmap and report back. And batch norm/other forms of regularization were not added.

dagon on Derivative AT a discontinuity

What’s its derivative?

The graph is nonstandard and misleading. It should not have a vertical segment at 0; it should have an open circle at (0,0) and a closed circle at (0,1), showing that the lines do not and do contain the 0 point, respectively.

This makes the intuition pump a little easier. The derivative at all nonzero Xs is 0. The derivative AT ZERO is 0 to the right (as X increases), and undefined to the left (as X decreases). There is no connection between 0 and 0 - epsilon, and therefore no slope.
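One way to see this numerically is to compute one-sided difference quotients at 0. The sketch assumes the step function takes the value 1 at 0 (the closed circle at (0,1)) and 0 for negative inputs; the function names are made up for illustration.

```python
def step(x):
    # Assumed definition: 0 for x < 0, 1 for x >= 0 (closed circle at (0, 1)).
    return 1.0 if x >= 0 else 0.0

def right_quotient(f, x, h):
    # Slope of the secant from x to x + h.
    return (f(x + h) - f(x)) / h

def left_quotient(f, x, h):
    # Slope of the secant from x - h to x.
    return (f(x) - f(x - h)) / h

for h in (1e-1, 1e-3, 1e-6):
    print(h, right_quotient(step, 0.0, h), left_quotient(step, 0.0, h))
```

The right-hand quotient is 0 for every h, while the left-hand quotient equals 1/h and grows without bound as h shrinks, so no finite left-hand slope exists.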

You CAN use more complicated models to describe some features of it (hyperreals, or just limits), but those are modeling tools to answer different questions than the intuitive use of derivative (slope of a continuous curve). It's probably not right to say that any of them are "true", without some caveats.