LessWrong 2.0 Reader

Why I think it's net harmful to do technical safety research at AGI labs
Remmelt (remmelt-ellen) · 2024-02-07T04:17:15.246Z · comments (24)

Causality is Everywhere
silentbob · 2024-02-13T13:44:49.952Z · comments (12)

Evidential Correlations are Subjective, and it might be a problem
Martín Soto (martinsq) · 2024-03-07T18:37:54.105Z · comments (6)

[link] Forecasting future gains due to post-training enhancements
elifland · 2024-03-08T02:11:57.228Z · comments (2)

What is the best argument that LLMs are shoggoths?
JoshuaFox · 2024-03-17T11:36:23.636Z · comments (22)

Consequentialism is a compass, not a judge
Neil (neil-warren) · 2024-04-13T10:47:44.980Z · comments (6)

Bayesian inference without priors
DanielFilan · 2024-04-24T23:50:08.312Z · comments (8)

Geometric Utilitarianism (And Why It Matters)
StrivingForLegibility · 2024-05-12T03:41:21.342Z · comments (2)

Smartphone Etiquette: Suggestions for Social Interactions
Declan Molony (declan-molony) · 2024-06-04T06:01:03.336Z · comments (4)

Ideas for Next-Generation Writing Platforms, using LLMs
ozziegooen · 2024-06-04T18:40:24.636Z · comments (4)

[link] Emotional issues often have an immediate payoff
Chipmonk · 2024-06-10T23:39:40.697Z · comments (2)

[link] my favourite Scott Sumner blog posts
DMMF · 2024-06-11T14:40:43.093Z · comments (0)

[link] Positive visions for AI
L Rudolf L (LRudL) · 2024-07-23T20:15:26.064Z · comments (4)

Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)

[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)

Sleeping on Stage
jefftk (jkaufman) · 2024-10-22T00:50:07.994Z · comments (3)

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?
Taras Kutsyk · 2024-09-29T19:37:30.465Z · comments (7)

[link] Conventional footnotes considered harmful
dkl9 · 2024-10-01T14:54:01.732Z · comments (16)

[link] A primer on the next generation of antibodies
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-01T22:37:59.207Z · comments (0)

Are we dropping the ball on Recommendation AIs?
Charbel-Raphaël (charbel-raphael-segerie) · 2024-10-23T17:48:00.000Z · comments (1)

[question] When can I be numerate?
FinalFormal2 · 2024-09-12T04:05:27.710Z · answers+comments (3)

[link] SB 1047 gets vetoed
ryan_b · 2024-09-30T15:49:38.609Z · comments (1)

[link] MIRI's July 2024 newsletter
Harlan · 2024-07-15T21:28:17.343Z · comments (2)

[question] How to Model the Future of Open-Source LLMs?
Joel Burget (joel-burget) · 2024-04-19T14:28:00.175Z · answers+comments (9)

[link] Structured Transparency: a framework for addressing use/mis-use trade-offs when sharing information
habryka (habryka4) · 2024-04-11T18:35:44.824Z · comments (0)

The Drowning Child
Tomás B. (Bjartur Tómas) · 2023-10-22T16:39:53.016Z · comments (8)

[link] Arrogance and People Pleasing
Jonathan Moregård (JonathanMoregard) · 2024-02-06T18:43:09.120Z · comments (7)

Clipboard Filtering
jefftk (jkaufman) · 2024-04-14T20:50:02.256Z · comments (1)

[link] OpenAI Superalignment: Weak-to-strong generalization
Dalmert · 2023-12-14T19:47:24.347Z · comments (3)

Beta Tester Request: Rallypoint Bounties
lukemarks (marc/er) · 2024-05-25T09:11:11.446Z · comments (4)

An experiment on hidden cognition
Olli Järviniemi (jarviniemi) · 2024-07-22T03:26:05.564Z · comments (2)

Using an LLM perplexity filter to detect weight exfiltration
Adam Karvonen (karvonenadam) · 2024-07-21T18:18:05.612Z · comments (11)

[link] Was a Subway in New York City Inevitable?
Jeffrey Heninger (jeffrey-heninger) · 2024-03-30T00:53:21.314Z · comments (4)

Proving the Geometric Utilitarian Theorem
StrivingForLegibility · 2024-08-07T01:39:10.920Z · comments (0)

AXRP Episode 30 - AI Security with Jeffrey Ladish
DanielFilan · 2024-05-01T02:50:04.621Z · comments (0)

Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features
Logan Riggs (elriggs) · 2024-03-15T16:30:00.744Z · comments (5)

Housing Roundup #9: Restricting Supply
Zvi · 2024-07-17T12:50:05.321Z · comments (8)

[link] Announcing Open Philanthropy's AI governance and policy RFP
Julian Hazell (julian-hazell) · 2024-07-17T02:02:39.933Z · comments (0)

[link] Executive Dysfunction 101
DaystarEld · 2024-05-23T12:43:13.785Z · comments (1)

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research
alamerton · 2024-04-18T18:29:33.892Z · comments (4)

Changing Contra Dialects
jefftk (jkaufman) · 2023-10-26T17:30:10.387Z · comments (2)

$250K in Prizes: SafeBench Competition Announcement
ozhang (oliver-zhang) · 2024-04-03T22:07:41.171Z · comments (0)

Control Symmetry: why we might want to start investigating asymmetric alignment interventions
domenicrosati · 2023-11-11T17:27:10.636Z · comments (1)

[question] What ML gears do you like?
Ulisse Mini (ulisse-mini) · 2023-11-11T19:10:11.964Z · answers+comments (4)

The Wisdom of Living for 200 Years
Martin Sustrik (sustrik) · 2024-06-28T04:44:10.609Z · comments (3)

A Triple Decker for Elfland
jefftk (jkaufman) · 2024-10-11T01:50:02.332Z · comments (0)

[link] Fictional parasites very different from our own
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-08T14:59:39.080Z · comments (0)

Fun With The Tabula Muris (Senis)
sarahconstantin · 2024-09-20T18:20:01.901Z · comments (0)

AXRP Episode 36 - Adam Shai and Paul Riechers on Computational Mechanics
DanielFilan · 2024-09-29T05:50:02.531Z · comments (0)

[link] Introduction to Super Powers (for kids!)
Shoshannah Tekofsky (DarkSym) · 2024-09-20T17:17:27.070Z · comments (0)

Recent comments

raemon on Drake Thomas's Shortform

(I also failed to interpret the OP correctly, although I might have been primed by Ryan's comment)

deepthoughtlife on LLM Generality is a Timeline Crux

No problem with the failure to respond. I appreciate that this way of communicating is asynchronous (and I don't necessarily reply to things promptly either). And I think it would be reasonable to drop it at any point if it didn't seem valuable.

Also, you're welcome.

romeostevensit on What is malevolence? On the nature, measurement, and distribution of dark traits

I think this is an important topic and am glad to see substantial scholarship efforts on it.

Wrt AI relevance: I think the meme that it matters a lot who builds the self activating doomsday device has done potentially quite a bit of harm and may be a main contributor to what kills us.

Wrt people detecting these traits: I personally feel that the self domestication of humans has made us easier targets for such people, and undermined our ability to even think of doing anything about them. I don't think this is entirely random.

deepthoughtlife on LLM Generality is a Timeline Crux

Sorry, I don't have a link for using actual compression algorithms, it was a while ago. I didn't think it would come up so I didn't note anything down. My recent spate of commenting is unusual for me (and I don't actually keep many notes on AI related subjects).

I definitely agree that it is 'hard to judge' 'more novel and more requiring of intelligence'. It is, after all, a major thing we don't even know how to clearly solve for evaluating other humans (so we use tricks that often rely on other things; these tricks likely do not generalize to other possible intelligences and thus couldn't be used here). Intelligence has not been solved.

Still, there is a big difference between the level of intelligence required when discussing how great your favorite popstar is vs what in particular they are good at vs why they are good at it (and within each category there are much more or less intellectual ways to write about it, though intellectual should not be confused with intelligent). It would have been nice if I could think up good examples, but I couldn't. You could possibly check things like how well it completes things like parts of this conversation (which is somewhere in the middle).

I wasn't really able to properly evaluate your links. There's just too much they assume that I don't know.

I found your first link, 'Transformers Learn In-Context by Gradient Descent' a bit hard to follow (though I don't particularly think it is a fault of the paper itself). Once they get down to the details, they lose me. It is interesting that it would come up with similar changes based on training and just 'reading' the context, but if both mechanisms are simple, I suppose that makes sense.

Their claim about how in-context learning can 'curve' better also reminds me of the ODEs used for samplers in diffusion models (I've written a number of samplers for diffusion models as a hobby and to work on my programming). Higher degree ODEs curve more too (though they have their own drawbacks and particularly high degree is generally a bad idea) by using extra samples, just like this can use extra layers. Gradient descent is effectively first degree by default, right? So it wouldn't be a surprise if you can curve more than it. You would expect sufficiently general things to resemble each other of course. I do find it a bit strange just how similar the loss for steps of gradient descent and transformer layers is. (Random point: I find that loss is not a very good metric for how good the actual results are, at least in image generation/reconstruction. Not that I know of a good replacement. People do often come up with various different ways of measuring it though.)
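To make the "first degree versus higher degree" point concrete, here is a minimal sketch in plain Python. It is an illustration only, not taken from either linked paper: it assumes that "degree" means what numerical-analysis texts usually call the order of the solver, and it compares a forward Euler step (first order, one slope evaluation per step) with a Heun step (second order, two slope evaluations per step) on the toy problem dy/dt = y. The function names and the test problem are hypothetical choices.

```python
import math


def euler_step(f, t, y, h):
    # First-order update: follow the slope at the current point only.
    return y + h * f(t, y)


def heun_step(f, t, y, h):
    # Second-order update: average the slope at the start of the step
    # and the slope at the Euler-predicted end of the step.
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1)
    return y + 0.5 * h * (k1 + k2)


def integrate(step, f, y0, t_end, n_steps):
    # Apply the same step rule n_steps times from t = 0 to t = t_end.
    h = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        y = step(f, t, y, h)
        t += h
    return y


def growth(t, y):
    # Toy problem dy/dt = y with exact solution y(t) = exp(t).
    return y


exact = math.e
euler_error = abs(integrate(euler_step, growth, 1.0, 1.0, 10) - exact)
heun_error = abs(integrate(heun_step, growth, 1.0, 1.0, 10) - exact)
print("Euler error:", euler_error)
print("Heun error: ", heun_error)
```

With the same number of steps, the second-order rule tracks the curved exact solution more closely, at the cost of one extra slope evaluation per step.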

Even though I can't critique the details, I do think it is important to note that I often find claims of similarity like this in areas I understand better to not be very illuminating because people want to find similarities/analogies to understand it more easily.

The graphs really are shockingly similar though in the single layer case, which raises the likelihood that there's something to it. And the multi-layer ones really do seem like simply a higher degree polynomial ODE.

The second link, 'In-context Learning and Gradient Descent Revisited', which was equally difficult, has this line: "Surprisingly, we find that untrained models achieve similarity scores at least as good as trained ones. This result provides strong evidence against the strong ICL-GD correspondence." That sounds pretty damning to me, assuming they are correct (which I also can't evaluate).

I could probably figure them out, but I expect it would take me a lot of time.

andy-arditi on Refusal in LLMs is mediated by a single direction

We ablate the direction everywhere for simplicity - intuitively this prevents the model from ever representing the direction in its computation, and so a behavioral change that results from the ablation can be attributed to mediation through this direction.

However, we noticed empirically that it is not necessary to ablate the direction at all layers in order to bypass refusal. Ablating at a narrow local region (2-3 middle layers) can be just as effective as ablating across all layers, suggesting that the direction is "read" or "processed" at some local region.
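For readers who want the operation spelled out, here is a minimal sketch in plain Python. It assumes that "ablating the direction" means removing an activation's component along a unit vector, and it is not the code used in the post. The toy vectors, the layer indices, and the helper names are all hypothetical. Each layer's activation is treated as an independent stored vector, so the sketch does not model how an ablation at one layer would change later layers in a real network.

```python
import math


def unit(v):
    # Normalise a vector to unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ablate_direction(activation, direction):
    # Subtract the projection onto the direction: x' = x - (x . r_hat) r_hat.
    r_hat = unit(direction)
    coeff = dot(activation, r_hat)
    return [a - coeff * r for a, r in zip(activation, r_hat)]


def ablate_at_layers(activations_by_layer, direction, layers):
    # Ablate only at the chosen layer indices; copy the others unchanged.
    return [
        ablate_direction(act, direction) if i in layers else list(act)
        for i, act in enumerate(activations_by_layer)
    ]


# Toy data: four "layers", each with a 3-dimensional activation vector.
direction = [1.0, 2.0, 2.0]
acts = [[3.0, 0.0, 1.0], [0.5, -1.0, 2.0], [1.0, 1.0, 1.0], [2.0, 2.0, 0.0]]

# Ablate only in a narrow middle region (layers 1 and 2).
ablated = ablate_at_layers(acts, direction, layers={1, 2})
for i, act in enumerate(ablated):
    print(i, dot(act, unit(direction)))
```

After the ablation, the middle-layer vectors have essentially zero component along the direction, while the untouched layers keep theirs.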

cameron-berg on Self-prediction acts as an emergent regularizer

Thanks for this! Consider the self-modeling loss gradient: $\partial L_{\text{self}} / \partial W = 2 E A A^{⊤}$, where $E = W - I$. While the identity function would globally minimize the self-modeling loss with zero loss for all inputs (effectively eliminating the task's influence by zeroing out its gradients), SGD learns local optima rather than global optima, and the gradients don't point directly toward the identity solution. The gradient depends on both the deviation from identity ($E$) and the activation covariance ($A A^{⊤}$), with the network balancing this against the primary task loss. Since the self-modeling prediction isn't just a separate output block (it's predicting the full activation pattern), the interaction between the primary task loss, activation covariance structure ($A A^{⊤}$), and need to maintain useful representations creates a complex optimization landscape where local optima dominate. We see this empirically in the consistent non-zero $W - I$ difference during training.

cameron-berg on Self-prediction acts as an emergent regularizer

The comparison to activation regularization is quite interesting. When we write down the self-modeling loss in terms of the self-modeling layer, we get $\| W a + b - a \|^{2} = \| (W - I) a + b \|^{2} = \| E a + b \|^{2}$.

This does resemble activation regularization, with the strength of regularization attenuated by how far the weight matrix is from identity (the magnitude of $E$). However, due to the recurrent nature of this loss, where updates to the weight matrix depend on activations that are themselves being updated by the loss, the resulting dynamics are quite complex. Looking at the gradient $\partial L_{\text{self}} / \partial W = 2 E A A^{⊤}$, we see that self-modeling depends on the full covariance structure of activations, not just pushing them toward zero or any fixed vector. The network must learn to actively predict its own evolving activation patterns rather than simply constraining their magnitude.
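As a sanity check on the gradient expression above, here is a minimal sketch in plain Python that compares $2 E A A^{⊤}$ with a finite-difference estimate. It makes two assumptions: the bias $b$ is zero, and the columns of $A$ are individual activation vectors, so the loss is the sum of $\| E a \|^{2}$ over columns. The matrices and helper names are made up for illustration and are not from the post.

```python
def matmul(X, Y):
    # Plain-Python matrix product of X (m x k) and Y (k x n).
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def transpose(X):
    return [list(row) for row in zip(*X)]


def deviation(W):
    # E = W - I, the deviation of the weight matrix from identity.
    n = len(W)
    return [[W[i][j] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]


def self_loss(W, A):
    # Sum over columns a of A of ||(W - I) a||^2, assuming b = 0.
    EA = matmul(deviation(W), A)
    return sum(x * x for row in EA for x in row)


def analytic_grad(W, A):
    # The expression from the comment: dL/dW = 2 E A A^T.
    EAAt = matmul(deviation(W), matmul(A, transpose(A)))
    return [[2.0 * x for x in row] for row in EAAt]


def numeric_grad(W, A, eps=1e-5):
    # Central finite differences, one weight entry at a time.
    n = len(W)
    grad = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            plus = [row[:] for row in W]
            minus = [row[:] for row in W]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad[i][j] = (self_loss(plus, A) - self_loss(minus, A)) / (2 * eps)
    return grad


# Hypothetical 2 x 2 weight matrix and three 2-dimensional activation vectors.
W = [[1.2, -0.3], [0.4, 0.9]]
A = [[1.0, 0.5, -1.0], [2.0, -0.5, 0.0]]

ana = analytic_grad(W, A)
num = numeric_grad(W, A)
max_gap = max(abs(ana[i][j] - num[i][j]) for i in range(2) for j in range(2))
print("largest gap between analytic and numeric gradient:", max_gap)
```

The gradient vanishes when $W = I$, and otherwise each row of $E$ is weighted by the activation second-moment matrix $A A^{⊤}$, which is the dependence on covariance structure described above.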

Comparing the complexity measures (SD & RLCT) between self-modeling and activation regularization is a great idea and we will definitely add this to the roadmap and report back. And batch norm/other forms of regularization were not added.

dagon on Derivative AT a discontinuity

What’s its derivative?

The graph is nonstandard and misleading. It should not have a vertical segment at 0; it should have an open circle at (0, 0) and a closed circle at (0, 1), showing that the lines do not and do contain the 0 point, respectively.

This makes the intuition pump a little easier. The derivative at all nonzero Xs is 0. The derivative AT ZERO, is 0 to the right (as X increases), and undefined to the left (as X decreases). There is no connection between 0 and 0 - epsilon, and therefore no slope.
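A small numerical sketch of the one-sided slopes may help here. It assumes the function in question is a unit step with f(x) = 0 for x < 0 and f(x) = 1 for x >= 0, matching the open and closed circles described above; the function and helper names are hypothetical.

```python
def step(x):
    # Closed circle at (0, 1): the value at zero belongs to the upper line.
    return 1.0 if x >= 0 else 0.0


def right_quotient(f, x, h):
    # Slope of the secant from x to x + h.
    return (f(x + h) - f(x)) / h


def left_quotient(f, x, h):
    # Slope of the secant from x - h to x.
    return (f(x) - f(x - h)) / h


# Shrinking step sizes: 0.1, 0.01, ..., 0.00001.
hs = [10.0 ** -k for k in range(1, 6)]
right = [right_quotient(step, 0.0, h) for h in hs]
left = [left_quotient(step, 0.0, h) for h in hs]

for h, r, l in zip(hs, right, left):
    print(h, r, l)
```

The right-hand quotients are all zero, while the left-hand quotients equal 1/h and grow without bound as h shrinks, so no finite left-hand slope exists at zero.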

You CAN use more complicated models to describe some features of it (hyperreals, or just limits), but those are modeling tools to answer different questions than the intuitive use of derivative (slope of a continuous curve). It's probably not right to say that any of them are "true", without some caveats.

steve2152 on Big tech transitions are slow (with implications for AI)

Seconding quetzel_rainbow’s comment. Another way to put it is:

If your reference class is “integrating a new technology into the economy”, then you’d expect AI integration to unfold over decades.
…But if your reference class is “integrating a new immigrant human into the economy—a human who is already generally educated, acculturated, entrepreneurial, etc.”, then you’d expect AI integration to unfold over years, months, even weeks. There’s still on-the-job training and so on, for sure, but we expect the immigrant human to take the initiative to figure out for themselves where the opportunities are and how to exploit them.

We don’t have AI that can do the latter yet, and I for one think that we’re still a paradigm-shift away from it. But I do expect the development of such AI to look like “people find a new type of learning algorithm” as opposed to “many many people find many many new algorithms for different niches”. After all, again, think of humans. Evolution did not design farmer-humans, and separately design truck-driver-humans, and separately design architect-humans, etc. Instead, evolution designed one human brain, and damn, look at all the different things that that one algorithm can figure out how to do (over time and in collaboration with many other instantiations of the same algorithm etc.).

How soon can we expect this new paradigm-shifting type of learning algorithm? I don’t know. But paradigm shifts in AI can be frighteningly fast. Like, go back a mere 12 years ago, and the entirety of deep learning was a backwater. See my tweet here for more fun examples.

nathan-helm-burger on Big tech transitions are slow (with implications for AI)

Another way to conceive of this is that it takes a certain number of competence-adjusted engineer hours to perform an integration of a novel technology into existing processes.

If AI is able to supply the engineer-hours for its own integration, it seems clear that this would change the wall-clock-time of the integration.

If the first thing AI is integrated into is automating AI R&D, then the AI's competence will rise as an industrial output of the process being integrated. Which further accelerates the process.

The result is dramatic changes over a few months or a couple of years.

Also, whether or not AI is integrated into the economy is kind of a side-note if you are facing the possibility of an agent far smarter than any human that has ever lived, and also able to parallelize copies of itself and run at 100s of times human speed. So even discussing integration into the economy as relevant presumes a plateau of capability at approximately human-level. What grounds do we have for expecting that?