LessWrong 2.0 Reader


LessWrong's (first) album: I Have Been A Good Bing
habryka (habryka4) · 2024-04-01T07:33:45.242Z · comments (158)

Transformers Represent Belief State Geometry in their Residual Stream
Adam Shai (adam-shai) · 2024-04-16T21:16:11.377Z · comments (83)

[link] [April Fools' Day] Introducing Open Asteroid Impact
Linch · 2024-04-01T08:14:15.800Z · comments (29)

[link] Thoughts on seed oil
dynomight · 2024-04-20T12:29:14.212Z · comments (114)

Express interest in an "FHI of the West"
habryka (habryka4) · 2024-04-18T03:32:58.592Z · comments (41)

[link] Paul Christiano named as US AI Safety Institute Head of AI Safety
Joel Burget (joel-burget) · 2024-04-16T16:22:06.937Z · comments (59)

Funny Anecdote of Eliezer From His Sister
Daniel Birnbaum (daniel-birnbaum) · 2024-04-22T22:05:31.886Z · comments (5)

Refusal in LLMs is mediated by a single direction
Andy Arditi (andy-arditi) · 2024-04-27T11:13:06.235Z · comments (76)

OMMC Announces RIP
Adam Scholl (adam_scholl) · 2024-04-01T23:20:00.433Z · comments (5)

[question] Examples of Highly Counterfactual Discoveries?
johnswentworth · 2024-04-23T22:19:19.399Z · answers+comments (93)

[link] FHI (Future of Humanity Institute) has shut down (2005–2024)
gwern · 2024-04-17T13:54:16.791Z · comments (21)

Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer
johnswentworth · 2024-04-18T00:27:43.451Z · comments (18)

Reconsider the anti-cavity bacteria if you are Asian
Lao Mein (derpherpize) · 2024-04-15T07:02:02.655Z · comments (41)

On Not Pulling The Ladder Up Behind You
Screwtape · 2024-04-26T21:58:29.455Z · comments (14)

Ironing Out the Squiggles
Zack_M_Davis · 2024-04-29T16:13:00.371Z · comments (34)

[link] Daniel Dennett has died (1942-2024)
kave · 2024-04-19T16:17:04.742Z · comments (5)

LLMs for Alignment Research: a safety priority?
abramdemski · 2024-04-04T20:03:22.484Z · comments (24)

My experience using financial commitments to overcome akrasia
William Howard (william-howard) · 2024-04-15T22:57:32.574Z · comments (31)

RTFB: On the New Proposed CAIP AI Bill
Zvi · 2024-04-10T18:30:08.410Z · comments (14)

[link] Simple probes can catch sleeper agents
Monte M (montemac) · 2024-04-23T21:10:47.784Z · comments (15)

A Selection of Randomly Selected SAE Features
CallumMcDougall (TheMcDouglas) · 2024-04-01T09:09:49.235Z · comments (2)

The first future and the best future
KatjaGrace · 2024-04-25T06:40:04.510Z · comments (11)

[question] What convincing warning shot could help prevent extinction from AI?
Charbel-Raphaël (charbel-raphael-segerie) · 2024-04-13T18:09:29.096Z · answers+comments (18)

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight
Sam Marks (samuel-marks) · 2024-04-18T16:17:39.136Z · comments (7)

[link] Carl Sagan, nuking the moon, and not nuking the moon
eukaryote · 2024-04-13T04:08:50.166Z · comments (7)

[link] MIRI's April 2024 Newsletter
Harlan · 2024-04-12T23:38:20.781Z · comments (0)

Partial value takeover without world takeover
KatjaGrace · 2024-04-05T06:20:03.961Z · comments (23)

Sparsify: A mechanistic interpretability research agenda
Lee Sharkey (Lee_Sharkey) · 2024-04-03T12:34:12.043Z · comments (22)

A Dozen Ways to Get More Dakka
Davidmanheim · 2024-04-08T04:45:19.427Z · comments (5)

[link] Essay competition on the Automation of Wisdom and Philosophy — $25k in prizes
owencb · 2024-04-16T10:10:13.338Z · comments (6)

Priors and Prejudice
MathiasKB (MathiasKirkBonde) · 2024-04-22T15:00:41.782Z · comments (16)

When is a mind me?
Rob Bensinger (RobbBB) · 2024-04-17T05:56:38.482Z · comments (62)

A couple productivity tips for overthinkers
Steven Byrnes (steve2152) · 2024-04-20T16:05:50.332Z · comments (9)

Coherence of Caches and Agents
johnswentworth · 2024-04-01T23:04:31.320Z · comments (7)

A Gentle Introduction to Risk Frameworks Beyond Forecasting
pendingsurvival · 2024-04-11T18:03:25.605Z · comments (10)

Creating unrestricted AI Agents with Command R+
Simon Lermen (dalasnoin) · 2024-04-16T14:52:50.917Z · comments (12)

[Full Post] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda (neel-nanda-1) · 2024-04-19T19:06:59.185Z · comments (8)

Announcing Suffering For Good
Garrett Baker (D0TheMath) · 2024-04-01T17:08:12.322Z · comments (5)

Rejecting Television
Declan Molony (declan-molony) · 2024-04-23T04:59:50.253Z · comments (10)

[link] Motivation gaps: Why so much EA criticism is hostile and lazy
titotal (lombertini) · 2024-04-22T11:49:59.389Z · comments (5)

[link] The 2nd Demographic Transition
Maxwell Tabarrok (maxwell-tabarrok) · 2024-04-06T14:10:13.095Z · comments (17)

Mid-conditional love
KatjaGrace · 2024-04-17T04:00:08.341Z · comments (19)

[Summary] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda (neel-nanda-1) · 2024-04-19T19:06:17.755Z · comments (0)

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
hugofry · 2024-04-29T20:57:35.127Z · comments (7)

[link] The Inner Ring by C. S. Lewis
Saul Munn (saul-munn) · 2024-04-24T22:48:09.228Z · comments (6)

Generalized Stat Mech: The Boltzmann Approach
David Lorell · 2024-04-12T17:47:31.880Z · comments (7)

AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt
DanielFilan · 2024-04-11T21:30:04.244Z · comments (10)

Best in Class Life Improvement
sapphire (deluks917) · 2024-04-04T01:51:02.556Z · comments (15)

How We Picture Bayesian Agents
johnswentworth · 2024-04-08T18:12:48.595Z · comments (11)

Constructability: Plainly-coded AGIs may be feasible in the near future
Épiphanie Gédéon (joy_void_joy) · 2024-04-27T16:04:45.894Z · comments (12)


Recent comments

raemon on Raemon's Shortform

Are the disagree reacts with ‘small icons are good for this reason (enough to override other concerns)’ or ‘I didn’t update previously?’

d0themath on We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming"

I will also suggest the questions: 1) What are the things I’m really confident in? 2) What are the things that the people I often read or talk to are really confident in? 3) Are there simple arguments, involving nothing more than bringing in little-thought-about domains of effect, which throw that confidence into question?

jesse-hoogland on Examples of Highly Counterfactual Discoveries?

Anecdotally (I couldn't find confirmation after a few minutes of searching), I remember hearing a claim about Darwin being particularly ahead of the curve with sexual selection & mate choice. That without Darwin it might have taken decades for biologists to come to the same realizations.

review-bot on Cohabitive Games so Far

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year. Will this post make the top fifty?

flewrint-ophiuni on Cooperation is optimal, with weaker agents too - tldr

Thank you for the references! I'm reading your writings, it's interesting

I posted the super-cooperation argument while expecting that LessWrong would likely not be receptive, but I'm not sure which community would engage with all this and find it pertinent at this stage

More concrete and empirical productions seem needed

mir on Johannes C. Mayer's Shortform

I rly like the idea of making songs to powerfwly remind urself abt things. TODO.

Step 1: Set an alarm for the morning.
Step 2: Set the alarm tone to this song.
Step 3: Make the alarm snooze for 30 minutes after the song has played.
Step 4: Make the alarm only dismissible by solving a puzzle.
Step 5: Only ever dismiss the alarm after you have already left the house for the walk.
Step 6: Always have an umbrella for when it is rainy, and have an alternative route without muddy roads.

I currently (until I get around to making a better system...) have an AI voice say reminders to myself based on calendar events I've set up to repeat every day (or any period I've defined). The event description is JSON, and if '"prompt": "Time to take a walk!"' is nonempty, the voice says what's in the prompt.
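The check described above can be sketched roughly as follows. This is only an illustration, assuming the event description is a JSON object with a "prompt" key; the function name `reminder_prompt` is hypothetical, and the calendar lookup and the text-to-speech step are left out because the comment doesn't say which tools are used.

```python
import json
from typing import Optional


def reminder_prompt(description: str) -> Optional[str]:
    """Return the text to speak for a calendar event, or None to stay silent."""
    try:
        data = json.loads(description)
    except json.JSONDecodeError:
        return None  # the description is not valid JSON
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object with keys
    prompt = data.get("prompt")
    # Speak only when "prompt" is a non-empty (non-whitespace) string.
    if isinstance(prompt, str) and prompt.strip():
        return prompt
    return None


print(reminder_prompt('{"prompt": "Time to take a walk!"}'))  # prints: Time to take a walk!
```

Events whose description is missing, is not JSON, or has an empty "prompt" simply produce no spoken reminder.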

I don't have any routines that are too forcefwl (like "only dismissable with solving a puzzle"), because I want to minimize whip and maximize carrot. If I can only do what's good bc I force myself to do it, it's much less effective compared to if I just *want* to do what's good all the time.

...But whip can often be effective, so I don't recommend never using it. I'm just especially weak to it, due to not having much social backup-motivation, and a heavy tendency to fall into deep depressive equilibria.

bogdan-ionut-cirstea on Refusal in LLMs is mediated by a single direction

We can implement this as an inference-time intervention: every time a component (e.g. an attention head) writes its output $c_{out} \in \mathbb{R}^{d_{model}}$ to the residual stream, we can erase its contribution to the "refusal direction" $\hat{r}$. We can do this by computing the projection of $c_{out}$ onto $\hat{r}$, and then subtracting this projection away:
$c'_{out} \leftarrow c_{out} - (c_{out} \cdot \hat{r})\hat{r}$
Note that we are ablating the same direction at every token and every layer. By performing this ablation at every component that writes the residual stream, we effectively prevent the model from ever representing this feature.
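The projection step in the quoted passage can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function name `ablate_direction` and the toy vectors are assumptions, and a real intervention would apply the same arithmetic to tensors at every component that writes to the residual stream.

```python
import math
from typing import List, Sequence


def ablate_direction(c_out: Sequence[float], r: Sequence[float]) -> List[float]:
    """Return c_out - (c_out . r_hat) r_hat, where r_hat is r scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in r))
    if norm == 0.0:
        raise ValueError("direction r must be non-zero")
    r_hat = [x / norm for x in r]
    # Scalar projection of c_out onto the unit direction.
    coeff = sum(c * u for c, u in zip(c_out, r_hat))
    # Subtract the projected component, leaving only the part orthogonal to r_hat.
    return [c - coeff * u for c, u in zip(c_out, r_hat)]


print(ablate_direction([3.0, 4.0], [2.0, 0.0]))  # prints: [0.0, 4.0]
```

After ablation the output has zero component along the chosen direction, and applying the function a second time changes nothing, which is what one would expect from a projection.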

I'll note that to me this seems surprisingly spiritually similar to lines 7-8 of Algorithm 1 (on page 13) of Concept Algebra for (Score-Based) Text-Controlled Generative Models, where they 'project out' a direction corresponding to a semantic concept after each diffusion step (in a diffusion model).

This seems notable because the above paper proposes a theory for why linear representations might emerge in diffusion models and the authors seem interested in potentially connecting their findings to representations in transformers (especially in the residual stream). From a response to a review:

Application to Other Generative Models: Ultimately, the results in the paper are about non-parametric representations (indeed, the results are about the structure of probability distributions directly!) The importance of diffusion models is that they non-parametrically model the conditional distribution, so that the score representation directly inherits the properties of the distribution.
To apply the results to other generative models, we must articulate the connection between the natural representations of these models (e.g., the residual stream in transformers) and the (estimated) conditional distributions. For autoregressive models like Parti, it’s not immediately clear how to do this. This is an exciting and important direction for future work!
(Very speculatively: models with finite dimensional representations are often trained with objective functions corresponding to log likelihoods of exponential family probability models, such that the natural finite dimensional representation corresponds to the natural parameter of the exponential family model. In exponential family models, the Stein score is exactly the inner product of the natural parameter with $y$. This weakly suggests that additive subspace structure may originate in these models following the same Stein score representation arguments!)
Connection to Interpretability: This is a great question! Indeed, a major motivation for starting this line of work is to try to understand if the "linear subspace hypothesis" in mechanistic interpretability of transformers is true, and why it arises if so. As just discussed, the missing step for precisely connecting our results to this line of work is articulating how the finite dimensional transformer representation (the residual stream) relates to the log probability of the conditional distributions. Solving this missing step would presumably allow the tool set developed here to be brought to bear on the interpretation of transformers.
One exciting observation here is that linear subspace structure appears to be a generic feature of probability distributions! Much mechanistic interpretability work motivates the linear subspace hypothesis by appealing to special structure of the transformer architecture (e.g., this is Anthropic's usual explanation). In contrast, our results suggest that linear encoding may fundamentally be about the structure of the data generating process.

johannes-c-mayer on Johannes C. Mayer's Shortform

Take a Walk

Taking a walk is the single most important thing. It really helps me think. My life magically reassembles itself when I reflect. I notice all the things that I know are good to do but fail to do.

In the past, I noticed that forcing myself to think about my research was counterproductive, and I devised other strategies for making myself think about it that actually worked, in 15 minutes.

The obvious things just work. Namely, you just fill your brain with the research's current state. What did you think about yesterday? Just remember. Just explain it to yourself. With the context loaded, the thoughts you want to have will come unbidden. Even when your walk is over you retain this context. Doing more research is natural now.

There were many other things I figured out during the walk, like the importance of structuring my research workflow, how meditation can help me, what the current bottleneck in my research is, and more.

It's tried and true. So it's ridiculous that so far I have not managed to notice its power. Of all the things that I do in a day, I thought this was one of the least important. But I was so wrong.

I also like talking to IA out loud during the walk. It's really fun and helpful. Talking out loud is helpful for me to build a better understanding, and IA often has good suggestions.

So how do we do this? How can we never forget to take a 30-minute walk in the sun? We make this song, and then go on:

and on and on and on.

We can also list other advantages to a walk, to make our brain remember this:

If you do it in the morning you get some sunlight which tells your brain to wake up. It's very effective.
Taking a walk takes you away from your computer. It's much harder for NixOS to eat you.
It's easy for me to talk to IA out loud when I am in a forest where nobody can hear me. The interaction is just better there. I hope to one day carry through my fearlessness from the walk to the rest of my life.

With that said, let's talk about how to never forget to take your daily walk:

Step 1: Set an alarm for the morning.
Step 2: Set the alarm tone to this song.
Step 3: Make the alarm snooze for 30 minutes after the song has played.
Step 4: Make the alarm only dismissible by solving a puzzle.
Step 5: Only ever dismiss the alarm after you have already left the house for the walk.
Step 6: Always have an umbrella for when it is rainy, and have an alternative route without muddy roads.

Now may you succeed!

askwho on AI #63: Introducing Alpha Fold 3

I've produced a multi-voiced, ElevenLabs-quality AI narration episode for this post:

https://askwhocastsai.substack.com/p/ai-63-introducing-alpha-fold-3-by

erioire on Deep Honesty

There's soft skills in "communicating to others without hurting them", (i.e. "tact")

What about the situation in which:

One has highly religious relatives who are somewhat less cognitively functional than oneself
You wish you could help them have a map more closely coupled to reality
You are confident that you have a good chance of convincing them of reality, but not that the knowledge would actually be a net gain for them to have, since:
They are so invested in their beliefs that the realization of falsehood might do irreparable psychological damage