LessWrong 2.0 Reader

[link] Searching for the Root of the Tree of Evil
Ivan Vendrov (ivan-vendrov) · 2024-06-08T17:05:53.950Z · comments (14)
Debate: Is it ethical to work at AI capabilities companies?
Ben Pace (Benito) · 2024-08-14T00:18:38.846Z · comments (21)
My disagreements with "AGI ruin: A List of Lethalities"
Noosphere89 (sharmake-farah) · 2024-09-15T17:22:18.367Z · comments (46)
[link] UC Berkeley course on LLMs and ML Safety
Dan H (dan-hendrycks) · 2024-07-09T15:40:00.920Z · comments (1)
[link] [Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs
Yohan Mathew (ymath) · 2024-09-25T14:52:48.263Z · comments (2)
[link] Big tech transitions are slow (with implications for AI)
jasoncrawford · 2024-10-24T14:25:06.873Z · comments (16)
[question] What are your cruxes for imprecise probabilities / decision rules?
Anthony DiGiovanni (antimonyanthony) · 2024-07-31T15:42:27.057Z · answers+comments (33)
Is the Power Grid Sustainable?
jefftk (jkaufman) · 2024-10-26T02:30:06.612Z · comments (38)
But Where do the Variables of my Causal Model come from?
Dalcy (Darcy) · 2024-08-09T22:07:57.395Z · comments (1)
A Matter of Taste
Zvi · 2024-12-18T17:50:07.201Z · comments (4)
Drone Wars Endgame
RussellThor · 2024-02-01T02:30:46.161Z · comments (71)
AI Safety Camp final presentations
Linda Linsefors · 2024-03-29T14:27:43.503Z · comments (3)
Good job opportunities for helping with the most important century
HoldenKarnofsky · 2024-01-18T17:30:03.332Z · comments (0)
Closeness To the Issue (Part 5 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-09T00:36:47.388Z · comments (0)
[link] Claude 3 Opus can operate as a Turing machine
Gunnar_Zarncke · 2024-04-17T08:41:57.209Z · comments (2)
AI #47: Meet the New Year
Zvi · 2024-01-13T16:20:10.519Z · comments (7)
Childhood and Education Roundup #5
Zvi · 2024-04-17T13:00:03.015Z · comments (4)
Introduce a Speed Maximum
jefftk (jkaufman) · 2024-01-11T02:50:04.284Z · comments (28)
The Evolution of Humans Was Net-Negative for Human Values
Zack_M_Davis · 2024-04-01T16:01:10.037Z · comments (1)
[link] Toki pona FAQ
dkl9 · 2024-03-17T21:44:21.782Z · comments (8)
On Dwarkesh’s 3rd Podcast With Tyler Cowen
Zvi · 2024-02-02T19:30:05.974Z · comments (9)
Grammars, subgrammars, and combinatorics of generalization in transformers
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-02T09:37:23.191Z · comments (0)
[link] The Way According To Zvi
Sable · 2024-12-07T17:35:48.769Z · comments (0)
Deep Learning is cheap Solomonoff induction?
Lucius Bushnaq (Lblack) · 2024-12-07T11:00:56.455Z · comments (1)
Cross-context abduction: LLMs make inferences about procedural training data leveraging declarative facts in earlier training data
Sohaib Imran (sohaib-imran) · 2024-11-16T23:22:21.857Z · comments (11)
Intent alignment as a stepping-stone to value alignment
Seth Herd · 2024-11-05T20:43:24.950Z · comments (4)
[question] Feedback request: what am I missing?
Nathan Helm-Burger (nathan-helm-burger) · 2024-11-02T17:38:39.625Z · answers+comments (5)
Fertility Roundup #4
Zvi · 2024-12-02T14:30:05.968Z · comments (16)
[question] Which Biases are most important to Overcome?
abstractapplic · 2024-12-01T15:40:06.096Z · answers+comments (24)
Rolling Thresholds for AGI Scaling Regulation
Larks · 2025-01-12T01:30:23.797Z · comments (3)
[link] Twitter thread on politics of AI safety
Richard_Ngo (ricraz) · 2024-07-31T00:00:34.298Z · comments (2)
[link] Is the AI Doomsday Narrative the Product of a Big Tech Conspiracy?
garrison · 2024-12-04T19:20:59.286Z · comments (1)
Humans aren't fleeb.
Charlie Steiner · 2024-01-24T05:31:46.929Z · comments (5)
An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs
Jan Wehner · 2024-07-14T10:37:21.544Z · comments (5)
Index of rationalist groups in the Bay Area July 2024
Lucie Philippon (lucie-philippon) · 2024-07-26T16:32:25.337Z · comments (14)
Categories of leadership on technical teams
benkuhn · 2024-07-22T04:50:04.071Z · comments (0)
A sketch of acausal trade in practice
Richard_Ngo (ricraz) · 2024-02-04T00:32:54.622Z · comments (4)
My Detailed Notes & Commentary from Secular Solstice
Jeffrey Heninger (jeffrey-heninger) · 2024-03-23T18:48:51.894Z · comments (16)
Open consultancy: Letting untrusted AIs choose what answer to argue for
Fabien Roger (Fabien) · 2024-03-12T20:38:03.785Z · comments (5)
LessWrong: After Dark, a new side of LessWrong
So8res · 2024-04-01T22:44:04.449Z · comments (5)
Dangers of Closed-Loop AI
Gordon Seidoh Worley (gworley) · 2024-03-22T23:52:22.010Z · comments (9)
How I select alignment research projects
Ethan Perez (ethan-perez) · 2024-04-10T04:33:08.092Z · comments (4)
Agency in Politics
Martin Sustrik (sustrik) · 2024-07-17T05:30:01.873Z · comments (2)
Predictive model agents are sort of corrigible
Raymond D · 2024-01-05T14:05:03.037Z · comments (6)
What Helped Me - Kale, Blood, CPAP, X-tiamine, Methylphenidate
Johannes C. Mayer (johannes-c-mayer) · 2024-01-03T13:22:11.700Z · comments (12)
Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?
RogerDearnaley (roger-d-1) · 2024-01-11T12:56:29.672Z · comments (4)
[link] On Fables and Nuanced Charts
Niko_McCarty (niko-2) · 2024-09-08T17:09:07.503Z · comments (2)
The Cognitive Bootcamp Agreement
Raemon · 2024-10-16T23:24:05.509Z · comments (0)
[link] My article in The Nation — California’s AI Safety Bill Is a Mask-Off Moment for the Industry
garrison · 2024-08-15T19:25:59.592Z · comments (0)
Representation Tuning
Christopher Ackerman (christopher-ackerman) · 2024-06-27T17:44:33.338Z · comments (9)