LessWrong 2.0 Reader

My January alignment theory Nanowrimo
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-02T00:07:24.050Z · comments (2)

Estimating the benefits of a new flu drug (BXM)
DirectedEvolution (AllAmericanBreakfast) · 2025-01-06T04:31:16.837Z · comments (2)

2025 Color Trends
sarahconstantin · 2024-10-07T21:20:03.962Z · comments (7)

[Linkpost] Play with SAEs on Llama 3
Tom McGrath · 2024-09-25T22:35:44.824Z · comments (2)

[question] What are the most interesting / challenging evals (for humans) available?
Raemon · 2024-12-27T03:05:26.831Z · answers+comments (13)

[link] FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Tamay · 2024-11-14T06:13:22.042Z · comments (0)

[question] Are You More Real If You're Really Forgetful?
Thane Ruthenis · 2024-11-24T19:30:55.233Z · answers+comments (25)

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane (ckkissane) · 2024-10-27T18:46:21.316Z · comments (4)

Monthly Roundup #23: October 2024
Zvi · 2024-10-16T13:50:05.869Z · comments (13)

Nitric oxide for covid and other viral infections
Elizabeth (pktechgirl) · 2024-02-07T21:30:03.774Z · comments (6)

[link] Understanding Gödel’s completeness theorem
jessicata (jessica.liu.taylor) · 2024-05-27T18:55:02.079Z · comments (0)

Natural abstractions are observer-dependent: a conversation with John Wentworth
Martín Soto (martinsq) · 2024-02-12T17:28:38.889Z · comments (13)

Apply to the 2024 PIBBSS Summer Research Fellowship
Nora_Ammann · 2024-01-12T04:06:58.328Z · comments (1)

[link] Language Models Don't Learn the Physical Manifestation of Language
Bruce W. Lee (bruce-lee) · 2024-02-22T18:52:32.237Z · comments (23)

International Scientific Report on the Safety of Advanced AI: Key Information
Aryeh Englander (alenglander) · 2024-05-18T01:45:10.194Z · comments (0)

Instrumental deception and manipulation in LLMs - a case study
Olli Järviniemi (jarviniemi) · 2024-02-24T02:07:01.769Z · comments (13)

China-AI forecasts
[deleted] · 2024-02-25T16:49:33.652Z · comments (29)

[link] Linear infra-Bayesian Bandits
Vanessa Kosoy (vanessa-kosoy) · 2024-05-10T06:41:09.206Z · comments (5)

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?
Teun van der Weij (teun-van-der-weij) · 2024-01-29T00:24:27.706Z · comments (5)

You're a Space Wizard, Luke
lsusr · 2024-08-18T05:35:39.238Z · comments (6)

[link] AI governance needs a theory of victory
Corin Katzke (corin-katzke) · 2024-06-21T16:15:46.560Z · comments (6)

Forget Everything (Statistical Mechanics Part 1)
J Bostock (Jemist) · 2024-04-22T13:33:35.446Z · comments (6)

[link] An AI Manhattan Project is Not Inevitable
Maxwell Tabarrok (maxwell-tabarrok) · 2024-07-06T16:42:35.920Z · comments (25)

[Interim research report] Evaluating the Goal-Directedness of Language Models
Rauno Arike (rauno-arike) · 2024-07-18T18:19:04.260Z · comments (4)

Signaling with Small Orange Diamonds
jefftk (jkaufman) · 2024-11-07T20:20:08.026Z · comments (1)

What happens next?
Logan Zoellner (logan-zoellner) · 2024-12-29T01:41:33.685Z · comments (19)

Stitching SAEs of different sizes
Bart Bussmann (Stuckwork) · 2024-07-13T17:19:20.506Z · comments (12)

Medical Roundup #3
Zvi · 2024-07-09T13:10:06.862Z · comments (4)

D&D.Sci: Whom Shall You Call?
abstractapplic · 2024-07-05T20:53:37.010Z · comments (6)

Mech Interp Lacks Good Paradigms
Daniel Tan (dtch1997) · 2024-07-16T15:47:32.171Z · comments (0)

Evaluating Sparse Autoencoders with Board Game Models
Adam Karvonen (karvonenadam) · 2024-08-02T19:50:21.525Z · comments (1)

Free Will and Dodging Anvils: AIXI Off-Policy
Cole Wyeth (Amyr) · 2024-08-29T22:42:24.485Z · comments (12)

[question] What progress have we made on automated auditing?
LawrenceC (LawChan) · 2024-07-06T01:49:43.714Z · answers+comments (1)

Inducing Unprompted Misalignment in LLMs
Sam Svenningsen (sven) · 2024-04-19T20:00:58.067Z · comments (7)

(Appetitive, Consummatory) ≈ (RL, reflex)
Steven Byrnes (steve2152) · 2024-06-15T15:57:39.533Z · comments (1)

[link] Generative ML in chemistry is bottlenecked by synthesis
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-16T16:31:34.801Z · comments (2)

AI #70: A Beautiful Sonnet
Zvi · 2024-06-27T14:40:08.087Z · comments (0)

[link] [Fiction] A Confession
Arjun Panickssery (arjun-panickssery) · 2024-04-18T16:28:48.194Z · comments (2)

Dialogue on What It Means For Something to Have A Function/Purpose
johnswentworth · 2024-07-15T16:28:56.609Z · comments (5)

LLMs as a Planning Overhang
Larks · 2024-07-14T02:54:14.295Z · comments (8)

The "context window" analogy for human minds
Ruby · 2024-02-13T19:29:10.387Z · comments (0)

[link] On what research policymakers actually need
MondSemmel · 2024-04-23T19:50:12.833Z · comments (0)

[link] Tinker
Richard_Ngo (ricraz) · 2024-04-16T18:26:38.679Z · comments (0)

[link] Win Friends and Influence People Ch. 2: The Bombshell
gull · 2024-01-28T21:40:47.986Z · comments (13)

[link] Things You're Allowed to Do: At the Dentist
rbinnn · 2024-01-28T18:39:33.584Z · comments (16)

Losing Faith In Contrarianism
omnizoid · 2024-04-25T20:53:34.842Z · comments (44)

[link] Elon files grave charges against OpenAI
mako yass (MakoYass) · 2024-03-01T17:42:13.963Z · comments (10)

Are we so good to simulate?
KatjaGrace · 2024-03-04T05:20:03.535Z · comments (24)

Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity.
Josh Levy (josh-levy) · 2024-06-04T15:45:54.399Z · comments (0)

AI #48: The Talk of Davos
Zvi · 2024-01-25T16:20:26.625Z · comments (9)

Recent comments

wachichornia on Applying traditional economic thinking to AGI: a trilemma

I think he’s talking about cost disease?

https://en.m.wikipedia.org/wiki/Baumol_effect

zack_m_davis on Alignment Implications of LLM Successes: a Debate in One Act

(Self-review.) I'm as proud of this post as I am disappointed that it was necessary. As I explained to my prereaders on 19 October 2023:

My intent is to raise the level of the discourse by presenting an engagement between the standard MIRI view and a view that's relatively optimistic about prosaic alignment. The bet is that my simulated dialogue (with me writing both parts) can do a better job than the arguments being had by separate people in the wild; I think Simplicia understands things that e.g. Matthew Barnett doesn't. (The karma system loved my dialogue comment on Barnett's post; this draft is trying to scale that up.)

I'm annoyed at the discourse situation where MIRI thinks we're dead for the same fundamental reasons as in 2016, but meanwhile, there are a lot of people who are looking at GPT-4, and thinking, "Hey, this thing seems pretty smart and general and good at Doing What I Mean, in contrast to how 2016-era MIRI said that we didn't know how to get an agent to fill a cauldron; maybe alignment is easy??"—to which MIRI's response has been (my uncharitable paraphrase), "You people are idiots who didn't understand the core arguments; the cauldron thing was a toy illustration of a deep math thing; we never said Midjourney can't exist".

And just, I agree that Midjourney doesn't refute the deep math thing and the people who don't realize that are idiots, but I think the idiots deserve a better response!—particularly insofar as we're worried about transformative AI looking a lot like the systems we see now, rather than taking a "LLMs are nothing like AGI" stance.

Simplicia isn't supposed to pass the ITT of anyone in particular, but if the other character [...] doesn't match the MIRI party line, that's definitely a serious flaw that needs to be fixed!

I think the dialogue format works particularly well in cases like this where the author or the audience is supposed to find both viewpoints broadly credible, rather than an author avatar beating up on a strawman. (I did have some fun with Doomimir's characterization, but that shouldn't affect the arguments.)

This is a complicated topic. To the extent that I was having my own doubts about the "orthodox" pessimist story in the GPT-4-era, it was liberating to be able to explore those doubts in public by putting them in the mouth of a character with the designated idiot character name without staking my reputation on Simplicia's counterarguments necessarily being correct.

Giving both characters pejorative names makes it fair. In an earlier draft, Doomimir was "Doomer", but I was already using the "Optimistovna" and "Doomovitch" patronymics (I had been consuming fiction about the Soviet Union recently) and decided it should sound more Slavic. (Plus, "-mir" (мир) can mean "world".)

oliver-daniels on Scaling Sparse Feature Circuit Finding to Gemma 9B

Thanks for the thorough response, and apologies for missing the case study!

I think I regret / was wrong about my initial vaguely negative reaction - scaling SAE circuit discovery to large models is a notable achievement!

Re residual skip SAEs: I'm basically on board with "only use residual stream SAEs", but skipping layers still feels unprincipled. Like imagine if you only trained an SAE on the final layer of the model. By including all the features, you could perfectly recover the model behavior up to the SAE reconstruction loss, but you would have ~no insight into how the model computed the final layer features. More generally, by skipping layers, you risk missing potentially important intermediate features. Of course, to scale stuff you need to make sacrifices somewhere, but stuff in the vicinity of Cross-Coders feels more promising.

cstinesublime on CstineSublime's Shortform

How can you mimic the decision making of someone 'smarter' or at least with more know-how than you if... you... don't know-how?

Wearing purple clothes like Prince, getting his haircut, playing a 'love symbol guitar' and other superficialities won't make me as great a performer as he was, because the tail doesn't wag the dog.

Similarly if I wanted to write songs like him, using the same drum machines, writing lyrics with "2" and "U" and "4" and loading them with Christian allusions and sexual imagery, I'd be lucky if I'm perceptive enough as a mimic to produce some pastiches. However if I wanted to drill further, how might I 'black box' his songwriting mind, reverse engineer which cruxes and decision pivots determine what rhyming or rhythm patterns he chooses, what chord progressions he operates on. Maybe after years of doing this I'd have a model composed of testable hypotheses that I could run experiments on, either by reverse engineering songs of his at random and seeing if they hold to the patterns I observed, writing my own songs in this manner and seeing if they have that 'x-factor' (hardest and most subjective of all), and finally comparing the stated narratives in biographies and interviews about how certain songs were written in accordance with my hypotheses.

Of course someone is going to say that you can't reduce a human being, let alone a super-talented human being, to a formula, and perhaps draw a long bow about why they don't like A.I. art or modern Hollywood or whatever. All sentiments I'm sympathetic to, even if I'm not 100% sold on them.

What I'm thinking about is not too dissimilar from what Ray Dalio advises: One shouldn't just trust an expert's conclusion or advice blindly, even if they have an unparalleled pedigree.

But because I'm pretty extreme in believing that it is important to obtain understanding rather than accepting doctrine at face value, I would encourage the new batter not to accept what [Babe] Ruth has to say as right just because he was the greatest slugger of all time. If I were that new batter, I wouldn't stop questioning Ruth until I was confident I had found the truth.

In both cases, rather than just taking the end result blindly (writing parodic or pastiche songs, where the tail doesn't wag the dog), there is an attempt to find out why, to question!

My problem isn't so much that Prince (or Babe Ruth) are no longer around to answer these questions, but that unlike a multi-billionaire like Ray Dalio, anyone with sufficient pedigree is unlikely to pick up the phone and answer my incessant questions about "why?" and "how come?". I have to black-box it.

kave on Agreeing With Stalin in Ways That Exhibit Generally Rationalist Principles

Here's a little bit about the tapes in the video

jatin-nainani on Scaling Sparse Feature Circuit Finding to Gemma 9B

Thanks a lot for this review!

On strengths, we also believe that we are the first to examine “few SAEs” for scalable circuit discovery.

On weaknesses,

While we plan to do a more thorough sweep of SAE placements and comparison, the first weakness remains true for this post.
Our major argument in support of using few SAEs is imagining them as interpretable bottlenecks. Because they are so minimal and interpretable, they allow us to understand blocks of the transformer between them functionally (in terms of input and output). We were going to include more intuition about this but were worried it might add unnecessary complications. We mention the fact about the residual stream to highlight that information cannot be passed to layer L+1 by any other path than the residual output of layer L. Thus, by training a mask at layer L, we find a minimal set of representations needed for future layers. To future layers, nothing other than these latents matters. We do agree that the nature of circuits found with coarse-grained SAEs will differ, and this needs to be further studied.
We plan to explore the “gender bias removal” of Marks et al. [1] to compare the downstream application effectiveness. However, we do have a small application where we found a "bug" in the model, covered in section 5, where it over-relies on duplicate token latents. We can try to do something similar to Marks et al. [1] in trying to "fix" this bug.
Thanks for sharing the citation!

A core question shared in the community is whether the idea of circuits is plausible as models continue to scale up. Current automated methods are either too computationally expensive or generate a subgraph that is too large to examine. We explore the idea of a few equally spaced SAEs with the goal of solving both those issues. Though, as you mentioned, a more thorough comparison between circuits with different numbers of SAEs is needed.

diego-caples on Scaling Sparse Feature Circuit Finding to Gemma 9B

Thanks! We use mean ablation because this lets us create circuits including only the things which change between examples in a task. So, for example, in the code task, our circuits do not need “is python” latents, as these latents are consistent across all samples. Were we to zero ablate, we would need every single SAE latent necessary for every part of the task. This includes things which were consistent across all tasks! This means many latents which we don’t really care about are included in our circuits.
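To make the mean-versus-zero distinction concrete, here is a minimal toy sketch. It is not the authors' code: the `ablate` helper, the two-latent setup, and the linear `readout` are all hypothetical stand-ins, assuming only that "mean ablation" replaces a latent with its dataset mean and "zero ablation" sets it to zero.

```python
import numpy as np

def ablate(latents, keep, mode):
    """Ablate every latent column NOT listed in `keep`.

    latents: array of shape (n_samples, n_latents)
    mode: "mean" replaces a dropped latent with its mean over samples,
          "zero" replaces it with 0.
    """
    out = latents.copy()
    drop = np.ones(latents.shape[1], dtype=bool)
    drop[list(keep)] = False
    if mode == "mean":
        out[:, drop] = latents[:, drop].mean(axis=0)
    elif mode == "zero":
        out[:, drop] = 0.0
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out

# Toy data (assumption): column 0 is constant across samples, like an
# "is python" latent; column 1 varies from sample to sample.
latents = np.array([[2.0, 0.0],
                    [2.0, 1.0],
                    [2.0, 3.0]])
readout = np.array([1.0, 0.5])  # hypothetical linear downstream computation

full = latents @ readout
mean_out = ablate(latents, keep={1}, mode="mean") @ readout
zero_out = ablate(latents, keep={1}, mode="zero") @ readout

# Largest change in the output caused by leaving latent 0 out of the circuit.
mean_gap = float(np.abs(full - mean_out).max())
zero_gap = float(np.abs(full - zero_out).max())
print(mean_gap, zero_gap)
```

In this toy, dropping the constant latent costs nothing under mean ablation but shifts every output under zero ablation, which is why a zero-ablation circuit would have to include it.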

daniel-tan on Daniel Tan's Shortform

Introspection is an instantiation of 'Connecting the Dots'.

Connecting the Dots: train a model g on (x, f(x)) pairs; the model g can infer things about f.
Introspection: Train a model g on (x, f(x)) pairs, where x are prompts and f(x) are the model's responses. Then the model can infer things about f. Note that here we have f = g, which is a special case of the above.
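As a very loose toy analogue of the schema above (not the language-model setting, and not anyone's actual experiment), one can fit a model `g` on `(x, f(x))` pairs and then read off a property of `f` that never appears in the data as a pair. The linear `f`, the least-squares fit, and the "fixed point" query are all illustrative assumptions.

```python
import numpy as np

def f(x):
    # Hidden function (assumption for the toy): slope 3, intercept 1.
    return 3.0 * x + 1.0

# Training data: (x, f(x)) pairs only.
xs = np.linspace(-1.0, 1.0, 20)
pairs = np.stack([xs, f(xs)], axis=1)

# "Train" g: a least-squares line fit to the pairs.
slope, intercept = np.polyfit(pairs[:, 0], pairs[:, 1], deg=1)

def g(x):
    return slope * x + intercept

# A fact about f inferred through g: the point where f(x) = x.
# This x is not among the training inputs.
inferred_fixed_point = -intercept / (slope - 1.0)
print(slope, intercept, inferred_fixed_point)
```

The introspection case would correspond to the pairs being generated by `g` itself (f = g), which this sketch does not attempt to model.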

daniel-tan on Daniel Tan's Shortform

How does language model introspection work? What mechanisms could be at play?

'Introspection': When we ask a language model about its own capabilities, a lot of times this turns out to be a calibrated estimate of the actual capabilities. E.g. models 'know what they know', i.e. can predict whether they know answers to factual questions. Furthermore this estimate gets better when models have access to their previous (question, answer) pairs.

One simple hypothesis is that a language model simply infers the general level of capability from the previous text. Then we'd expect that more powerful language models are better at this than weaker ones. However, Owain's work on introspection finds evidence to the contrary. This implies there must be 'privileged information'.

Another possibility is that the model simulates itself answering the inner question, and then uses that information to answer the outer question, similar to latent reasoning. If models really do this two-step computation, then it should be possible to recover a 'bridge entity' at some point in the representations.

It's plausible that the privileged information is something 'more abstract', not amounting to a full simulation of the language model's own forward pass, but nonetheless carrying useful information about its own level of capability.

samuelshadrach on xpostah's Shortform

I've made a new "quick take" explaining it. Please let me know.

P.S. Anybody can purchase any domain for $10, I don't see why domains should be more trustworthy than IP addresses. Anyway, I've added it to my domain now.