LessWrong 2.0 Reader

Mid-conditional love
KatjaGrace · 2024-04-17T04:00:08.341Z · comments (19)

[link] Motivation gaps: Why so much EA criticism is hostile and lazy
titotal (lombertini) · 2024-04-22T11:49:59.389Z · comments (5)

[link] The Inner Ring by C. S. Lewis
Saul Munn (saul-munn) · 2024-04-24T22:48:09.228Z · comments (6)

[Summary] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda (neel-nanda-1) · 2024-04-19T19:06:17.755Z · comments (0)

LessWrong Community Weekend 2024, open for applications
UnplannedCauliflower · 2024-05-01T10:18:21.992Z · comments (2)

Duct Tape security
Isaac King (KingSupernova) · 2024-04-26T18:57:05.659Z · comments (12)

Constructability: Plainly-coded AGIs may be feasible in the near future
Épiphanie Gédéon (joy_void_joy) · 2024-04-27T16:04:45.894Z · comments (12)

Introducing AI-Powered Audiobooks of Rational Fiction Classics
Askwho · 2024-05-04T17:32:49.719Z · comments (13)

LW Frontpage Experiments! (aka "Take the wheel, Shoggoth!")
Ruby · 2024-04-23T03:58:43.443Z · comments (25)

[link] DeepMind: Frontier Safety Framework
Zach Stein-Perlman · 2024-05-17T17:30:02.504Z · comments (0)

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Joar Skalse (Logical_Lunatic) · 2024-05-17T19:13:31.380Z · comments (5)

On Llama-3 and Dwarkesh Patel’s Podcast with Zuckerberg
Zvi · 2024-04-22T13:10:02.645Z · comments (4)

AISC9 has ended and there will be an AISC10
Linda Linsefors · 2024-04-29T10:53:18.812Z · comments (4)

[link] Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan (SenR) · 2024-04-25T18:43:47.003Z · comments (35)

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Lucius Bushnaq (Lblack) · 2024-05-20T17:53:25.985Z · comments (2)

How to be an amateur polyglot
arisAlexis (arisalexis) · 2024-05-08T15:08:11.404Z · comments (16)

[link] How do open AI models affect incentive to race?
jessicata (jessica.liu.taylor) · 2024-05-07T00:33:20.658Z · comments (13)

LessOnline Festival Updates Thread
Ben Pace (Benito) · 2024-04-18T21:55:08.003Z · comments (26)

Transcoders enable fine-grained interpretable circuit analysis for language models
Jacob Dunefsky (jacob-dunefsky) · 2024-04-30T17:58:09.982Z · comments (14)

[link] Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Gunnar_Zarncke · 2024-05-16T13:09:39.265Z · comments (4)

[link] "Why I Write" by George Orwell (1946)
Arjun Panickssery (arjun-panickssery) · 2024-04-25T16:02:28.668Z · comments (2)

So What's Up With PUFAs Chemically?
J Bostock (Jemist) · 2024-04-27T13:32:52.159Z · comments (23)

[question] Shane Legg's necessary properties for every AGI Safety plan
jacquesthibs (jacques-thibodeau) · 2024-05-01T17:15:41.233Z · answers+comments (12)

[link] This is Water by David Foster Wallace
Nathan Young · 2024-04-24T21:21:09.445Z · comments (16)

Apply to ESPR & PAIR, Rationality and AI Camps for Ages 16-21
Anna Gajdova (anna-gajdova) · 2024-05-03T12:36:37.610Z · comments (0)

Now THIS is forecasting: understanding Epoch’s Direct Approach
Elliot_Mckernon (elliot) · 2024-05-04T12:06:48.144Z · comments (4)

Experiment on repeating choices
KatjaGrace · 2024-04-19T04:20:03.992Z · comments (1)

[link] Let's Design A School, Part 1
Sable · 2024-04-23T21:50:20.937Z · comments (5)

[link] OpenAI releases GPT-4o, natively interfacing with text, voice and vision
Martín Soto (martinsq) · 2024-05-13T18:50:52.337Z · comments (23)

[link] Moving on from community living
Vika · 2024-04-17T17:02:11.357Z · comments (7)

some thoughts on LessOnline
Raemon · 2024-05-08T23:17:41.372Z · comments (5)

[link] Questions are usually too cheap
Nathan Young · 2024-05-11T13:00:54.302Z · comments (19)

Transfer Learning in Humans
niplav · 2024-04-21T20:49:42.595Z · comments (1)

[link] Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Dan Braun (Daniel Braun) · 2024-05-17T16:25:02.267Z · comments (2)

Superposition is not "just" neuron polysemanticity
LawrenceC (LawChan) · 2024-04-26T23:22:06.066Z · comments (4)

Towards a formalization of the agent structure problem
Alex_Altair · 2024-04-29T20:28:15.190Z · comments (5)

Can we build a better Public Doublecrux?
Raemon · 2024-05-11T19:21:53.326Z · comments (7)

Spatial attention as a “tell” for empathetic simulation?
Steven Byrnes (steve2152) · 2024-04-26T15:10:58.040Z · comments (11)

[link] LLMs seem (relatively) safe
JustisMills · 2024-04-25T22:13:06.221Z · comments (24)

Why Care About Natural Latents?
johnswentworth · 2024-05-09T23:14:30.626Z · comments (3)

Observations on Teaching for Four Weeks
ClareChiaraVincent · 2024-05-06T16:55:59.315Z · comments (14)

Why you should learn a musical instrument
cata · 2024-05-15T20:36:16.034Z · comments (23)

Changes in College Admissions
Zvi · 2024-04-24T13:50:03.487Z · comments (10)

Catastrophic Goodhart in RL with KL penalty
Thomas Kwa (thomas-kwa) · 2024-05-15T00:58:20.763Z · comments (7)

Mechanistic Interpretability Workshop Happening at ICML 2024!
Neel Nanda (neel-nanda-1) · 2024-05-03T01:18:26.936Z · comments (6)

[link] Designing for a single purpose
Itay Dreyfus (itay-dreyfus) · 2024-05-07T14:11:22.242Z · comments (12)

The Mom Test: Summary and Thoughts
Adam Zerner (adamzerner) · 2024-04-18T03:34:21.020Z · comments (3)

I'm open for projects (sort of)
cousin_it · 2024-04-18T18:05:01.395Z · comments (12)

[link] "If we go extinct due to misaligned AI, at least nature will continue, right? ... right?"
plex (ete) · 2024-05-18T14:09:53.014Z · comments (23)

How to do conceptual research: Case study interview with Caspar Oesterheld
Chi Nguyen · 2024-05-14T15:09:30.390Z · comments (5)

Recent comments

lblack on Interpretability: Integrated Gradients is a decent attribution method

If you want to get attributions between all pairs of basis elements/features in two layers, attributions based on the effect of a marginal ablation will take you $O(d^2)$ forward passes, where $d$ is the number of features in a layer. Integrated gradients will take $O(d)$ backward passes, and if you're willing to write custom code that exploits the specific form of the layer transition, it can take less than that.

If you're averaging over a data set, IG is also amenable to additional cost reduction through stochastic source techniques.

johannes-c-mayer on Fund me please - I Work so Hard that my Feet start Bleeding and I Need to Infiltrate University

How much does this line up with your model?

stefan42 on Interpretability: Integrated Gradients is a decent attribution method

Maybe I'm confused, but isn't integrated gradients strictly slower than an ablation to a baseline?

For a single interaction yes (1 forward pass vs integral with n_alpha integration steps, each requiring a backward pass).

For many interactions (e.g. all connections between two layers) IGs can be faster:

Ablation requires d_embed^2 forward passes (if you want to get the effect of every patch on the loss)
Integrated gradients requires d_embed * n_alpha forward & backward passes

(This is assuming you do path patching rather than "edge patching", which you should in this scenario.)

Sam Marks makes a similar point in Sparse Feature Circuits, near equations (2), (3), and (4).
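The pass counts above can be made concrete with a small sketch. The function names and the example values of `d_embed` and `n_alpha` are hypothetical, chosen only for illustration, and the sketch counts passes without modelling the extra cost of a backward pass relative to a forward pass.

```python
def ablation_forward_passes(d_embed: int) -> int:
    """Forward passes needed to ablate every (source, target) feature pair."""
    return d_embed * d_embed


def integrated_gradients_passes(d_embed: int, n_alpha: int) -> int:
    """Forward-and-backward passes: one per feature per integration step."""
    return d_embed * n_alpha


# Illustrative values only (assumptions, not taken from the thread).
d_embed = 512
n_alpha = 50

ablation_cost = ablation_forward_passes(d_embed)
ig_cost = integrated_gradients_passes(d_embed, n_alpha)

# By raw pass count, integrated gradients wins whenever n_alpha < d_embed.
ig_needs_fewer_passes = ig_cost < ablation_cost
```

Under this simple count, the break-even point is `n_alpha == d_embed`; with more integration steps than features, ablation needs fewer passes.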

amalthea on Open Thread Spring 2024

I think the best bet is to vote for a generally reasonable party. Despite their many flaws, it seems like the Green Party or the SPD are the best choices right now. (The CDU seems to be too influenced by business interests; the current FDP is even worse.)

The alternative would be to vote for a small party with a good agenda to help signal-boost them, but I don't know who's around these days.

mondsemmel on Open Thread Spring 2024

I didn't get any replies on my question post re: the EU parliamentary election and AI x-risk, but does anyone have a suggestion for a party I could vote for (in Germany) when it comes to x-risk?

lblack on Interpretability: Integrated Gradients is a decent attribution method

The same applies with attribution in general (e.g. in decision making).

As in, you're also skeptical of traditional Shapley values in discrete coalition games?

"Completeness" strikes me as a desirable property for attributions to be properly normalized. If attributions aren't bounded in some way, it doesn't seem to me like they're really 'attributions'.

Very open to counterarguments here, though. I'm not particularly confident here either. There's a reason this post isn't titled 'Integrated Gradients are the correct attribution method'.
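As a rough illustration of the completeness property mentioned above, here is a minimal sketch of integrated gradients on a toy function. The function `f`, its hand-written gradient, the zero baseline, and the midpoint-rule integration are all assumptions made for this example; none of them come from the post. Completeness means the attributions sum to `f(x) - f(baseline)`.

```python
def f(x):
    """Toy scalar function of three inputs (hypothetical example)."""
    return x[0] * x[1] + x[2] ** 2


def grad_f(x):
    """Analytic gradient of f, written out by hand."""
    return [x[1], x[0], 2.0 * x[2]]


def integrated_gradients(x, baseline, n_alpha=50):
    """Approximate IG along the straight path from baseline to x (midpoint rule)."""
    dims = len(x)
    avg_grad = [0.0] * dims
    for k in range(n_alpha):
        alpha = (k + 0.5) / n_alpha
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(dims):
            avg_grad[i] += g[i] / n_alpha
    # Attribution = (input - baseline) * path-averaged gradient.
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

# Completeness check: the gap should be zero up to floating-point error.
completeness_gap = abs(sum(attributions) - (f(x) - f(baseline)))
```

For this quadratic toy function the gradient is linear along the path, so the midpoint rule is exact up to rounding; for a real network the gap would depend on the number of integration steps.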

lblack on The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

I doubt it. Evaluating gradients along an entire trajectory from a baseline gave qualitatively similar results.

A saturated softmax also really does induce insensitivity to small changes. If two nodes are always connected by a saturated softmax, they can't be exchanging more than one bit of information. Though the importance of that bit can be large.

My best guess for why the Interaction Basis didn't work is that sparse, overcomplete representations really are a thing. So in general, you're not going to get a good decomposition of LMs from a Cartesian basis of activation space.
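The remark above about a saturated softmax being insensitive to small changes can be checked numerically. This is a minimal sketch with hypothetical two-logit inputs and a hypothetical perturbation size; it illustrates only the insensitivity, not the one-bit information claim.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def first_output_shift(logits, eps=0.1):
    """Change in the first softmax output when the first logit moves by eps."""
    perturbed = [logits[0] + eps] + list(logits[1:])
    return abs(softmax(perturbed)[0] - softmax(logits)[0])


# Saturated: one logit dominates, so the output is pinned near 1.
saturated_shift = first_output_shift([10.0, 0.0])

# Unsaturated: equal logits, so the output sits at 0.5 and moves freely.
unsaturated_shift = first_output_shift([0.0, 0.0])
```

The same perturbation moves the unsaturated output by orders of magnitude more than the saturated one.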

simeon_c on Stephen Fowler's Shortform

Mhhh, that seems very bad for someone in an AISI in general. I'd guess Jade Leung might sadly be under the same obligations...

That seems like a huge deal to me with disastrous consequences, thanks a lot for flagging.

zach-stein-perlman on Questions for labs

Fine-tuning for GPT-4 has been in an experimental access program since at least November, and OpenAI has written about fine-tuning GPT-4 for a telecom company.
Anthropic says "Our API does not currently offer fine-tuning, but please ask your Anthropic contact if you are interested in exploring this option."
You can apparently fine-tune Gemini 1.0 Pro.

Maybe setting up custom fine-tuning is hard and labs often only set it up during deployment...

(Separately, it would be nice if OpenAI and Anthropic let some safety researchers do fine-tuning now.)

o-o on What's Going on With OpenAI's Messaging?

He isn’t in charge there. He simply offers research directions and probably a link to academia.