LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

DeekSeek v3: The Six Million Dollar Model
Zvi · 2024-12-31T15:10:06.924Z · comments (6)

Greedy-Advantage-Aware RLHF
sej2020 · 2024-12-27T19:47:25.562Z · comments (15)

Role embeddings: making authorship more salient to LLMs
Nina Panickssery (NinaR) · 2025-01-07T20:13:16.677Z · comments (0)

[link] Discursive Warfare and Faction Formation
Benquo · 2025-01-09T16:47:31.824Z · comments (2)

[link] Recommendations for Technical AI Safety Research Directions
Sam Marks (samuel-marks) · 2025-01-10T19:34:04.920Z · comments (1)

Considerations on orca intelligence
Towards_Keeperhood (Simon Skade) · 2024-12-29T14:35:16.445Z · comments (5)

[link] The Deep Lore of LightHaven, with Oliver Habryka (TBC episode 228)
Eneasz · 2024-12-24T22:45:50.065Z · comments (4)

Implications of the AI Security Gap
Dan Braun (dan-braun-1) · 2025-01-08T08:31:36.789Z · comments (0)

AI #97: 4
Zvi · 2025-01-02T14:10:06.505Z · comments (4)

[link] Preference Inversion
Benquo · 2025-01-02T18:15:52.938Z · comments (46)

[link] Oppression and production are competing explanations for wealth inequality.
Benquo · 2025-01-05T14:13:15.398Z · comments (15)

[link] Began a pay-on-results coaching experiment, made $40,300 since July
Chipmonk · 2024-12-29T21:12:02.574Z · comments (14)

Practicing Bayesian Epistemology with "Two Boys" Probability Puzzles
Liron · 2025-01-02T04:42:20.362Z · comments (14)

MATS mentor selection
DanielFilan · 2025-01-10T03:12:52.141Z · comments (8)

My January alignment theory Nanowrimo
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-02T00:07:24.050Z · comments (2)

[question] What are the most interesting / challenging evals (for humans) available?
Raemon · 2024-12-27T03:05:26.831Z · answers+comments (13)

On Dwarkesh Patel’s 4th Podcast With Tyler Cowen
Zvi · 2025-01-10T13:50:05.563Z · comments (6)

Estimating the benefits of a new flu drug (BXM)
DirectedEvolution (AllAmericanBreakfast) · 2025-01-06T04:31:16.837Z · comments (2)

[link] Alignment Is Not All You Need
Adam Jones (domdomegg) · 2025-01-02T17:50:00.486Z · comments (10)

What happens next?
Logan Zoellner (logan-zoellner) · 2024-12-29T01:41:33.685Z · comments (19)

Building Big Science from the Bottom-Up: A Fractal Approach to AI Safety
Lauren Greenspan (LaurenGreenspan) · 2025-01-07T03:08:51.447Z · comments (2)

The Laws of Large Numbers
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-04T11:54:16.967Z · comments (9)

Grammars, subgrammars, and combinatorics of generalization in transformers
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-02T09:37:23.191Z · comments (0)

Alternative Cancer Care As Biohacking & Book Review: Surviving "Terminal" Cancer
DenizT · 2025-01-06T07:43:52.773Z · comments (6)

Rolling Thresholds for AGI Scaling Regulation
Larks · 2025-01-12T01:30:23.797Z · comments (3)

Fireplace and Candle Smoke
jefftk (jkaufman) · 2025-01-01T01:50:01.408Z · comments (4)

Dmitry's Koan
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-10T04:27:30.346Z · comments (2)

D&D.Sci Dungeonbuilding: the Dungeon Tournament Evaluation & Ruleset
aphyer · 2025-01-07T05:02:25.929Z · comments (8)

Childhood and Education #8: Dealing with the Internet
Zvi · 2025-01-06T14:00:09.604Z · comments (6)

If all trade is voluntary, then what is "exploitation?"
Darmani · 2024-12-27T11:21:30.036Z · comments (59)

XX by Rian Hughes: Pretentious Bullshit
Yair Halberstadt (yair-halberstadt) · 2025-01-08T13:02:52.438Z · comments (5)

[link] You should delay engineering-heavy research in light of R&D automation
Daniel Paleka · 2025-01-07T02:11:11.501Z · comments (3)

1. Meet the Players: Value Diversity
Allison Duettmann (allison-duettmann) · 2025-01-02T19:00:52.696Z · comments (2)

Last week of the Discussion Phase
Raemon · 2025-01-09T19:26:59.136Z · comments (0)

A Principled Cartoon Guide to NVC
plex (ete) · 2025-01-07T21:01:07.904Z · comments (5)

People aren't properly calibrated on FrontierMath
cakubilo · 2024-12-23T19:35:44.467Z · comments (4)

Acknowledging Background Information with P(Q|I)
JenniferRM · 2024-12-24T18:50:25.323Z · comments (8)

Disagreement on AGI Suggests It’s Near
tangerine · 2025-01-07T20:42:43.456Z · comments (15)

Two Weeks Without Sweets
jefftk (jkaufman) · 2024-12-31T03:30:02.003Z · comments (0)

Corrigibility's Desirability is Timing-Sensitive
RobertM (T3t) · 2024-12-26T22:24:17.435Z · comments (4)

Is AI Alignment Enough?
Aram Panasenco (panasenco) · 2025-01-10T18:57:48.409Z · comments (6)

Book Summary: Zero to One
bilalchughtai (beelal) · 2024-12-29T16:13:52.922Z · comments (1)

AI #98: World Ends With Six Word Story
Zvi · 2025-01-09T16:30:07.341Z · comments (1)

Will bird flu be the next Covid? "Little chance" says my dashboard.
Nathan Young · 2025-01-07T20:10:50.080Z · comments (0)

Preface
Allison Duettmann (allison-duettmann) · 2025-01-02T18:59:46.290Z · comments (1)

Intranasal mRNA Vaccines?
J Bostock (Jemist) · 2025-01-01T23:46:40.524Z · comments (2)

Living with Rats in College
lsusr · 2024-12-25T10:44:13.085Z · comments (0)

[link] The Roots of Progress 2024 in review
jasoncrawford · 2025-01-01T00:02:06.441Z · comments (0)

[link] Impact in AI Safety Now Requires Specific Strategic Insight
MiloSal (milosal) · 2024-12-29T00:40:53.780Z · comments (1)

[link] Letter from an Alien Mind
Shoshannah Tekofsky (DarkSym) · 2024-12-27T13:20:49.277Z · comments (7)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

bronson-schoen on Human takeover might be worse than AI takeover

so we can control values by controlling data.

What do you mean? As in you would filter specific data from the posttraining step? What would you be trying to prevent the model from learning specifically?

quetzal_rainbow on Daniel Tan's Shortform

We need to split "search" into more fine-grained concepts.

For example, "model has representation of the world and simulates counterfactual futures depending of its actions and selects action with the highest score over the future" is a one notion of search.

The other notion can be like this: imagine possible futures as a directed tree graph. This graph has set of axioms and derived theorems describing it. Some of the axioms/theorems are encoded in model. When model gets sensory input, it makes 2-3 inferences from combination of encoded theorems + input and selects action depending on the result of inference. While logically this situation is equivalent to some search over tree graph, mechanistically it looks like "bag of heuristics".

daniel-tan on Daniel Tan's Shortform

"Feature multiplicity" in language models.

This refers to the idea that there may be many representations of a 'feature' in a neural network.

Usually there will be one 'primary' representation, but there can also be a bunch of 'secondary' or 'dormant' representations.

If we assume the linear representation hypothesis, then there may be multiple direction in activation space that similarly produce a 'feature' in the output. E.g. the existence of 800 orthogonal steering vectors for code [LW · GW].

This is consistent with 'circuit formation' resulting in many different circuits / intermediate features [LW(p) · GW(p)], and 'circuit cleanup' happening only at grokking. Because we don't train language models till the grokking regime, 'feature multiplicity' may be the default state.

Feature multiplicity is one possible explanation for adversarial examples. In turn, adversarial defense procedures such as obfuscated adversarial training or multi-scale, multi-layer aggregation may work by removing feature multiplicity, such that the only 'remaining' feature direction is the 'primary' one.

Thanks to @Andrew Mack [LW · GW] for discussing this idea with me

wuschel-schulz on Activation space interpretability may be doomed

Really liked this post!

Just for my understanding:

You mention trans/cross-coders as possible solutions to the listed problems, but they also fall prey to issues 1 & 3, right?

Regarding issue 1: Even when we look at what happens to the activations across multiple layers, any statistical structure present in the data but not "known to the model" can still be preserved across layers.

For example: Consider a complicated curve in 2D space. If we have an MLP that simply rotates this 2D space, without any knowledge that the data falls on a curve, a Crosscoder trained on the pre-MLP & post-MLP residual stream would still decompose the curve into distinct features. Similarly, a Transcoder trained to predict the post-MLP from the pre-MLP residual stream would also use these distinct features and predict the rotated features from the non-rotated features.

Regarding issue 3: I also don't see how trans/cross-coders help here. If we have multiple layers where the {blue, red} ⊗ {square, circle} decomposition would be possible, I don't see why they would be more likely than classic SAEs to find this product structure rather than the composed representation.

quila on quila's Shortform

A big reason for this is logistics, as how you are getting to the fight can actually hamper you a lot, and this especially bites hard on offense, because it's easier to get supplies to your area than it is to get supplies to an offensive unit.

ah. for 'at optimality' which you wrote, i don't imagine it to take place on that high of a macroscopic level (the one on which 'supplies' could be transported), i think the limit is more things that look to us like the category of 'angling rays of light just right to cause distant matter to interact in such away as to create an atomic explosion, or some even more destructive reaction we don't yet know about, or to suddenly carve out a copy of itself there to start doing things locally', and also i'm not imagining the competitors being 'solid' macroscopic entities anymore, but rather being patterns imbued (and dispersed) in a relatively 'lower' level of physics (which also do not need 'supplies'). (edit: maybe this picture is wrong, at optimality you can maybe absorb the energy of such explosions / not be damaged by them, if you're not a macroscopic thing)

(i'm just exploring what it would be like to be clear, i don't think such conflicts will happen because i still expect just one optimal-level-agent to come from earth)

daniel-tan on Daniel Tan's Shortform

Why understanding planning / search might be hard

It's hypothesized that, in order to solve complex tasks, capable models perform implicit search during the forward pass. If so, we might hope to be able to recover the search representations from the model. There are examples of work that try to understand search in chess models and Sokoban models [LW · GW].

However I expect this to be hard for three reasons.

The model might just implement a bag of heuristics [LW · GW]. A patchwork collection of local decision rules might be sufficient for achieving high performance. This seems especially likely for pre-trained generative models [LW · GW].
Even if the model has a globally coherent search algorithm, it seems difficult to elucidate this without knowing the exact implementation (of which there can be many equivalent ones). For example, search over different subtrees may be parallelised [LW · GW] and subsequently merged into an overall solution.
The 'search' circuit may also not exist in a crisp form, but as a collection of many sub-components that do similar / identical things. 'Circuit cleanup' only happens in the grokking regime, and we largely do not train language models till they grok.

chris_leong on My Model of Epistemology

Well done for managing to push something out there. It's a good start, I'm sure you'll fill in some of the details with other posts over time.

mikbp on Is Musk still net-positive for humanity?

? I don't know Rosencranz.

I'm asking you because you say "Is it the case that the tech would exist without him? I think that's pretty unclear" and this, in my view, depends a lot on the answers to those questions.

Is China doing well in the EV space a bad thing?

The opposite, it is good. But if Musk did not have any influence on it, this diminishes Musk's positive impact in this field, making his impact less positive.

shankar-sivarajan on In Defense of a Butlerian Jihad

What do we privilege, the preference of doctors or the welfare of patients?
What is more important, educators preferences or quality of children education?

I understand you intended these questions to be rhetorical, but the answers you think are obvious: did you arrive at them through "pure reason," or by looking at what "democratic consensus" actually ended up with?

unexpectedvalues on How much do you believe your results?

I think this isn't the sort of post that ages well or poorly, because it isn't topical, but I think this post turned out pretty well. It gradually builds from preliminaries that most readers have probably seen before, into some pretty counterintuitive facts that aren't widely appreciated.

At the end of the post, I listed three questions and wrote that I hope to write about some of them soon. I never did, so I figured I'd use this review to briefly give my takes.

This comment [LW(p) · GW(p)] from Fabien Roger tests some of my modeling choices for robustness, and finds that the surprising results of Part IV hold up when the noise is heavier-tailed than the signal. (I'm sure there's more to be said here, but I probably don't have time to do more analysis by the end of the review period.,)
My basic take is that this really is a point in favor of well-evidenced interventions, but that the best-looking speculative interventions are nevertheless better. This is because I think "speculative" here mostly refers to partial measurement rather than noisy measurement. For example, maybe you can only foresee the first-order effects of an intervention, but not the second-order effects. If the first-order effect is a (known) quantity and the second-order effect is an (unknown) quantity $X_{2}$ , then modeling the second-order effect as zero (and thus estimating the quality of the intervention as $X_{1}$ ) isn't a noisy measurement; it's a partial measurement. It's still your best guess given the information you have.
1. I haven't thought this through very much. I expect good counter-arguments and counter-counter-arguments to exist here.
1. No -- or rather, only if the measurement is guaranteed to be exactly correct. To see this, observe that the variance of a noisy, unbiased measurement is greater than the variance of the quantity you're trying to measure (with equality only when the noise is zero), whereas the variance of a noiseless, partial measurement is less than the variance of the quantity you're trying to measure.
2. Real-world measurements are absolutely partial. They are, like, mind-bogglingly partial. This point deserves a separate post, but consider for instance the action of donating $5,000 to the Against Malaria Foundation. Maybe your measured effect from the RCT is that it'll save one life: 50 QALYs or so. But this measurement neglects the meat-eating problem [? · GW]: the expected-child you'll save will grow up to eat expected-meat from factory farms, likely causing a great amount of suffering. But then you remember: actually there's a chance that this child will have a one eight-billionth stake in determining the future of the lightcone. Oops, actually this consideration totally dominates the previous two. Does this child have better values than the average human? Again: mind-bogglingly partial!
  
  (The measurements are also, of course, noisy! RCTs are probably about as un-noisy as it gets: for example, making your best guess about the quality of an intervention by drawing inferences from uncontrolled macroeconomic data is much more noisy. So the answer is: generally both noisy and partial, but in some sense, much more partial than noisy -- though I'm not sure how much that comparison matters.)
3. The lessons of this post do not generalize to partial measurements at all! This post is entirely about noisy measurements. If you've partially measured the quality of an intervention, estimating the un-measured part using your prior will give you an estimate of intervention quality that you know is probably wrong, but the expected value of your error is zero.