LessWrong 2.0 Reader

An exhaustive list of cosmic threats
Jordan Stone (jordan-stone) · 2025-01-09T19:59:08.368Z · comments (2)

Proof Explained for "Robust Agents Learn Causal World Model"
Dalcy (Darcy) · 2024-12-22T15:06:16.880Z · comments (0)

[link] Fragile, Robust, and Antifragile Preference Satisfaction
adamShimi · 2024-11-02T17:25:55.986Z · comments (0)

Turning up the Heat on Deceptively-Misaligned AI
J Bostock (Jemist) · 2025-01-07T00:13:28.191Z · comments (16)

[link] GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
ChengCheng (ccstan99) · 2024-11-01T00:10:50.718Z · comments (0)

Announcing the CLR Foundations Course and CLR S-Risk Seminars
JamesFaville (elephantiskon) · 2024-11-19T01:18:10.085Z · comments (0)

[link] Genesis
PeterMcCluskey · 2024-12-31T22:01:17.277Z · comments (0)

[link] From the Archives: a story
Richard_Ngo (ricraz) · 2024-12-27T16:36:50.735Z · comments (1)

[link] AI & wisdom 3: AI effects on amortised optimisation
L Rudolf L (LRudL) · 2024-10-28T21:08:56.604Z · comments (0)

[link] AI & wisdom 2: growth and amortised optimisation
L Rudolf L (LRudL) · 2024-10-28T21:07:39.449Z · comments (0)

2024 NYC Secular Solstice & Megameetup
Joe Rogero · 2024-11-12T17:46:18.674Z · comments (0)

Monthly Roundup #25: December 2024
Zvi · 2024-12-23T14:20:04.682Z · comments (3)

Reality is Fractal-Shaped
silentbob · 2024-12-17T13:52:16.946Z · comments (1)

In the Name of All That Needs Saving
pleiotroth · 2024-11-07T15:26:12.252Z · comments (2)

Advisors for Smaller Major Donors?
jefftk (jkaufman) · 2024-11-06T14:30:06.187Z · comments (2)

Beliefs and state of mind into 2025
RussellThor · 2025-01-10T22:07:01.060Z · comments (7)

Fluoridation: The RCT We Still Haven't Run (But Should)
ChristianKl · 2025-01-11T21:02:47.483Z · comments (5)

[question] Does the "ancient wisdom" argument have any validity? If a particular teaching or tradition is old, to what extent does this make it more trustworthy?
SpectrumDT · 2024-11-04T15:20:14.822Z · answers+comments (49)

[link] Can o1-preview find major mistakes amongst 59 NeurIPS '24 MLSB papers?
Abhishaike Mahajan (abhishaike-mahajan) · 2024-12-18T14:21:03.661Z · comments (0)

The Alignment Mapping Program: Forging Independent Thinkers in AI Safety - A Pilot Retrospective
Alvin Ånestrand (alvin-anestrand) · 2025-01-10T16:22:16.905Z · comments (0)

[link] AI & Liability Ideathon
Kabir Kumar (kabir-kumar) · 2024-11-26T13:54:01.820Z · comments (2)

Economic Post-ASI Transition
[deleted] · 2025-01-01T22:37:31.722Z · comments (11)

[link] AI safety content you could create
Adam Jones (domdomegg) · 2025-01-06T15:35:56.167Z · comments (0)

Word Spaghetti
Gordon Seidoh Worley (gworley) · 2024-10-23T05:39:20.105Z · comments (9)

Rebuttals for ~all criticisms of AIXI
Cole Wyeth (Amyr) · 2025-01-07T17:41:10.557Z · comments (11)

Most Minds are Irrational
Davidmanheim · 2024-12-10T09:36:33.144Z · comments (4)

Proposal to increase fertility: University parent clubs
Fluffnutt (Pear) · 2024-11-18T04:21:26.346Z · comments (3)

Using Dangerous AI, But Safely?
habryka (habryka4) · 2024-11-16T04:29:20.914Z · comments (2)

Computational functionalism probably can't explain phenomenal consciousness
EuanMcLean (euanmclean) · 2024-12-10T17:11:28.044Z · comments (34)

OpenAI defected, but we can take honest actions
Remmelt (remmelt-ellen) · 2024-10-21T08:41:25.728Z · comments (16)

Heresies in the Shadow of the Sequences
Cole Wyeth (Amyr) · 2024-11-14T05:01:11.889Z · comments (12)

[link] Building AI safety benchmark environments on themes of universal human values
Roland Pihlakas (roland-pihlakas) · 2025-01-03T04:24:36.186Z · comments (3)

[link] some questionable space launch guns
bhauth · 2024-10-13T22:52:26.418Z · comments (0)

Incredibow
jefftk (jkaufman) · 2025-01-07T03:30:02.197Z · comments (3)

[question] What is the most impressive game LLMs can play well?
Cole Wyeth (Amyr) · 2025-01-08T19:38:18.530Z · answers+comments (3)

Everything you care about is in the map
Tahp · 2024-12-17T14:05:36.824Z · comments (27)

[link] We are in a New Paradigm of AI Progress - OpenAI's o3 model makes huge gains on the toughest AI benchmarks in the world
garrison · 2024-12-22T21:45:52.026Z · comments (3)

Should you have children? All LessWrong posts about the topic
Sherrinford · 2024-11-26T23:52:44.113Z · comments (0)

A Collection of Empirical Frames about Language Models
Daniel Tan (dtch1997) · 2025-01-02T02:49:05.965Z · comments (0)

[link] A primer on machine learning in cryo-electron microscopy (cryo-EM)
Abhishaike Mahajan (abhishaike-mahajan) · 2024-12-22T15:11:58.860Z · comments (0)

EC2 Scripts
jefftk (jkaufman) · 2024-12-10T03:00:01.906Z · comments (1)

[link] A Little Depth Goes a Long Way: the Expressive Power of Log-Depth Transformers
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-11-20T11:48:14.170Z · comments (0)

[link] Every niche event should also be a meetup
DMMF · 2024-11-19T20:47:50.053Z · comments (0)

Predicting AI Releases Through Side Channels
Reworr R (reworr-reworr) · 2025-01-07T19:06:41.584Z · comments (1)

Coin Flip
XelaP (scroogemcduck1) · 2024-12-27T11:53:01.781Z · comments (0)

Re Hanson's Grabby Aliens: Humanity is not a natural anthropic sample space
Lorec · 2024-12-09T18:07:23.510Z · comments (32)

Evolutionary prompt optimization for SAE feature visualization
neverix · 2024-11-14T13:06:49.728Z · comments (0)

[link] Don't Associate AI Safety With Activism
Eneasz · 2024-12-18T08:01:50.357Z · comments (15)

Current Attitudes Toward AI Provide Little Data Relevant to Attitudes Toward AGI
Seth Herd · 2024-11-12T18:23:53.533Z · comments (2)

Hiring a writer to co-author with me (Spencer Greenberg for ClearerThinking.org)
spencerg · 2024-10-27T17:34:50.479Z · comments (0)

Recent comments

mikhail-samin on No one has the ball on 1500 Russian olympiad winners who've received HPMOR

We also have 6k more copies (18k hard-cover books) left. We have no idea what to do with them. Suggestions are welcome.

Here's a map of the Russian libraries that requested copies of HPMOR and that we've sent 2126 copies to:

Sending HPMOR to random libraries is cool, but I hope someone comes up with better ways of spending the books.

xpostah on xpostah's Shortform

Pay for the OpenAI API using crypto. Use USDC on the Optimism rollup on Ethereum.

(Worst case if you're scammed you lose less than $0.10)

http://188.245.245.248:3000/sender.html

bronson-schoen on Human takeover might be worse than AI takeover

so we can control values by controlling data.

What do you mean? As in you would filter specific data from the posttraining step? What would you be trying to prevent the model from learning specifically?

quetzal_rainbow on Daniel Tan's Shortform

We need to split "search" into more fine-grained concepts.

For example, "the model has a representation of the world, simulates counterfactual futures depending on its actions, and selects the action with the highest score over the future" is one notion of search.

The other notion can be like this: imagine possible futures as a directed tree graph. This graph has a set of axioms and derived theorems describing it. Some of the axioms/theorems are encoded in the model. When the model gets sensory input, it makes 2-3 inferences from the combination of encoded theorems + input and selects an action depending on the result of the inference. While logically this situation is equivalent to some search over the tree graph, mechanistically it looks like a "bag of heuristics".
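A toy sketch of the distinction, not taken from the comment: the number-line world, `TARGET`, the action set, and both policy functions are hypothetical names chosen for illustration. One policy explicitly simulates every future and scores it; the other applies a single encoded rule to the input. In this tiny world they pick the same action in every state, even though only the first one mechanistically "searches".

```python
from itertools import product

ACTIONS = (-1, 0, +1)   # hypothetical moves on a number line
TARGET = 7              # hypothetical goal state


def score(state):
    """Higher is better: negative distance to the target."""
    return -abs(state - TARGET)


def rollout_score(state, seq):
    """Sum the scores of every state visited while following `seq`."""
    total = 0
    for action in seq:
        state += action
        total += score(state)
    return total


def search_policy(state, depth=3):
    """Explicit search: simulate all action sequences of length `depth`
    and return the first action of the best-scoring sequence."""
    best = max(product(ACTIONS, repeat=depth),
               key=lambda seq: rollout_score(state, seq))
    return best[0]


def heuristic_policy(state):
    """'Bag of heuristics': one encoded rule applied to the input,
    with no simulation of futures."""
    if state < TARGET:
        return +1
    if state > TARGET:
        return -1
    return 0


# Compare the two policies over a range of starting states.
agree = all(search_policy(s) == heuristic_policy(s) for s in range(-20, 21))
print("policies agree on every tested state:", agree)
```

Behaviourally the two are indistinguishable here, which is why inspecting input-output behaviour alone would not tell you which mechanism a model uses.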

daniel-tan on Daniel Tan's Shortform

"Feature multiplicity" in language models.

This refers to the idea that there may be many representations of a 'feature' in a neural network.

Usually there will be one 'primary' representation, but there can also be a bunch of 'secondary' or 'dormant' representations.

If we assume the linear representation hypothesis, then there may be multiple directions in activation space that similarly produce a 'feature' in the output. E.g. the existence of 800 orthogonal steering vectors for code.
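A minimal sketch of what "multiple directions producing the same feature" could look like under a purely linear readout. Everything here is an assumption for illustration: the 3-dimensional activation, the `readout` vector, and the two steering directions are made up, and real models are far from this simple.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical linear readout that defines the 'feature' in the output.
readout = (1.0, 1.0, 0.0)

# Two hypothetical steering directions that are orthogonal to each other.
steer_a = (1.0, 0.0, 0.0)
steer_b = (0.0, 1.0, 0.0)


def feature(activation):
    """Feature strength as seen by the linear readout."""
    return dot(readout, activation)


def steer(activation, direction, scale=1.0):
    """Add a scaled steering direction to an activation vector."""
    return tuple(a + scale * d for a, d in zip(activation, direction))


base = (0.25, -0.125, 0.5)  # hypothetical base activation

# How much each steering direction moves the feature.
shift_a = feature(steer(base, steer_a)) - feature(base)
shift_b = feature(steer(base, steer_b)) - feature(base)

print("orthogonal:", dot(steer_a, steer_b) == 0)
print("shift from steer_a:", shift_a)
print("shift from steer_b:", shift_b)
```

Both directions move the readout by the same amount while sharing no component with each other, so either could serve as "the" feature direction from the output's point of view.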

This is consistent with 'circuit formation' resulting in many different circuits / intermediate features, and 'circuit cleanup' happening only at grokking. Because we don't train language models till the grokking regime, 'feature multiplicity' may be the default state.

Feature multiplicity is one possible explanation for adversarial examples. In turn, adversarial defense procedures such as obfuscated adversarial training or multi-scale, multi-layer aggregation may work by removing feature multiplicity, such that the only 'remaining' feature direction is the 'primary' one.

Thanks to @Andrew Mack for discussing this idea with me.

wuschel-schulz on Activation space interpretability may be doomed

Really liked this post!

Just for my understanding:

You mention trans/cross-coders as possible solutions to the listed problems, but they also fall prey to issues 1 & 3, right?

Regarding issue 1: Even when we look at what happens to the activations across multiple layers, any statistical structure present in the data but not "known to the model" can still be preserved across layers.

For example: Consider a complicated curve in 2D space. If we have an MLP that simply rotates this 2D space, without any knowledge that the data falls on a curve, a Crosscoder trained on the pre-MLP & post-MLP residual stream would still decompose the curve into distinct features. Similarly, a Transcoder trained to predict the post-MLP from the pre-MLP residual stream would also use these distinct features and predict the rotated features from the non-rotated features.
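A toy check of the rotation example, under loud assumptions: the curve `(t, sin 3t)`, the angle `theta`, and the nearest-anchor coding are all hypothetical, and nearest-anchor quantisation is only a crude stand-in for a learned sparse dictionary, not a Crosscoder or Transcoder. The point it illustrates is that a rotation preserves the data's geometry, so a decomposition driven by that geometry carries over unchanged, even though the rotation itself "knows" nothing about the curve.

```python
import math


def rotate(point, theta):
    """Rotate a 2D point by `theta` radians (the stand-in 'MLP')."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


def nearest(point, anchors):
    """Index of the closest anchor: a crude data-driven 'feature' code."""
    return min(range(len(anchors)), key=lambda i: math.dist(point, anchors[i]))


# Hypothetical curve in 2D: points (t, sin 3t) for t in [0, 4).
curve = [(i / 50, math.sin(3 * i / 50)) for i in range(200)]

# 'Features' fitted to the data: every 25th curve point is an anchor.
anchors = curve[::25]

theta = 0.7  # hypothetical rotation angle
rotated_curve = [rotate(p, theta) for p in curve]
rotated_anchors = [rotate(a, theta) for a in anchors]

# Codes before the rotation and after it.
pre_codes = [nearest(p, anchors) for p in curve]
post_codes = [nearest(p, rotated_anchors) for p in rotated_curve]

print("same decomposition before and after:", pre_codes == post_codes)
```

The codes reflect structure in the data distribution, and the rotation passes that structure through untouched, which is the worry raised for issue 1.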

Regarding issue 3: I also don't see how trans/cross-coders help here. If we have multiple layers where the {blue, red} ⊗ {square, circle} decomposition would be possible, I don't see why they would be more likely than classic SAEs to find this product structure rather than the composed representation.

quila on quila's Shortform

A big reason for this is logistics, as how you are getting to the fight can actually hamper you a lot, and this especially bites hard on offense, because it's easier to get supplies to your area than it is to get supplies to an offensive unit.

ah. for 'at optimality' which you wrote, i don't imagine it to take place on that high of a macroscopic level (the one on which 'supplies' could be transported), i think the limit is more things that look to us like the category of 'angling rays of light just right to cause distant matter to interact in such a way as to create an atomic explosion, or some even more destructive reaction we don't yet know about, or to suddenly carve out a copy of itself there to start doing things locally', and also i'm not imagining the competitors being 'solid' macroscopic entities anymore, but rather being patterns imbued (and dispersed) in a relatively 'lower' level of physics (which also do not need 'supplies'). (edit: maybe this picture is wrong, at optimality you can maybe absorb the energy of such explosions / not be damaged by them, if you're not a macroscopic thing)

(i'm just exploring what it would be like; to be clear, i don't think such conflicts will happen because i still expect just one optimal-level-agent to come from earth)

daniel-tan on Daniel Tan's Shortform

Why understanding planning / search might be hard

It's hypothesized that, in order to solve complex tasks, capable models perform implicit search during the forward pass. If so, we might hope to be able to recover the search representations from the model. There are examples of work that tries to understand search in chess models and Sokoban models.

However, I expect this to be hard for three reasons.

1. The model might just implement a bag of heuristics. A patchwork collection of local decision rules might be sufficient for achieving high performance. This seems especially likely for pre-trained generative models.
2. Even if the model has a globally coherent search algorithm, it seems difficult to elucidate this without knowing the exact implementation (of which there can be many equivalent ones). For example, search over different subtrees may be parallelised and subsequently merged into an overall solution.
3. The 'search' circuit may also not exist in a crisp form, but as a collection of many sub-components that do similar / identical things. 'Circuit cleanup' only happens in the grokking regime, and we largely do not train language models till they grok.

chris_leong on My Model of Epistemology

Well done for managing to push something out there. It's a good start; I'm sure you'll fill in some of the details with other posts over time.

mikbp on Is Musk still net-positive for humanity?

? I don't know Rosencranz.

I'm asking you because you say "Is it the case that the tech would exist without him? I think that's pretty unclear" and this, in my view, depends a lot on the answers to those questions.

Is China doing well in the EV space a bad thing?

The opposite, it is good. But if Musk did not have any influence on it, this diminishes Musk's positive impact in this field, making his impact less positive.