LessWrong 2.0 Reader

Numberwang: LLMs Doing Autonomous Research, and a Call for Input
eggsyntax · 2025-01-16T17:20:37.552Z · comments (30)

[link] New o1-like model (QwQ) beats Claude 3.5 Sonnet with only 32B parameters
Jesse Hoogland (jhoogland) · 2024-11-27T22:06:12.914Z · comments (4)

The Hessian rank bounds the learning coefficient
Lucius Bushnaq (Lblack) · 2024-08-08T20:55:36.960Z · comments (9)

Different senses in which two AIs can be “the same”
Vivek Hebbar (Vivek) · 2024-06-24T03:16:43.400Z · comments (1)

Showing SAE Latents Are Not Atomic Using Meta-SAEs
Bart Bussmann (Stuckwork) · 2024-08-24T00:56:46.048Z · comments (10)

Understanding SAE Features with the Logit Lens
Joseph Bloom (Jbloom) · 2024-03-11T00:16:57.429Z · comments (0)

When Are Circular Definitions A Problem?
johnswentworth · 2024-05-28T20:00:23.408Z · comments (15)

[link] GPT-4o System Card
Zach Stein-Perlman · 2024-08-08T20:30:52.633Z · comments (11)

Heritability: Five Battles
Steven Byrnes (steve2152) · 2025-01-14T18:21:17.756Z · comments (18)

Indecision and internalized authority figures
Kaj_Sotala · 2024-07-06T10:10:02.528Z · comments (1)

Generalized Stat Mech: The Boltzmann Approach
David Lorell · 2024-04-12T17:47:31.880Z · comments (7)

EIS XIV: Is mechanistic interpretability about to be practically useful?
scasper · 2024-10-11T22:13:51.033Z · comments (4)

Estimating Tail Risk in Neural Networks
Mark Xu (mark-xu) · 2024-09-13T20:00:06.921Z · comments (9)

Personal AI Planning
jefftk (jkaufman) · 2024-11-10T14:00:06.837Z · comments (10)

Duct Tape security
Isaac King (KingSupernova) · 2024-04-26T18:57:05.659Z · comments (11)

Why Large Bureaucratic Organizations?
johnswentworth · 2024-08-27T18:30:07.422Z · comments (52)

AI #79: Ready for Some Football
Zvi · 2024-08-29T13:30:10.902Z · comments (16)

Brief notes on the Wikipedia game
Olli Järviniemi (jarviniemi) · 2024-07-14T02:28:22.473Z · comments (9)

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Joar Skalse (Logical_Lunatic) · 2024-05-17T19:13:31.380Z · comments (10)

[link] Learn to write well BEFORE you have something worth saying
eukaryote · 2024-12-29T23:42:31.906Z · comments (18)

Timaeus is hiring!
Jesse Hoogland (jhoogland) · 2024-07-12T23:42:28.651Z · comments (6)

o1-preview is pretty good at doing ML on an unknown dataset
Håvard Tveit Ihle (havard-tveit-ihle) · 2024-09-20T08:39:49.927Z · comments (1)

Chance is in the Map, not the Territory
Daniel Herrmann (Whispermute) · 2025-01-13T19:17:15.843Z · comments (17)

Retrospective: 12 [sic] Months Since MIRI
james.lucassen · 2025-01-21T02:52:06.271Z · comments (0)

Some articles in “International Security” that I enjoyed
Buck · 2025-01-31T16:23:27.061Z · comments (3)

Intricacies of Feature Geometry in Large Language Models
7vik (satvik-golechha) · 2024-12-07T18:10:51.375Z · comments (0)

[link] Anthropic leadership conversation
Zach Stein-Perlman · 2024-12-20T22:00:45.229Z · comments (17)

Stream Entry
lsusr · 2025-01-07T23:56:13.530Z · comments (7)

Friendship is transactional, unconditional friendship is insurance
Ruby · 2024-07-17T22:52:41.967Z · comments (24)

[link] Yudkowsky on The Trajectory podcast
Seth Herd · 2025-01-24T19:52:15.104Z · comments (36)

Introducing AI-Powered Audiobooks of Rational Fiction Classics
Askwho · 2024-05-04T17:32:49.719Z · comments (14)

What and Why: Developmental Interpretability of Reinforcement Learning
Garrett Baker (D0TheMath) · 2024-07-09T14:09:40.649Z · comments (4)

minutes from a human-alignment meeting
bhauth · 2024-05-24T05:01:53.904Z · comments (4)

AE Studio @ SXSW: We need more AI consciousness research (and further resources)
AE Studio (AEStudio) · 2024-03-26T20:59:09.129Z · comments (8)

"Fractal Strategy" workshop report
Raemon · 2024-04-06T21:26:53.263Z · comments (23)

[link] Open Source Automated Interpretability for Sparse Autoencoder Features
kh4dien · 2024-07-30T21:11:36.866Z · comments (1)

Inference-Time-Compute: More Faithful? A Research Note
James Chua (james-chua) · 2025-01-15T04:43:00.631Z · comments (9)

SB 1047 Is Weakened
Zvi · 2024-06-06T13:40:41.547Z · comments (4)

[link] The 2nd Demographic Transition
Maxwell Tabarrok (maxwell-tabarrok) · 2024-04-06T14:10:13.095Z · comments (17)

[link] The economics of space tethers
harsimony · 2024-08-22T16:15:22.699Z · comments (22)

Ophiology (or, how the Mamba architecture works)
Danielle Ensign (phylliida-dev) · 2024-04-09T19:31:09.975Z · comments (8)

Kessler's Second Syndrome
Jesse Hoogland (jhoogland) · 2025-01-26T07:04:17.852Z · comments (2)

SAEs are highly dataset dependent: a case study on the refusal direction
Connor Kissane (ckkissane) · 2024-11-07T05:22:18.807Z · comments (4)

Schelling game evaluations for AI control
Olli Järviniemi (jarviniemi) · 2024-10-08T12:01:24.389Z · comments (5)

SAEs (usually) Transfer Between Base and Chat Models
Connor Kissane (ckkissane) · 2024-07-18T10:29:46.138Z · comments (0)

[link] On Shifgrethor
JustisMills · 2024-10-27T15:30:13.688Z · comments (18)

[link] Paper: Open Problems in Mechanistic Interpretability
Lee Sharkey (Lee_Sharkey) · 2025-01-29T10:25:54.727Z · comments (0)

Advice to junior AI governance researchers
[deleted] · 2024-07-08T19:19:07.316Z · comments (1)

How to be an amateur polyglot
arisAlexis (arisalexis) · 2024-05-08T15:08:11.404Z · comments (16)

The Third Fundamental Question
Screwtape · 2024-11-15T04:01:33.770Z · comments (7)

Recent comments

thomas-kwa on OpenAI releases deep research agent

I'm not happy about this but it seems basically priced in, so not much update on p(doom).

We will soon have Bayesian updates to make. If we observe that incentives created during end-to-end RL naturally produce goal guarding and other dangerous cognitive properties, it will be bad news. If we observe this doesn't happen, it will be good news (although not very good news because web research seems like it doesn't require the full range of agency).

Likewise, if we observe models' monitorability and interpretability start to tank as they think in neuralese, it will be bad news. If monitoring and interpretability are unaffected, good news.

Interesting times.

transhumanist_atom_understander on The Self-Reference Trap in Mathematics

Yeah, just went through this whole same line of evasion. Alright, the Collatz conjecture will never be "proved" in this restrictive sense, and neither will the Steve conjecture or the irrationality of √2. Do we care? It may still be proved according to the ordinary meaning.

nathan-helm-burger on Keeping Capital is the Challenge

How much of their original capital did the French nobility retain at the end of the French revolution?

How much capital (value of territorial extent) do chimpanzees retain now as compared to 20k years ago?

vaniver on Alexander Gietelink Oldenziel's Shortform

I haven't thought deeply about this specific case, but I think you should consider this like any other ablation study--like, what happens if you replace the SAE with a linear probe?

screwtape on Pick two: concise, comprehensive, or clear rules

Eh, I think unclear rules and high standards are fine for some purposes. Take a fiction magazine. Good ones have a high standard for what they publish, and (apart from some formatting and wordcount rules) the main rule is it has to fit the editor's taste. The same is true for scientific publications.

> I understand the motivation behind this, but there is little warning that this is how the forum works.

I mildly disagree with this. The New Users Guide says

> LessWrong is a pretty particular place. We strive to maintain a culture that's uncommon for web forums and to stay true to our values. Recently, many more people have been finding their way here, so I (lead admin and moderator) put together this intro to what we're about.
> My hope is that if LessWrong resonates with your values and interests, this guide will help you become a valued member of community. And if LessWrong isn't the place for you, this guide will help you have a good "visit" or simply seek other pastures.

On the margin, is there room for improvement? Seems likely, but doesn't seem bad. If I was in charge I'd be tempted to open the New Users Guide with like, four bullet points that said 'This place is for aspiring rationalists, don't say false things, don't be a jerk, for examples of what we mean by that read on.' That's somewhat stylistic though.

> There is no warning that trying to contribute in good faith isn't sufficient

Wait, now I'm confused. Most forums I'm aware of don't have much of a Good Faith defense. I looked up the rules for the first one I thought of, Giant In The Playground, and while it's leaning a bit more Comprehensive and Clear I don't see a place where it says if you break a rule in good faith you're fine.

In general, someone trying to contribute to a thing but doing so badly doesn't get that much of a pass? Like, I've been politely ejected from a singing group before because I was badly off-key. Nobody doubted I was trying to sing well! It doesn't change the fact that the group wanted to have everyone singing the right notes.

> I suggest that instead of making rate-limited users (who used up their rate) unable to comment at all, their additional comments should be invisible, but still visible to other rate-limited users (and users who choose to see them).

Meh. The internet is big. If the kind of thing that got someone rate-limited on LessWrong got them rate-limited or banned everywhere else, I'd be supportive of having somewhere they were allowed to post. Reddit's right over there, you know?

I think giving special emphasis to rate-limited users for rate-limited users is straightforwardly a bad idea. If someone got rate-limited, in general I assume it's because they were writing in ways the mods and/or other users thought they shouldn't. If someone is going to stick around, I want their attention on people doing well, not doing badly. Imagine a basketball practice; if I'm a lousy shot, the coach might tell me to sit out the drill and watch a couple of the good players for a few minutes. If I'm really bad, I get cut from the team. No coach is going to say, "hey, you're a lousy shot, so pay special attention to these other players who are just as bad as you."

A big component of this is I tend to think of LessWrong as a place I go to get better at a kind of mental skill, hence analogies to choir or basketball practice. You may have other goals here.

ozziegooen on Gradual Disempowerment, Shell Games and Flinches

I think my quick guess is that what's going on is something like:
- People generally have a ton of content to potentially consume and limited time, and are thus really picky.
- Researchers often have unique models and a lot of specific nuances they care about.
- Most research of this type is really bad. Tons of people on Twitter now seem to have some big-picture theory of what AI will do to civilization.
- Researchers also have the curse of knowledge, and think their work is simpler than it is.

So basically, people aren't flinching because of bizarre and specific epistemic limitations. It's more like,
> "this seems complicated, learning it would take effort, my prior is that this is fairly useless anyway, so I'll be very quick to dismiss this."

My quick impression is that this is a brutal and highly significant limitation of this kind of research. It's just incredibly expensive for others to read and evaluate, so it's very common for it to get ignored. (I learned this in part from putting a lot of similar work out there myself, then seeing it get ignored.)

Related to this -
I'd predict that if you improved the arguments by 50%, it would lead to little extra uptake. But if you got someone really prestigious to highly recommend it, then suddenly a bunch of people would be much more convinced.

faul_sname on Daniel Kokotajlo's Shortform

> And all of this is livestreamed on Twitch

Also, each agent has a bank account which they can receive donations/transfers into (I think Twitch makes this easy?) and from which they can e.g. send donations to GiveDirectly if that's what they want to do.

One minor implementation wrinkle for anyone implementing this is that "move money from a bank account to a recipient by using text fields found on the web" usually involves writing your payment info into said text fields in a way that would be visible when streaming your screen. I'm not sure if any of the popular agent frameworks have good tooling around only including specific sensitive information in the context while it is directly relevant to the model task and providing hook points on when specific pieces of information enter and leave the context - I don't see any such thing in e.g. the Aider docs - and I think that without such tooling, using payment info in a way that won't immediately be stolen by stream viewers would be a bit challenging.
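To make the kind of tooling described above concrete, here is a minimal sketch. It is not taken from any existing agent framework: `ScopedContext`, its `secret` scope, and the `on_enter`/`on_leave` hook lists are all hypothetical names invented for illustration. The idea is that a sensitive value is part of the rendered context only while a scope is active, and that hooks fire when it enters and leaves, so that something like a stream overlay could be blanked during that window.

```python
from contextlib import contextmanager
from typing import Callable, Dict, List

Hook = Callable[[str], None]


class ScopedContext:
    """Hypothetical prompt context; secrets exist only inside a scope."""

    def __init__(self) -> None:
        self._public: List[str] = []
        self._secrets: Dict[str, str] = {}
        self.on_enter: List[Hook] = []  # called with the secret's name
        self.on_leave: List[Hook] = []  # called with the secret's name

    def add(self, text: str) -> None:
        """Append non-sensitive text that stays in the context."""
        self._public.append(text)

    @contextmanager
    def secret(self, name: str, value: str):
        """Expose `value` in the context only for the duration of the block."""
        self._secrets[name] = value
        for hook in self.on_enter:
            hook(name)
        try:
            yield
        finally:
            # Remove the secret even if the body raised.
            del self._secrets[name]
            for hook in self.on_leave:
                hook(name)

    def render(self) -> str:
        """Build the text that would be sent to the model right now."""
        lines = list(self._public)
        lines += [f"{k}={v}" for k, v in self._secrets.items()]
        return "\n".join(lines)


# Usage: record hook events and compare the context inside and outside the scope.
ctx = ScopedContext()
events = []
ctx.on_enter.append(lambda name: events.append(("enter", name)))
ctx.on_leave.append(lambda name: events.append(("leave", name)))

ctx.add("Task: send a small donation")
with ctx.secret("card_number", "4242-EXAMPLE"):
    inside = ctx.render()   # includes the placeholder card number
outside = ctx.render()      # the placeholder is gone again
```

In a real setup the hooks would presumably pause or mask the stream rather than append to a list, and the harder problem (keeping the value out of anything rendered on screen, such as the web form itself) is not addressed by this sketch.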

mishka on The Self-Reference Trap in Mathematics

Yes, the technique of formal proofs, in effect, involves translation of high-level proofs into arithmetic.

So self-reference is fully present (that's why we have Gödel's results and other similar results).

What this implies, in particular, is that one can reduce a "real proof" to arithmetic; this would be ugly, and one should not do it in one's informal mathematical practice; but your post is not talking about pragmatics, you are referencing a "fundamental limit of self-reference".

And, certainly, there are some interesting fundamental limits of self-reference (that's why we have algorithmically undecidable problems and such). But this is different from issues of pragmatic math techniques.

What high-level abstraction buys us is a lot of structure and intuition. The constraints related to staying within arithmetic are pragmatic, and not fundamental (without high-level abstractions one loses some very powerful ways to structure things and to guide our intuition, and things stop being comprehensible to a human mind).

steve2152 on The Self-Reference Trap in Mathematics

Your post purports to conclude: “That's why [the Collatz conjecture] will never be solved”.

Do you think it would also be correct to say: “That's why [the Steve conjecture] will never be solved”?

If yes, then I think you’re using the word “solved” in an extremely strange and misleading way.

If no, then you evidently messed up, because your argument does not rely on any property of the Collatz conjecture that is not equally true of the Steve conjecture.

bhauth on grey goo is unlikely

> make fewer points, selected carefully to be bulletproof, understandable to non-experts, and important to the overall thesis

That conflicts with eg:

> If you replied with this, I would have said something like "then what's wrong with the designs for diamond mechanosynthesis tooltips, which don't resemble enzymes

Anyway, I already answered that in 9. diamond.