LessWrong 2.0 Reader

The Quantum Mars Teleporter: An Empirical Test Of Personal Identity Theories
avturchin · 2025-01-22T11:48:46.071Z · comments (18)

Outlaw Code
scarcegreengrass · 2025-01-30T23:41:57.239Z · comments (1)

[link] Training Data Attribution: Examining Its Adoption & Use Cases
Deric Cheng (deric-cheng) · 2025-01-22T15:41:19.744Z · comments (0)

[link] Gradual Disempowerment: Simplified
Annapurna (jorge-velez) · 2025-02-22T16:59:39.072Z · comments (1)

[link] Metaculus Q4 AI Benchmarking: Bots Are Closing The Gap
Molly (hickman-santini) · 2025-02-19T22:42:39.055Z · comments (0)

[link] Teaching AI to reason: this year's most important story
Benjamin_Todd · 2025-02-13T17:40:02.869Z · comments (0)

Dovetail's agent foundations fellowship talks & discussion
Alex_Altair · 2025-02-13T00:49:48.854Z · comments (0)

DeepSeek-R1 for Beginners
Anton Razzhigaev (anton-razzhigaev) · 2025-02-05T18:58:16.282Z · comments (0)

The Human Alignment Problem for AIs
rife (edgar-muniz) · 2025-01-22T04:06:10.872Z · comments (5)

[question] A Simulation of Automation economics?
qbolec · 2025-02-10T08:11:04.424Z · answers+comments (1)

[link] What are the differences between AGI, transformative AI, and superintelligence?
Vishakha (vishakha-agrawal) · 2025-01-23T10:03:31.886Z · comments (3)

[link] Predation as Payment for Criticism
Benquo · 2025-01-30T01:06:27.591Z · comments (6)

AXRP Episode 38.6 - Joel Lehman on Positive Visions of AI
DanielFilan · 2025-01-24T23:00:07.562Z · comments (0)

[link] OpenAI lied about SFT vs. RLHF
sanxiyn · 2025-02-10T03:24:16.625Z · comments (2)

[link] DeepSeek Made it Even Harder for US AI Companies to Ever Reach Profitability
garrison · 2025-02-19T21:02:42.879Z · comments (1)

[link] what an efficient market feels from inside
DMMF · 2025-02-25T02:38:40.129Z · comments (0)

Transformer Dynamics: a neuro-inspired approach to MechInterp
guitchounts · 2025-02-22T21:33:23.855Z · comments (0)

Call for Applications: XLab Summer Research Fellowship
JoNeedsSleep (joanna-j-1) · 2025-02-18T19:19:20.155Z · comments (0)

Revealing alignment faking with a single prompt
Florian_Dietz · 2025-01-29T21:01:15.000Z · comments (5)

BIDA Calendar iCal Feed
jefftk (jkaufman) · 2025-02-06T01:30:07.887Z · comments (0)

SWE Automation Is Coming: Consider Selling Your Crypto
A_donor · 2025-02-13T20:17:59.227Z · comments (8)

Some Theses on Motivational and Directional Feedback
abstractapplic · 2025-02-02T22:50:04.270Z · comments (3)

[link] Introduction to Expected Value Fanaticism
Petra Kosonen · 2025-02-14T19:05:26.556Z · comments (8)

What We Can Do to Prevent Extinction by AI
Joe Rogero · 2025-02-24T17:15:07.109Z · comments (0)

Machine Unlearning in Large Language Models: A Comprehensive Survey with Empirical Insights from the Qwen 1.5 1.8B Model
Saketh Baddam (saketh-baddam) · 2025-02-01T21:26:58.171Z · comments (2)

Liron Shapira vs Ken Stanley on Doom Debates. A review
TheManxLoiner · 2025-01-24T18:01:56.646Z · comments (0)

Recursive Self-Modeling as a Plausible Mechanism for Real-time Introspection in Current Language Models
rife (edgar-muniz) · 2025-01-22T18:36:45.226Z · comments (5)

If Neuroscientists Succeed
Mordechai Rorvig (mordechai-rorvig) · 2025-02-11T15:33:09.098Z · comments (6)

[link] AISN #46: The Transition
Corin Katzke (corin-katzke) · 2025-01-23T18:09:36.858Z · comments (0)

[link] Links and short notes, 2025-01-26: Atlas Shrugged and the irreplaceable founder, pumping stations and civic pride, and thoughts on the eve of AGI
jasoncrawford · 2025-01-26T20:52:51.416Z · comments (1)

[link] Understanding Benchmarks and motivating Evaluations
markov (markovial) · 2025-02-06T01:32:49.331Z · comments (0)

How *exactly* can AI take your job in the next few years?
Ansh Juneja (ansh-juneja) · 2025-01-30T02:33:13.475Z · comments (0)

A fable on AI x-risk
bgaesop · 2025-02-18T20:15:24.933Z · comments (4)

Starting an Egan High School
Chris Wintergreen · 2025-01-26T19:02:17.658Z · comments (2)

Talking to laymen about AI development
David Steel · 2025-02-17T18:42:23.289Z · comments (0)

'High-Level Machine Intelligence' and 'Full Automation of Labor' in the AI Impacts Surveys
Jeffrey Heninger (jeffrey-heninger) · 2025-02-07T20:40:52.388Z · comments (0)

Reconceptualizing the Nothingness and Existence
Htarlov (htarlov) · 2025-01-28T20:29:44.390Z · comments (1)

[link] Are SAE features from the Base Model still meaningful to LLaVA?
Shan23Chen (shan-chen) · 2025-02-18T22:16:14.449Z · comments (2)

The Structure of Professional Revolutions
SebastianG (JohnBuridan) · 2025-02-09T13:23:01.059Z · comments (0)

[link] Progress links and short notes, 2025-02-17
jasoncrawford · 2025-02-17T19:18:29.422Z · comments (0)

Sleeping Beauty: an Accuracy-based Approach
glauberdebona · 2025-02-10T15:40:29.619Z · comments (2)

THE ARCHIVE
Jason Reid (jason-reid) · 2025-02-17T01:12:41.486Z · comments (0)

[link] Cooperation for AI safety must transcend geopolitical interference
Matrice Jacobine · 2025-02-16T18:18:01.539Z · comments (6)

[question] Supposing that the "Dead Internet Theory" is true or largely true, how can we act on that information?
SpectrumDT · 2025-01-27T16:47:01.338Z · answers+comments (5)

Exploring how OthelloGPT computes its world model
JMaar (jim-maar) · 2025-02-02T21:29:09.433Z · comments (0)

What working on AI safety taught me about B2B SaaS sales
purple fire (jack-edwards) · 2025-02-04T20:50:19.990Z · comments (12)

Conditional Importance in Toy Models of Superposition
james__p · 2025-02-02T20:35:38.655Z · comments (2)

Post-hoc reasoning in chain of thought
Kyle Cox (klye) · 2025-02-05T18:58:29.802Z · comments (0)

Make Superintelligence Loving
Davey Morse (davey-morse) · 2025-02-21T06:07:17.235Z · comments (9)

Goals don't necessarily start to crystallize the moment AI is capable enough to fake alignment
Mikhail Samin (mikhail-samin) · 2025-02-08T23:44:46.081Z · comments (0)

Recent comments

joshc on Training AI to do alignment research we don’t already know how to do

It means something closer to "very subtly bad in a way that is difficult to distinguish from quality work". Where the second part is the important part.

I think my arguments still hold in this case though right?

i.e. we are training models so they try to improve their work and identify these subtle issues -- and so if they actually behave this way they will find these issues insofar as humans identify the subtle mistakes they make.

My guess is that your core mistake is here

I agree there are lots of "messy in between places," but these are also alignment failures we see in humans.

And if humans had a really long time to do safety research, my guess is we'd be ok. Why? Like you said, there's a messy, complicated system of humans with different goals, but these systems empirically often move in reasonable and socially beneficial directions over time (governments get set up to deal with corrupt companies, new agencies get set up to deal with issues in governments, etc.).

And I expect we can make AI agents a lot more aligned than humans typically are. E.g. most humans don't actually care about the law etc., but Claude sure as hell seems to. If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law), then that seems to be a good state to be in.

seth-herd on Alignment can be the ‘clean energy’ of AI

Alignment -> capabilities seems like a really useful alternate framing for advocating for alignment. I particularly like incentivizing alignment as an additional method; it can be used in parallel with pushing for alignment as a requirement or moral obligation. They're not mutually exclusive.

Alignment -> capabilities is closely related to capabilities -> alignment. The same techniques could be invented for either purpose, then serve for the other.

I've been thinking a lot about this type of dual-use from the other side. New frontiers in CoT training open up techniques that will be re-used for alignment just because it's easy and useful to keep CoT and agents on-task.

System 2 Alignment [LW · GW] gives a bunch of current and likely near-future techniques that can be repurposed at very low cost to serve for real alignment when we get to truly goal-directed, competent agents.

OpenAI's deliberative alignment is one example, since it's probably exactly the same training method they used for CoT capabilities. It's not really helping with true alignment, just behavior control, but it could be just as easily used that way once there's a reason to worry about actual alignment of goal-directed agents.

purple-fire on Export Surplusses

This is simply incorrect (both your and Eliezer's explanations), and it shows a misunderstanding of basic macroeconomics that would be taught in an intro-level university course.

When countries run a trade surplus--positive net exports--there mechanically must be a net inflow of capital. When I export goods and services, foreign buyers need to exchange their currency for the domestic currency to purchase those exports. This causes the domestic currency to appreciate, which puts downward pressure on domestic real interest rates relative to other countries. Businesses in the exporter's country are thus able to invest more heavily in capital goods, raising their productivity and increasing long-term GDP growth.

jblack on Local Trust

In the rain forecaster example, it appears that the agent ("you") is more of an expert on Alice's calibration than Alice is. Is this intended?

jeremy-gillen on Training AI to do alignment research we don’t already know how to do

to the extent developers succeed in creating faithful simulators

There's a crux I have with Ryan which is "whether future capabilities will allow data-efficient long-horizon RL fine-tuning that generalizes well". As of last time we talked about it, Ryan says we probably will, I say we probably won't.

If we have the kind of generalizing ML that we can use to make faithful simulations, then alignment is pretty much solved. We make exact human uploads, and that's pretty much it. This is one end of the spectrum on this question.

There are weaker versions, which I think are what Ryan believes will be possible. In a slightly weaker case, you don't get something anywhere close to a human simulation, but you do get a machine that pursues the metric that you fine-tuned it to pursue, even out of distribution (with a relatively small amount of data).

But I think the evidence is against this. Long-horizon tasks are currently difficult to train on successfully unless you have dense intermediate feedback. Capabilities progress in the last decade has come from leaning heavily on dense intermediate feedback.

I expect long-horizon RL to remain pretty low data efficiency (i.e. take a lot of data before it generalizes well OOD).

@ryan_greenblatt [LW · GW]

thane-ruthenis on The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better

My impression is that Musk's behavior is incoherent at the best of times. IIRC, he did support SB 1047. But he seems perfectly willing to stand aside now as Trump repeals Biden's Executive Order and JD Vance spouts accelerationist anti-regulation rhetoric at the Paris AI Safety Summit. He seems to aim to win the AGI race instead, thinking misalignment risks aren't a concern/are solved.

lsusr on Test of the Bene Gesserit

Thank you for the kind words and the corrections. I have fixed the errors you pointed out.

jeremy-gillen on Training AI to do alignment research we don’t already know how to do

My guess is that your core mistake is here:

When I say agents are “not egregiously misaligned,” I mean they mostly perform their work earnestly – in the same way humans are mostly earnest and vaguely try to do their job. Maybe agents are a bit sycophantic, but not more than the humans whom they would replace. Therefore, if agents are consistently “not egregiously misaligned,” the situation is no worse than if humans performed their research instead.

Obviously, all agents having undergone training to look "not egregiously misaligned", will not look egregiously misaligned. You seem to be assuming that there is mostly a dichotomy between "not egregiously misaligned" and "conniving to satisfy some other set of preferences". But there are a lot of [LW · GW] messy places in between these two positions, including "I'm not really sure what I want" or <goals-that-are-highly-dependent-on-the-environment-e.g.-status-seeking>.

All AIs you train will be somewhere in this messy in-between place. What you are hoping for is that if you put a group of these together, they will "self-correct" and force/modify each other to keep pursuing the same goals-you-trained-them-to-look-like-they-wanted?

Is this basically correct? If so, this won't work just because this is absolute chaos and the goals-you-trained-them-to-look-like-they-wanted aren't enough to steer this chaotic system where you want it to go.

are these agents going to do sloppy research?

I think there were a few times where you are somewhat misreading your critics when they say "slop". It doesn't mean "bad". It means something closer to "very subtly bad in a way that is difficult to distinguish from quality work". Where the second part is the important part.

E.g. I find it difficult to use LLMs to help me do math or code weird algorithms, because they are good enough at outputting something that looks right. It feels like it takes longer to detect and fix their mistakes than it does to do it from scratch myself.