LessWrong 2.0 Reader

Why does LW not put much more focus on AI governance and outreach?
Severin T. Seehrich (sts) · 2025-04-12T14:24:54.197Z · comments (28)

[link] Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
Adam Karvonen (karvonenadam) · 2025-04-14T17:38:02.918Z · comments (4)

One-shot steering vectors cause emergent misalignment, too
Jacob Dunefsky (jacob-dunefsky) · 2025-04-14T06:40:41.503Z · comments (5)

Steelmanning heuristic arguments
Dmitry Vaintrob (dmitry-vaintrob) · 2025-04-13T01:09:33.392Z · comments (0)

Vestigial reasoning in RL
Caleb Biddulph (caleb-biddulph) · 2025-04-13T15:40:11.954Z · comments (7)

How I switched careers from software engineer to AI policy operations
Lucie Philippon (lucie-philippon) · 2025-04-13T06:37:33.507Z · comments (1)

[link] College Advice For People Like Me
henryj · 2025-04-12T14:36:46.643Z · comments (0)

Four Types of Disagreement
silentbob · 2025-04-13T11:22:38.466Z · comments (2)

Try training token-level probes
StefanHex (Stefan42) · 2025-04-14T11:56:23.191Z · comments (0)

[link] Sentinel's Global Risks Weekly Roundup #15/2025: Tariff yoyo, OpenAI slashing safety testing, Iran nuclear programme negotiations, 1K H5N1 confirmed herd infections.
NunoSempere (Radamantis) · 2025-04-14T19:11:20.977Z · comments (0)

MONA: Three Month Later - Updates and Steganography Without Optimization Pressure
David Lindner · 2025-04-12T23:15:07.964Z · comments (0)

Thoughts on the Double Impact Project
Mati_Roy (MathieuRoy) · 2025-04-13T19:07:57.687Z · comments (10)

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
Tomek Korbak (tomek-korbak) · 2025-04-14T16:45:46.584Z · comments (0)

[link] The 4-Minute Mile Effect
Parker Conley (parker-conley) · 2025-04-14T21:41:27.726Z · comments (2)

[link] Unbendable Arm as Test Case for Religious Belief
Ivan Vendrov (ivan-vendrov) · 2025-04-14T01:57:12.013Z · comments (24)

Will US tariffs push data centers for large model training offshore?
ChristianKl · 2025-04-12T12:47:12.917Z · comments (3)

The Internal Model Principle: A Straightforward Explanation
Alfred Harwood · 2025-04-12T10:58:51.479Z · comments (1)

Monthly Roundup #29: April 2025
Zvi · 2025-04-14T11:50:02.324Z · comments (4)

The Bell Curve of Bad Behavior
Screwtape · 2025-04-14T19:58:10.293Z · comments (3)

Offer: Team Conflict Counseling for AI Safety Orgs
Severin T. Seehrich (sts) · 2025-04-14T15:17:00.835Z · comments (1)

[question] What is autism?
Adam Zerner (adamzerner) · 2025-04-12T18:12:19.468Z · answers+comments (7)

Experts have it easy
beyarkay · 2025-04-12T19:32:17.158Z · comments (3)

The Last Light
Bridgett Kay (bridgett-kay) · 2025-04-14T15:41:02.745Z · comments (0)

Calling Bullshit - the Cheatsheet
Niklas Lehmann · 2025-04-12T11:43:23.822Z · comments (1)

[link] Slopworld 2035: The dangers of mediocre AI
titotal (lombertini) · 2025-04-14T13:14:08.390Z · comments (6)

What are good safety standards for open source AIs from China?
ChristianKl · 2025-04-12T13:06:16.663Z · comments (2)

[question] How likely are the USA to decay and how will it influence the AI development?
StanislavKrym · 2025-04-12T04:42:27.604Z · answers+comments (0)

What if there was a nuke in Manhattan and why that could be a good thing
Ratburn · 2025-04-15T00:19:41.844Z · comments (5)

Commitment Races are a technical problem ASI can easily solve
Knight Lee (Max Lee) · 2025-04-12T22:22:47.790Z · comments (5)

A Dissent on Honesty
eva_ · 2025-04-15T02:43:44.163Z · comments (1)

A Talmudic Rationalist Cautionary Tale
Noah Birnbaum (daniel-birnbaum) · 2025-04-15T04:11:16.972Z · comments (0)

[link] Distributed whistleblowing
samuelshadrach (xpostah) · 2025-04-12T06:36:05.952Z · comments (5)

The Structure of the Pain of Change
ReverendBayes (vedernikov-andrei) · 2025-04-13T21:51:53.823Z · comments (0)

[question] Does this game have a name?
Mis-Understandings (robert-k) · 2025-04-12T01:52:47.584Z · answers+comments (4)

Sam Altman's sister claims Sam sexually abused her -- Part 8: Timeline, continued
pythagoras5015 (pl5015) · 2025-04-14T17:42:53.705Z · comments (0)

[question] Is Local Order a Clue to Universal Entropy? How a Failed Professor Searches for a 'Sacred Motivational Order'
P. João (gabriel-brito) · 2025-04-12T13:39:55.857Z · answers+comments (2)

Creating 'Making God': a Feature Documentary on risks from AGI
Connor Axiotes (connor-axiotes-1) · 2025-04-15T02:56:09.206Z · comments (0)

The Era of the Dividual—are we falling apart?
James Stephen Brown (james-brown) · 2025-04-12T22:35:56.593Z · comments (2)

Self propagating story.
Canaletto (weightt-an) · 2025-04-12T12:32:21.312Z · comments (0)

Luna Lovegood and the Chamber of Secrets, Part 4
Kongo Landwalker (kongo-landwalker) · 2025-04-13T20:55:03.281Z · comments (0)

ACX Spring Meetup 2025 @ Klang Valley, Malaysia
Yi-Yang (yiyang) · 2025-04-12T07:31:16.434Z · comments (0)

Luna Lovegood and the Chamber of Secrets, Part 3
Kongo Landwalker (kongo-landwalker) · 2025-04-12T19:20:15.846Z · comments (0)

Intro to Multi-Agent Safety
james__p · 2025-04-13T17:40:41.128Z · comments (0)

Luna Lovegood and the Chamber of Secrets, Part 4
Kongo Landwalker (kongo-landwalker) · 2025-04-14T00:10:36.028Z · comments (0)

Sam Altman's sister claims Sam sexually abused her -- Part 7: Timeline, continued
pythagoras5015 (pl5015) · 2025-04-14T17:43:28.897Z · comments (0)

Religious Persistence: A Missing Primitive for Robust Alignment
lauriewired · 2025-04-14T22:03:45.868Z · comments (1)

Correcting Deceptive Alignment using a Deontological Approach
JeaniceK · 2025-04-14T22:07:57.860Z · comments (0)

$500 bounty for best short-form fiction about our near future world; $100 for recommending winning piece: new “Art of Near Future World” quarterly art project
Ramon Gonzalez (ramon-gonzalez) · 2025-04-15T00:46:10.637Z · comments (0)

Sam Altman's sister claims Sam sexually abused her -- Part 4: Timeline, continued
pythagoras5015 (pl5015) · 2025-04-13T23:41:55.411Z · comments (0)

Lightning Talks!
nathandunkerley · 2025-04-14T20:39:17.593Z · comments (0)


Recent comments

neel-nanda-1 on Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

My understanding is that there was a separate image model in historical VLMs like Flamingo, but that it passed on a vector representation of the image, not text.

samuelshadrach on [Letter] Chinese Quickstart

Thanks for taking time to reply!

Yes OpenAI realtime API is really cool. When speaking to realtime API, I start each sentence with two words indicating what I want it to do. It's clunky but it works. "Translate Chinese, what is the time?" "Reply Chinese, how are you?" Ideally yes I could write an app to prepend the instruction audio to each sentence.

If I had this as a higher priority, I'd actually want to set up this Twilio app.

jiro on A Dissent on Honesty

He asks “How interested are you in Widgets?” He has learnt from previous job interviews that, if he answers honestly, the interviewer will think he is any of lying, insane, or too weird to deal with, and not hire him, even though this is not in the best financial interests of the company, were they fully informed.

By the standard "intentionally or knowingly cause the other person to have false beliefs", answering 'honestly' would be lying, and answering in a toned down way would not (because it maximizes the truth of the belief that the interviewer gets).

nick_tarleton on Unbendable Arm as Test Case for Religious Belief

(I have successfully done Unbendable Arm after Valentine showed me in person, though he didn't explain any of the mechanics. My experience of it didn't involve visualization, but felt like placing my fingertips on the wall across the room and resolving that they'd stay there. Contra jimmy's comment, IIRC I initially held my arm wrong without any cueing.)

Strongly related: Believing In. From that post:

My guess is that for lack of good concepts for distinguishing “believing in” from deception, LessWrongers, EAs, and “nerds” in general are often both too harsh on folks doing positive-sum “believing in,” and too lax on folks doing deception. (The “too lax” happens because many can tell there’s a “believing in”-shaped gap in their notions of e.g. “don’t say better things about your start-up than a reasonable outside observer would,” but they can’t tell its exact shape, so they loosen their “don’t deceive” in general.)

I feel like this post is similarly too lax on, not deception, but propositional-and-false religious beliefs.

jenn on jenn's Shortform

this week's meetup is on the train to crazy town. it was fun putting together all the readings and discussion questions, and i'm optimistic about how the meetup's going to turn out! (i mean, in general, i don't run meetups i'm not optimistic about, so i guess that's not saying much.) im slightly worried about some folks coming in and just being like "this metaphor is entirely unproductive and sucks", should consider how to frame the meetup productively to such folks.

i think one of my strengths as an organizer is that ive read sooooo much stuff and so its relatively easy for me to pull together cohesive readings for any meetup. but ultimately im not sure if it's like, the most important work, to e.g. put together a bibliography of the crazy town idea and its various appearances since 2021. still, it's fun to do.

elityre on Eli's shortform feed

For the same reasons 'training an agent on a constitution that says to care about $x$' does not, at arbitrary capability levels, produce an agent that cares about $x$

Ok, but I'm trying to ask why not.

Here's the argument that I would make for why not, followed by why I'm skeptical of it right now.

New options for the AI will open up at high capability levels that were not available at lower capability levels. This could in principle lead to undefined behavior that deviates from what we intended.

More specifically, if it's the case that...

The best / easiest-for-SGD-to-find way to compute corrigible outputs (as evaluated by the AI) is to reinforce an internal proxy measure that is correlated with corrigibility (as evaluated by the AI) in distribution, instead of reinforcing circuits that implement corrigibility more-or-less directly.
When the AI gains new options unlocked by new advanced capabilities, that proxy measure comes apart from corrigibility (as evaluated by the AI), in the limit of capabilities, so that the proxy measure is almost uncorrelated with corrigibility

...then the resulting system will not end up corrigible.

(Is this the argument that you would give, or is there another reason why you expect that 'training an agent on a constitution that says to care about $x$' does not, at arbitrary capability levels, produce an agent that cares about $x$?)

But, at the moment, I'm skeptical of the above line of argument for several reasons.

I'm skeptical of the first premise, that the best way that SGD can find to produce corrigible outputs (as evaluated by the AI) is to reinforce a proxy measure.
- I understand that natural selection, when shaping humans for inclusive genetic fitness, instilled in them a bunch of proxy-drives. But I think this analogy is misleading in several ways.
- Most relevantly, there's a genetic bottleneck, so evolution could only shape human behavior by selecting over genomes, and genomes don't encode that much knowledge about the world. If humans were born into the world with detailed world models that included the concept of inclusive genetic fitness baked in, evolution would absolutely have shaped humans to be inclusive fitness maximizers. AIs are "born into the world" with expansive world models that already include concepts like corrigibility (indeed, if they didn't, Constitutional AI wouldn't work at all). So it would be surprising if SGD opted to reinforce proxy measures instead of relying on the concepts directly.
We would run the constitutional AI reinforcement process continuously, in parallel with the capability improvements from the RL training.
- As the AI's capabilities increase, it will gain new options. If the AI is steering based on proxy measures, some of those options will involve the proxy coming apart from the target of the proxy. But when that starts to happen, the constitutional AI loop will exert an optimization pressure on the AI's internals to hit the target, not just the proxies.

Is this the main argument? What are other reasons to think that 'training an agent on a constitution that says to care about $x$' does not, at arbitrary capability levels, produce an agent that cares about $x$?

lblack on Lucius Bushnaq's Shortform

Nope. Try it out. If you attempt to split the activation vector into 1050 vectors for animals + attributes, you can't get the dictionary activations to equal the feature activations, $c_{i}'(x)$.

kairos_ on Mo Putera's Shortform

I believe the Scramblers from Blindsight weren’t self-aware, which means they couldn’t think about their own interactions with the world.

As I recall, the crew was giving one of the Scramblers a series of cognitive tests. It aced all the tests that had to do with numbers and spatial reasoning, but failed a test that required the testee to be self-aware.

thane-ruthenis on johnswentworth's Shortform

Oh, if you're in the business of compiling a comprehensive taxonomy of ways the current AI thing may be fake, you should also add:

- Vibe coders and "10x'd engineers", who (on this model) would be falling into one of the failure modes outlined here: producing applications/features that didn't need to exist, creating pointless code bloat (which helpfully shows up in productivity metrics like "volume of code produced" or "number of commits"), or "automatically generating" entire codebases in a way that feels magical, then spending so much time bugfixing them it eats up ~all perceived productivity gains.
- e/acc and other Twitter AI fans, who act like they're bleeding-edge transhumanist visionaries/analysts/business gurus/startup founders, but who are just shitposters/attention-seekers who will wander off and never look back the moment the hype dies down.

asta7k on Reactions to METR task length paper are insane

What are your current AGI timelines?