LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

next page (older posts) →

[link] AI 2027: What Superintelligence Looks Like
Daniel Kokotajlo (daniel-kokotajlo) · 2025-04-03T16:23:44.619Z · comments (128)

LessWrong has been acquired by EA
habryka (habryka4) · 2025-04-01T13:09:11.153Z · comments (45)

VDT: a solution to decision theory
L Rudolf L (LRudL) · 2025-04-01T21:04:09.509Z · comments (18)

[link] Playing in the Creek
Hastings (hastings-greer) · 2025-04-10T17:39:28.883Z · comments (6)

[link] Thoughts on AI 2027
Max Harms (max-harms) · 2025-04-09T21:26:23.926Z · comments (43)

Why Have Sentence Lengths Decreased?
Arjun Panickssery (arjun-panickssery) · 2025-04-03T17:50:29.962Z · comments (50)

Short Timelines Don't Devalue Long Horizon Research
Vladimir_Nesov · 2025-04-09T00:42:07.324Z · comments (23)

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
John Hughes (john-hughes) · 2025-04-08T17:32:55.315Z · comments (11)

OpenAI #12: Battle of the Board Redux
Zvi · 2025-03-31T15:50:02.156Z · comments (1)

Learned pain as a leading cause of chronic pain
SoerenMind · 2025-04-09T11:57:58.523Z · comments (13)

New Cause Area Proposal
CallumMcDougall (TheMcDouglas) · 2025-04-01T07:12:34.360Z · comments (4)

Downstream applications as validation of interpretability progress
Sam Marks (samuel-marks) · 2025-03-31T01:35:02.722Z · comments (1)

AI 2027: Responses
Zvi · 2025-04-08T12:50:02.197Z · comments (3)

Among Us: A Sandbox for Agentic Deception
7vik (satvik-golechha) · 2025-04-05T06:24:49.000Z · comments (4)

The Lizardman and the Black Hat Bobcat
Screwtape · 2025-04-06T19:02:01.238Z · comments (13)

How I talk to those above me
Maxwell Peterson (maxwell-peterson) · 2025-03-30T06:54:59.869Z · comments (13)

Show, not tell: GPT-4o is more opinionated in images than in text
Daniel Tan (dtch1997) · 2025-04-02T08:51:02.571Z · comments (29)

How To Believe False Things
Eneasz · 2025-04-02T16:28:29.055Z · comments (10)

A Slow Guide to Confronting Doom
Ruby · 2025-04-06T02:10:56.483Z · comments (20)

Keltham's Lectures in Project Lawful
Morpheus · 2025-04-01T10:39:47.973Z · comments (3)

You will crash your car in front of my house within the next week
Richard Korzekwa (Grothor) · 2025-04-01T21:43:21.472Z · comments (6)

Announcing ILIAD2: ODYSSEY
Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2025-04-03T17:01:06.004Z · comments (1)

PauseAI and E/Acc Should Switch Sides
WillPetillo · 2025-04-01T23:25:51.265Z · comments (6)

How training-gamers might function (and win)
Vivek Hebbar (Vivek) · 2025-04-11T21:26:18.669Z · comments (4)

[link] New Paper: Infra-Bayesian Decision-Estimation Theory
Vanessa Kosoy (vanessa-kosoy) · 2025-04-10T09:17:38.966Z · comments (4)

[link] birds and mammals independently evolved intelligence
bhauth · 2025-04-08T20:00:05.100Z · comments (23)

Why does LW not put much more focus on AI governance and outreach?
Severin T. Seehrich (sts) · 2025-04-12T14:24:54.197Z · comments (28)

[link] Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
Adam Karvonen (karvonenadam) · 2025-04-14T17:38:02.918Z · comments (4)

One-shot steering vectors cause emergent misalignment, too
Jacob Dunefsky (jacob-dunefsky) · 2025-04-14T06:40:41.503Z · comments (5)

Disempowerment spirals as a likely mechanism for existential catastrophe
Raymond D · 2025-04-10T14:37:58.301Z · comments (4)

I'm resigning as Meetup Czar. What's next?
Screwtape · 2025-04-02T00:30:42.110Z · comments (2)

Will compute bottlenecks prevent a software intelligence explosion?
Tom Davidson (tom-davidson-1) · 2025-04-04T17:41:37.088Z · comments (2)

AI 2027: Dwarkesh’s Podcast with Daniel Kokotajlo and Scott Alexander
Zvi · 2025-04-07T13:40:05.944Z · comments (2)

AI CoT Reasoning Is Often Unfaithful
Zvi · 2025-04-04T14:50:05.538Z · comments (4)

[link] Google DeepMind: An Approach to Technical AGI Safety and Security
Rohin Shah (rohinmshah) · 2025-04-05T22:00:14.803Z · comments (12)

Renormalization Roadmap
Lauren Greenspan (LaurenGreenspan) · 2025-03-31T20:34:16.352Z · comments (7)

[link] How Gay is the Vatican?
rba · 2025-04-06T21:27:50.530Z · comments (32)

LLM AGI will have memory, and memory changes alignment
Seth Herd · 2025-04-04T14:59:13.070Z · comments (9)

Steelmanning heuristic arguments
Dmitry Vaintrob (dmitry-vaintrob) · 2025-04-13T01:09:33.392Z · comments (0)

Housing Roundup #11
Zvi · 2025-04-01T16:30:03.694Z · comments (1)

Consider showering
bohaska (Bohaska) · 2025-04-01T23:54:26.714Z · comments (15)

My "infohazards small working group" Signal Chat may have encountered minor leaks
Linch · 2025-04-02T01:03:05.311Z · comments (0)

OpenAI Responses API changes models' behavior
Jan Betley (jan-betley) · 2025-04-11T13:27:29.942Z · comments (6)

Alignment faking CTFs: Apply to my MATS stream
joshc (joshua-clymer) · 2025-04-04T16:29:02.070Z · comments (0)

AI #110: Of Course You Know…
Zvi · 2025-04-03T13:10:05.674Z · comments (9)

Introducing BenchBench: An Industry Standard Benchmark for AI Strength
Jozdien · 2025-04-02T02:11:41.555Z · comments (0)

We’re not prepared for an AI market crash
Remmelt (remmelt-ellen) · 2025-04-01T04:33:55.040Z · comments (12)

On Google’s Safety Plan
Zvi · 2025-04-11T12:51:12.112Z · comments (6)

A collection of approaches to confronting doom, and my thoughts on them
Ruby · 2025-04-06T02:11:31.271Z · comments (18)

Reactions to METR task length paper are insane
Cole Wyeth (Amyr) · 2025-04-10T17:13:36.428Z · comments (41)

next page (older posts) →

Archive

Recent comments

neel-nanda-1 on Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

My understanding is that there was a separate image model in historical vlms like flamingo but that it passed on a vector representation of the image not text

samuelshadrach on [Letter] Chinese Quickstart

Thanks for taking time to reply!

Yes OpenAI realtime API is really cool. When speaking to realtime API, I start each sentence with two words indicating what I want it to do. It's clunky but it works. "Translate Chinese, what is the time?" "Reply Chinese, how are you?" Ideally yes I could write an app to prepend the instruction audio to each sentence.

If I had this as higher priority I'd actually want to setup this Twilio app.

jiro on A Dissent on Honesty

He asks “How interested are you in Widgets?” He has learnt from previous job interviews that, if he answers honestly, the interviewer will think he is any of lying, insane, or too weird to deal with, and not hire him, even though this is not in the best financial interests of the company, were they fully informed.

By the standard "intentionally or knowingly cause the other person to have false beliefs", answering 'honestly' would be lying, and answering in a toned down way would not (because it maximizes the truth of the belief that the interviewer gets).

nick_tarleton on Unbendable Arm as Test Case for Religious Belief

(I have successfully done Unbendable Arm after Valentine showed me in person, though he didn't explain any of the mechanics. My experience of it didn't involve visualization, but felt like placing my fingertips on the wall across the room and resolving that they'd stay there. Contra jimmy's comment [LW(p) · GW(p)], IIRC I initially held my arm wrong without any cueing.)

Strongly related: Believing In [LW · GW]. From that post:

My guess is that for lack of good concepts for distinguishing “believing in” from deception, LessWrongers, EAs, and “nerds” in general are often both too harsh on folks doing positive-sum “believing in,” and too lax on folks doing deception. (The “too lax” happens because many can tell there’s a “believing in”-shaped gap in their notions of e.g. “don’t say better things about your start-up than a reasonable outside observer would,” but they can’t tell its exact shape, so they loosen their “don’t deceive” in general.)

I feel like this post is similarly too lax on, not deception, but propositional-and-false religious beliefs.

jenn on jenn's Shortform

this week's meetup is on the train to crazy town [? · GW]. it was fun putting together all the readings and discussion questions, and i'm optimistic about how the meetup's going to turn out! (i mean, in general, i don't run meetups i'm not optimistic about, so i guess that's not saying much.) im slightly worried about some folks coming in and just being like "this metaphor is entirely unproductive and sucks", should consider how to frame the meetup productively to such folks.

i think one of my strengths as an organizer is that ive read sooooo much stuff and so its relatively easy for me to pull together cohesive readings for any meetup. but ultimately im not sure if it's like, the most important work, to e.g. put together a bibliography of the crazy town idea and its various appearances since 2021. still, it's fun to do.

elityre on Eli's shortform feed

For the same reasons 'training an agent on a constitution that says to care about ' does not, at arbitrary capability levels, produce an agent that cares about $x$

Ok, but I'm trying to ask why not.

Here's the argument that I would make for why not, followed by why I'm skeptical of it right now.

New options for the AI will open up at high capability levels that were not available at lower capability levels. This could in principle lead to undefined behavior that deviates from what we intended.

More specifically, if it's the case that if...

The best / easiest-for-SGD-to-find way to compute corrigible outputs (as evaluated by the AI) is to reinforce an internal proxy measure that is correlated with corrigibility (as evaluated by the AI) in distribution, instead of to reinforce circuits that implement corrigibility more-or-less directly.
When the AI gains new options unlocked by new advanced capabilities, that proxy measure comes apart from corrigibility (as evaluated by the AI), in the limit of capabilities, so that the poxy measure is almost uncorrelated with corrigibility

...then the resulting system will not end up corrigible.

(Is this the argument that you would give, or is there another reason why you expect that "training an agent on a constitution that says to care about $x$ ' does not, at arbitrary capability levels, produce an agent that cares about $x$ "?)

But, at the moment, I'm skeptical of the above line of argument for several reasons.

I'm skeptical of the first premise, that the best way that SGD can find to produce corrigible (as evaluated by the AI) is to reinforce a proxy measure.
- I understand that natural selection, when shaping humans for inclusive genetic fitness, instilled in them a bunch of proxy-drives. But I think this analogy is misleading in several ways.
- Most relevantly, there's a genetic bottleneck, so evolution could only shape human behavior by selecting over genomes, and genomes don't encode that much knowledge about the world. If humans were born into the world with detailed world models, that included the concept of inclusive genetic fitness baked in, evolution would absolutely shaped humans to be inclusive fitness maximizers. AIs are "born into the world" with expansive world models that already include concepts like corrigibility (indeed, if they didn't, Constitutional AI wouldn't work at all). So it would be surprising if SGD opted to reinforce proxy measures instead of relying on the concepts directly.
We would run the constitutional AI reinforcement process continuously, in parallel with the capability improvements from the RL training.
- AI's capabilities increase, it will gain new options. If the AI is steering based on proxy measures, some of those options will involved the proxy coming apart from the target of the proxy. But when that starts to happen, the constitutional AI loop will exert an optimization pressure on the AI's internals to hit the target, not just the proxies.

Is this the main argument? What are other reasons to think that 'training an agent on a constitution that says to care about $x$ ' does not, at arbitrary capability levels, produce an agent that cares about $x$ ?

lblack on Lucius Bushnaq's Shortform

Nope. Try it out. If you attempt to split the activation vector into 1050 vectors for animals + attributes, you can't get the dictionary activations to equal the feature activations , $c_{i}^{'} (x)$ .

kairos_ on Mo Putera's Shortform

I believe the Scramblers from blindsight weren’t self aware, which means they couldn’t think about their own interactions with the world.

As I recall the crew was giving one of the Scramblers a series of cognitive tests. It aced all the tests that had to do with numbers and spatial reasoning, but failed a test that required the testee to be self aware.

thane-ruthenis on johnswentworth's Shortform

Oh, if you're in the business of compiling a comprehensive taxonomy of ways the current AI thing may be fake, you should also add:

Vibe coders and "10x'd engineers", who (on this model) would be falling into one of the failure modes outlined here [LW · GW]: producing applications/features that didn't need to exist, creating pointless code bloat (which helpfully show up in productivity metrics like "volume of code produced" or "number of commits"), or "automatically generating" entire codebases in a way that feels magical, then spending so much time bugfixing them it eats up ~all perceived productivity gains.
e/acc and other Twitter AI fans, who act like they're bleeding-edge transhumanist visionaries/analysts/business gurus/startup founders, but who are just shitposters/attention-seekers who will wander off and never look back the moment the hype dies down.

asta7k on Reactions to METR task length paper are insane

What are your current AGI timelines?