LessWrong 2.0 Reader

← previous page (newer posts) · next page (older posts) →

[question] Examples of Highly Counterfactual Discoveries?
johnswentworth · 2024-04-23T22:19:19.399Z · answers+comments (102)
Will alignment-faking Claude accept a deal to reveal its misalignment?
ryan_greenblatt · 2025-01-31T16:49:47.316Z · comments (28)
What's Going on With OpenAI's Messaging?
ozziegooen · 2024-05-21T02:22:04.171Z · comments (13)
My AI Model Delta Compared To Christiano
johnswentworth · 2024-06-12T18:19:44.768Z · comments (73)
Two easy things that maybe Just Work to improve AI discourse
Bird Concept (jacobjacob) · 2024-06-08T15:51:18.078Z · comments (35)
A basic systems architecture for AI agents that do autonomous research
Buck · 2024-09-23T13:58:27.185Z · comments (15)
On Not Pulling The Ladder Up Behind You
Screwtape · 2024-04-26T21:58:29.455Z · comments (21)
Information vs Assurance
johnswentworth · 2024-10-20T23:16:25.762Z · comments (17)
Skills from a year of Purposeful Rationality Practice
Raemon · 2024-09-18T02:05:58.726Z · comments (18)
Humming is not a free $100 bill
Elizabeth (pktechgirl) · 2024-06-06T20:10:02.457Z · comments (6)
Shallow review of technical AI safety, 2024
technicalities · 2024-12-29T12:01:14.724Z · comments (34)
Intention to Treat
Alicorn · 2025-03-20T20:01:19.456Z · comments (4)
Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer
johnswentworth · 2024-04-18T00:27:43.451Z · comments (21)
[link] OpenAI: Detecting misbehavior in frontier reasoning models
Daniel Kokotajlo (daniel-kokotajlo) · 2025-03-11T02:17:21.026Z · comments (25)
Contra papers claiming superhuman AI forecasting
nikos (followtheargument) · 2024-09-12T18:10:50.582Z · comments (16)
Struggling like a Shadowmoth
Raemon · 2024-09-24T00:47:05.030Z · comments (38)
Catastrophe through Chaos
Marius Hobbhahn (marius-hobbhahn) · 2025-01-31T14:19:08.399Z · comments (17)
This is already your second chance
Malmesbury (Elmer of Malmesbury) · 2024-07-28T17:13:57.680Z · comments (13)
Safety consultations for AI lab employees
Zach Stein-Perlman · 2024-07-27T15:00:27.276Z · comments (4)
[question] Why is o1 so deceptive?
abramdemski · 2024-09-27T17:27:35.439Z · answers+comments (24)
Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals
johnswentworth · 2025-01-24T20:20:28.881Z · comments (61)
What Is The Alignment Problem?
johnswentworth · 2025-01-16T01:20:16.826Z · comments (50)
Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations
Nicholas Goldowsky-Dill (nicholas-goldowsky-dill) · 2025-03-17T19:11:00.813Z · comments (7)
My motivation and theory of change for working in AI healthtech
Andrew_Critch · 2024-10-12T00:36:30.925Z · comments (37)
[link] FHI (Future of Humanity Institute) has shut down (2005–2024)
gwern · 2024-04-17T13:54:16.791Z · comments (22)
WTH is Cerebrolysin, actually?
gsfitzgerald (neuroplume) · 2024-08-06T20:40:53.378Z · comments (23)
Three Subtle Examples of Data Leakage
abstractapplic · 2024-10-01T20:45:27.731Z · comments (16)
[link] When Is Insurance Worth It?
kqr · 2024-12-19T19:07:32.573Z · comments (71)
Did Christopher Hitchens change his mind about waterboarding?
Isaac King (KingSupernova) · 2024-09-15T08:28:09.451Z · comments (22)
[link] o1: A Technical Primer
Jesse Hoogland (jhoogland) · 2024-12-09T19:09:12.413Z · comments (19)
Reconsider the anti-cavity bacteria if you are Asian
Lao Mein (derpherpize) · 2024-04-15T07:02:02.655Z · comments (43)
[link] Overcoming Bias Anthology
Arjun Panickssery (arjun-panickssery) · 2024-10-20T02:01:23.463Z · comments (14)
[link] Recommendation: reports on the search for missing hiker Bill Ewasko
eukaryote · 2024-07-31T22:15:03.174Z · comments (28)
How will we update about scheming?
ryan_greenblatt · 2025-01-06T20:21:52.281Z · comments (20)
[link] On the Rationality of Deterring ASI
Dan H (dan-hendrycks) · 2025-03-05T16:11:37.855Z · comments (34)
[link] Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
cloud · 2024-12-06T22:19:26.717Z · comments (12)
So how well is Claude playing Pokémon?
Julian Bradshaw · 2025-03-07T05:54:45.357Z · comments (74)
[link] Boycott OpenAI
PeterMcCluskey · 2024-06-18T19:52:42.854Z · comments (26)
You can remove GPT2’s LayerNorm by fine-tuning for an hour
StefanHex (Stefan42) · 2024-08-08T18:33:38.803Z · comments (11)
[link] Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data
Johannes Treutlein (Johannes_Treutlein) · 2024-06-21T15:54:41.430Z · comments (13)
The Summoned Heroine's Prediction Markets Keep Providing Financial Services To The Demon King!
abstractapplic · 2024-10-26T12:34:51.059Z · comments (16)
Announcing ILIAD — Theoretical AI Alignment Conference
Nora_Ammann · 2024-06-05T09:37:39.546Z · comments (18)
[link] Sycophancy to subterfuge: Investigating reward tampering in large language models
Carson Denison (carson-denison) · 2024-06-17T18:41:31.090Z · comments (22)
The Median Researcher Problem
johnswentworth · 2024-11-02T20:16:11.341Z · comments (70)
o1 is a bad idea
abramdemski · 2024-11-11T21:20:24.892Z · comments (39)
DeepMind's "Frontier Safety Framework" is weak and unambitious
Zach Stein-Perlman · 2024-05-18T03:00:13.541Z · comments (14)
Neutrality
sarahconstantin · 2024-11-13T23:10:05.469Z · comments (27)
[link] Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
Jan_Kulveit · 2025-01-30T17:03:45.545Z · comments (52)
Maximizing Communication, not Traffic
jefftk (jkaufman) · 2025-01-05T13:00:02.280Z · comments (10)
Deep Honesty
Aletheophile (aletheo) · 2024-05-07T20:31:48.734Z · comments (25)