LessWrong 2.0 Reader

There is way too much serendipity
Malmesbury (Elmer of Malmesbury) · 2024-01-19T19:37:57.068Z · comments (56)
[link] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
evhub · 2024-01-12T19:51:01.021Z · comments (94)
Gentleness and the artificial Other
Joe Carlsmith (joekc) · 2024-01-02T18:21:34.746Z · comments (33)
The case for ensuring that powerful AIs are controlled
ryan_greenblatt · 2024-01-24T16:11:51.354Z · comments (66)
MIRI 2024 Mission and Strategy Update
Malo (malo) · 2024-01-05T00:20:54.169Z · comments (44)
Toward A Mathematical Framework for Computation in Superposition
Dmitry Vaintrob (dmitry-vaintrob) · 2024-01-18T21:06:57.040Z · comments (17)
This might be the last AI Safety Camp
Remmelt (remmelt-ellen) · 2024-01-24T09:33:29.438Z · comments (33)
The impossible problem of due process
mingyuan · 2024-01-16T05:18:33.415Z · comments (63)
Introducing Alignment Stress-Testing at Anthropic
evhub · 2024-01-12T23:51:25.875Z · comments (23)
Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI
Jeremy Gillen (jeremy-gillen) · 2024-01-26T07:22:06.370Z · comments (60)
What’s up with LLMs representing XORs of arbitrary features?
Sam Marks (samuel-marks) · 2024-01-03T19:44:33.162Z · comments (61)
Apologizing is a Core Rationalist Skill
johnswentworth · 2024-01-02T17:47:35.950Z · comments (42)
[link] Making every researcher seek grants is a broken model
jasoncrawford · 2024-01-26T16:06:26.688Z · comments (41)
What good is G-factor if you're dumped in the woods? A field report from a camp counselor.
Hastings (hastings-greer) · 2024-01-12T13:17:23.829Z · comments (22)
Deep atheism and AI risk
Joe Carlsmith (joekc) · 2024-01-04T18:58:47.745Z · comments (22)
Processor clock speeds are not how fast AIs think
Ege Erdil (ege-erdil) · 2024-01-29T14:39:38.050Z · comments (55)
Notice When People Are Directionally Correct
Chris_Leong · 2024-01-14T14:12:37.090Z · comments (7)
The case for training frontier AIs on Sumerian-only corpus
Alexandre Variengien (alexandre-variengien) · 2024-01-15T16:40:22.011Z · comments (14)
An even deeper atheism
Joe Carlsmith (joekc) · 2024-01-11T17:28:31.843Z · comments (47)
A Shutdown Problem Proposal
johnswentworth · 2024-01-21T18:12:48.664Z · comments (61)
[link] Steering Llama-2 with contrastive activation additions
Nina Rimsky (NinaR) · 2024-01-02T00:47:04.621Z · comments (29)
Why I take short timelines seriously
NicholasKees (nick_kees) · 2024-01-28T22:27:21.098Z · comments (29)
Four visions of Transformative AI success
Steven Byrnes (steve2152) · 2024-01-17T20:45:46.976Z · comments (22)
[link] Practically A Book Review: Appendix to "Nonlinear's Evidence: Debunking False and Misleading Claims" (ThingOfThings)
tailcalled · 2024-01-03T17:07:13.990Z · comments (25)
[link] Gender Exploration
sapphire (deluks917) · 2024-01-14T18:57:32.893Z · comments (23)
Being nicer than Clippy
Joe Carlsmith (joekc) · 2024-01-16T19:44:23.893Z · comments (23)
' petertodd'’s last stand: The final days of open GPT-3 research
mwatkins · 2024-01-22T18:47:00.710Z · comments (15)
The case for more ambitious language model evals
Jozdien · 2024-01-30T00:01:13.876Z · comments (25)
2023 in AI predictions
jessicata (jessica.liu.taylor) · 2024-01-01T05:23:42.514Z · comments (34)
Deceptive AI ≠ Deceptively-aligned AI
Steven Byrnes (steve2152) · 2024-01-07T16:55:13.761Z · comments (19)
[link] Almost everyone I’ve met would be well-served thinking more about what to focus on
Henrik Karlsson (henrik-karlsson) · 2024-01-05T21:01:27.861Z · comments (8)
[link] RAND report finds no effect of current LLMs on viability of bioterrorism attacks
StellaAthena · 2024-01-25T19:17:30.493Z · comments (14)
On the abolition of man
Joe Carlsmith (joekc) · 2024-01-18T18:17:06.201Z · comments (18)
The Aspiring Rationalist Congregation
maia · 2024-01-10T22:52:54.298Z · comments (23)
Catching AIs red-handed
ryan_greenblatt · 2024-01-05T17:43:10.948Z · comments (18)
Sparse Autoencoders Work on Attention Layer Outputs
Connor Kissane (ckkissane) · 2024-01-16T00:26:14.767Z · comments (5)
An Introduction To The Mandelbrot Set That Doesn't Mention Complex Numbers
Yitz (yitz) · 2024-01-17T09:48:07.930Z · comments (11)
Survey of 2,778 AI authors: six parts in pictures
KatjaGrace · 2024-01-06T04:43:34.590Z · comments (1)
[link] Palworld development blog post
bhauth · 2024-01-28T05:56:19.984Z · comments (12)
Some Vacation Photos
johnswentworth · 2024-01-04T17:15:01.187Z · comments (0)
Universal Love Integration Test: Hitler
Raemon · 2024-01-10T23:55:35.526Z · comments (65)
When "yang" goes wrong
Joe Carlsmith (joekc) · 2024-01-08T16:35:50.607Z · comments (6)
[link] The True Story of How GPT-2 Became Maximally Lewd
Writer · 2024-01-18T21:03:08.167Z · comments (7)
[link] [Repost] The Copenhagen Interpretation of Ethics
mesaoptimizer · 2024-01-25T15:20:08.162Z · comments (3)
Epistemic Hell
rogersbacon · 2024-01-27T17:13:09.578Z · comments (20)
[link] InterLab – a toolkit for experiments with multi-agent interactions
Tomáš Gavenčiak (tomas-gavenciak) · 2024-01-22T18:23:35.661Z · comments (0)
OpenAI's Preparedness Framework: Praise & Recommendations
Akash (akash-wasil) · 2024-01-02T16:20:04.249Z · comments (1)
[link] The Perceptron Controversy
Yuxi_Liu · 2024-01-10T23:07:23.341Z · comments (18)
[question] Will quantum randomness affect the 2028 election?
Thomas Kwa (thomas-kwa) · 2024-01-24T22:54:30.800Z · answers+comments (48)
We need a Science of Evals
Marius Hobbhahn (marius-hobbhahn) · 2024-01-22T20:30:39.493Z · comments (13)