LessWrong 2.0 Reader

SolidGoldMagikarp (plus, prompt generation)
Jessica Rumbelow (jessica-cooper) · 2023-02-05T22:02:35.854Z · comments (204)

Focus on the places where you feel shocked everyone's dropping the ball
So8res · 2023-02-02T00:27:55.687Z · comments (61)

Bing Chat is blatantly, aggressively misaligned
evhub · 2023-02-15T05:29:45.262Z · comments (170)

Noting an error in Inadequate Equilibria
Matthew Barnett (matthew-barnett) · 2023-02-08T01:33:33.715Z · comments (56)

Please don't throw your mind away
TsviBT · 2023-02-15T21:41:05.988Z · comments (44)

Cyborgism
NicholasKees (nick_kees) · 2023-02-10T14:47:48.172Z · comments (46)

[link] Childhoods of exceptional people
Henrik Karlsson (henrik-karlsson) · 2023-02-06T17:27:09.596Z · comments (62)

Fucking Goddamn Basics of Rationalist Discourse
LoganStrohl (BrienneYudkowsky) · 2023-02-04T01:47:32.578Z · comments (97)

[link] I hired 5 people to sit behind me and make me productive for a month
Simon Berens (sberens) · 2023-02-05T01:19:39.182Z · comments (81)

You Don't Exist, Duncan
[DEACTIVATED] Duncan Sabien (Duncan_Sabien) · 2023-02-02T08:37:01.049Z · comments (107)

[link] AGI in sight: our look at the game board
Andrea_Miotti (AndreaM) · 2023-02-18T22:17:44.364Z · comments (135)

Elements of Rationalist Discourse
Rob Bensinger (RobbBB) · 2023-02-12T07:58:42.479Z · comments (47)

Cognitive Emulation: A Naive AI Safety Proposal
Connor Leahy (NPCollapse) · 2023-02-25T19:35:02.409Z · comments (45)

AI alignment researchers don't (seem to) stack
So8res · 2023-02-21T00:48:25.186Z · comments (40)

EigenKarma: trust at scale
Henrik Karlsson (henrik-karlsson) · 2023-02-08T18:52:24.490Z · comments (50)

Why Are Bacteria So Simple?
aysja · 2023-02-06T03:00:31.837Z · comments (33)

AI #1: Sydney and Bing
Zvi · 2023-02-21T14:00:00.480Z · comments (44)

My understanding of Anthropic strategy
Swimmer963 (Miranda Dixon-Luinenburg) (Swimmer963) · 2023-02-15T01:56:40.961Z · comments (31)

[link] Parametrically retargetable decision-makers tend to seek power
TurnTrout · 2023-02-18T18:41:38.740Z · comments (9)

[link] [Link] A community alert about Ziz
DanielFilan · 2023-02-24T00:06:00.027Z · comments (126)

Big Mac Subsidy?
jefftk (jkaufman) · 2023-02-23T04:00:03.996Z · comments (24)

Stop posting prompt injections on Twitter and calling it "misalignment"
lc · 2023-02-19T02:21:44.061Z · comments (9)

[link] We Found An Neuron in GPT-2
Joseph Miller (Josephm) · 2023-02-11T18:27:29.410Z · comments (22)

Full Transcript: Eliezer Yudkowsky on the Bankless podcast
remember · 2023-02-23T12:34:19.523Z · comments (89)

[link] Anomalous tokens reveal the original identities of Instruct models
janus · 2023-02-09T01:30:56.609Z · comments (16)

Modal Fixpoint Cooperation without Löb's Theorem
Andrew_Critch · 2023-02-05T00:58:40.975Z · comments (32)

Pretraining Language Models with Human Preferences
Tomek Korbak (tomek-korbak) · 2023-02-21T17:57:09.774Z · comments (18)

"Rationalist Discourse" Is Like "Physicist Motors"
Zack_M_Davis · 2023-02-26T05:58:29.249Z · comments (152)

Evaluations (of new AI Safety researchers) can be noisy
LawrenceC (LawChan) · 2023-02-05T04:15:02.117Z · comments (10)

Hashing out long-standing disagreements seems low-value to me
So8res · 2023-02-16T06:20:00.899Z · comments (34)

Recommendation: Bug Bounties and Responsible Disclosure for Advanced ML Systems
Vaniver · 2023-02-17T20:11:39.255Z · comments (11)

In Defense of Chatbot Romance
Kaj_Sotala · 2023-02-11T14:30:05.696Z · comments (52)

There are (probably) no superhuman Go AIs: strong human players beat the strongest AIs
Taran · 2023-02-19T12:25:52.212Z · comments (33)

A proposed method for forecasting transformative AI
Matthew Barnett (matthew-barnett) · 2023-02-10T19:34:01.358Z · comments (21)

There are no coherence theorems
Dan H (dan-hendrycks) · 2023-02-20T21:25:48.478Z · comments (114)

One-layer transformers aren’t equivalent to a set of skip-trigrams
Buck · 2023-02-17T17:26:13.819Z · comments (10)

GPT-175bee
Adam Scherlis (adam-scherlis) · 2023-02-08T18:58:01.364Z · comments (13)

On Investigating Conspiracy Theories
Zvi · 2023-02-20T12:50:00.891Z · comments (38)

The public supports regulating AI for safety
Zach Stein-Perlman · 2023-02-17T04:10:03.307Z · comments (9)

The Open Agency Model
Eric Drexler · 2023-02-22T10:35:12.316Z · comments (18)

Bing chat is the AI fire alarm
Ratios · 2023-02-17T06:51:51.551Z · comments (62)

GPT-4 Predictions
Stephen McAleese (stephen-mcaleese) · 2023-02-17T23:20:24.696Z · comments (27)

SolidGoldMagikarp II: technical details and more recent findings
mwatkins · 2023-02-06T19:09:01.406Z · comments (45)

A Way To Be Okay
[DEACTIVATED] Duncan Sabien (Duncan_Sabien) · 2023-02-19T20:27:10.061Z · comments (36)

Conflict Theory of Bounded Distrust
Zack_M_Davis · 2023-02-12T05:30:30.760Z · comments (29)

I don't think MIRI "gave up"
Raemon · 2023-02-03T00:26:07.552Z · comments (64)

[link] Sam Altman: "Planning for AGI and beyond"
LawrenceC (LawChan) · 2023-02-24T20:28:00.430Z · comments (54)

Cyborg Periods: There will be multiple AI transitions
Jan_Kulveit · 2023-02-22T16:09:04.858Z · comments (9)

Don't accelerate problems you're trying to solve
Andrea_Miotti (AndreaM) · 2023-02-15T18:11:30.595Z · comments (26)

H5N1
Zvi · 2023-02-13T12:50:00.694Z · comments (1)

Say Alice has lived her whole life in a room with a single button. People from the outside told her that pressing the button would create nice paintings. Throughout her life, they provided an exhaustive array of proofs and confirmations of this fact. Unbeknownst to her, this was all an elaborate scheme, and in reality pressing the button destroys nice paintings. Alice, liking paintings, regularly presses the button.
A naive application of Vanessa's criterion would impute to Alice the goal of destroying paintings. To avoid this, we somehow need to integrate over all the possible worlds Alice could find herself in, and realize that, when you are presented with an exhaustive array of proofs and confirmations that the button creates paintings, it is on average more likely that the button creates paintings than that it destroys them.
But we face a decision. Either we fix a single prior for this integration and use it for all agents, in which case every agent with a different prior will look silly to us. Or we somehow try to extract the agent's own prior, and we're back at ontology identification.

(Disclaimer: This was the SOTA understanding a year ago; I'm unsure whether it still is.)
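
The Alice example can be made concrete with a toy calculation. The sketch below is only an illustration under stated assumptions: the two-hypothesis world model, the likelihood numbers, and the names `posterior` and `expected_paintings` are all hypothetical and are not part of Vanessa's criterion. It contrasts scoring Alice by the actual world with averaging over the worlds consistent with her evidence, and shows how the verdict flips when the evaluator fixes a very different prior.

```python
# Toy model of the Alice example. Every number here is an illustrative assumption.

# Assumed probability that Alice is shown an exhaustive array of proofs that
# the button creates paintings, under each hypothesis about the button.
LIKELIHOOD_OF_EVIDENCE = {"creates": 0.9, "destroys": 0.01}

# Change in the number of nice paintings per button press in each world.
PAINTINGS_PER_PRESS = {"creates": +1, "destroys": -1}


def posterior(prior, likelihood):
    """Bayes' rule over a finite set of hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


def expected_paintings(belief):
    """Expected change in paintings per press, averaged over possible worlds."""
    return sum(belief[h] * PAINTINGS_PER_PRESS[h] for h in belief)


# Naive reading: judge Alice only by what the button does in the actual world.
naive_effect = PAINTINGS_PER_PRESS["destroys"]

# Averaged reading: use a fixed (here, uniform) prior and condition on her evidence.
uniform_prior = {"creates": 0.5, "destroys": 0.5}
belief = posterior(uniform_prior, LIKELIHOOD_OF_EVIDENCE)
averaged_effect = expected_paintings(belief)

# An evaluator who fixes a very different prior reaches the opposite verdict,
# which is the "agents with a different prior look silly" horn of the dilemma.
skeptical_prior = {"creates": 0.0001, "destroys": 0.9999}
skeptical_effect = expected_paintings(posterior(skeptical_prior, LIKELIHOOD_OF_EVIDENCE))

print(naive_effect, averaged_effect, skeptical_effect)
```

Under the uniform prior, pressing the button comes out as painting-creating in expectation, so Alice looks like someone who likes paintings. Under the skeptical prior the same behavior looks painting-destroying, which is why the choice of prior cannot be avoided.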
