LessWrong 2.0 Reader

Against Almost Every Theory of Impact of Interpretability
Charbel-Raphaël (charbel-raphael-segerie) · 2023-08-17T18:44:41.099Z · comments (83)
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
evhub · 2023-08-08T01:30:10.847Z · comments (26)
Dear Self; we need to talk about ambition
Elizabeth (pktechgirl) · 2023-08-27T23:10:04.720Z · comments (25)
My current LK99 questions
Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2023-08-01T22:48:00.733Z · comments (38)
[link] Large Language Models will be Great for Censorship
Ethan Edwards · 2023-08-21T19:03:55.323Z · comments (14)
[link] OpenAI API base models are not sycophantic, at any size
nostalgebraist · 2023-08-29T00:58:29.007Z · comments (19)
Feedbackloop-first Rationality
Raemon · 2023-08-07T17:58:56.349Z · comments (65)
A list of core AI safety problems and how I hope to solve them
davidad · 2023-08-26T15:12:18.484Z · comments (26)
[link] ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks
Beth Barnes (beth-barnes) · 2023-08-01T18:30:57.068Z · comments (12)
The "public debate" about AI is confusing for the general public and for policymakers because it is a three-sided debate
Adam David Long (adam-david-long-1) · 2023-08-01T00:08:30.908Z · comments (30)
6 non-obvious mental health issues specific to AI safety
Igor Ivanov (igor-ivanov) · 2023-08-18T15:46:09.938Z · comments (24)
Password-locked models: a stress case for capabilities evaluation
Fabien Roger (Fabien) · 2023-08-03T14:53:12.459Z · comments (14)
Responses to apparent rationalist confusions about game / decision theory
Anthony DiGiovanni (antimonyanthony) · 2023-08-30T22:02:12.218Z · comments (14)
Inflection.ai is a major AGI lab
nikola (nikolaisalreadytaken) · 2023-08-09T01:05:54.604Z · comments (13)
The U.S. is becoming less stable
lc · 2023-08-18T21:13:11.909Z · comments (66)
[link] Ten Thousand Years of Solitude
agp (antonio-papa) · 2023-08-15T17:45:34.556Z · comments (17)
Book Launch: "The Carving of Reality," Best of LessWrong vol. III
Raemon · 2023-08-16T23:52:12.518Z · comments (22)
Invulnerable Incomplete Preferences: A Formal Statement
Sami Petersen (sami-petersen) · 2023-08-30T21:59:36.186Z · comments (32)
[link] Report on Frontier Model Training
YafahEdelman (yafah-edelman-1) · 2023-08-30T20:02:46.317Z · comments (21)
[link] Introducing the Center for AI Policy (& we're hiring!)
Thomas Larsen (thomas-larsen) · 2023-08-28T21:17:11.703Z · comments (50)
[link] When discussing AI risks, talk about capabilities, not intelligence
Vika · 2023-08-11T13:38:48.844Z · comments (7)
Assume Bad Faith
Zack_M_Davis · 2023-08-25T17:36:32.678Z · comments (52)
Summary of and Thoughts on the Hotz/Yudkowsky Debate
Zvi · 2023-08-16T16:50:02.808Z · comments (47)
Biosecurity Culture, Computer Security Culture
jefftk (jkaufman) · 2023-08-30T16:40:03.101Z · comments (10)
A Theory of Laughter
Steven Byrnes (steve2152) · 2023-08-23T15:05:59.694Z · comments (13)
What's A "Market"?
johnswentworth · 2023-08-08T23:29:24.722Z · comments (16)
[link] Biological Anchors: The Trick that Might or Might Not Work
Scott Alexander (Yvain) · 2023-08-12T00:53:30.159Z · comments (3)
[link] LTFF and EAIF are unusually funding-constrained right now
Linch · 2023-08-30T01:03:30.321Z · comments (24)
Problems with Robin Hanson's Quillette Article On AI
DaemonicSigil · 2023-08-06T22:13:43.654Z · comments (33)
We Should Prepare for a Larger Representation of Academia in AI Safety
Leon Lang (leon-lang) · 2023-08-13T18:03:19.799Z · comments (13)
[question] Exercise: Solve "Thinking Physics"
Raemon · 2023-08-01T00:44:48.975Z · answers+comments (23)
Dating Roundup #1: This is Why You’re Single
Zvi · 2023-08-29T12:50:04.964Z · comments (27)
My checklist for publishing a blog post
Steven Byrnes (steve2152) · 2023-08-15T15:04:56.219Z · comments (6)
Decomposing independent generalizations in neural networks via Hessian analysis
Dmitry Vaintrob (dmitry-vaintrob) · 2023-08-14T17:04:40.071Z · comments (3)
Stepping down as moderator on LW
Kaj_Sotala · 2023-08-14T10:46:58.163Z · comments (1)
Long-Term Future Fund: April 2023 grant recommendations
abergal · 2023-08-02T07:54:49.083Z · comments (3)
AI pause/governance advocacy might be net-negative, especially without focus on explaining the x-risk
Mikhail Samin (mikhail-samin) · 2023-08-27T23:05:01.718Z · comments (9)
The Low-Hanging Fruit Prior and sloped valleys in the loss landscape
Dmitry Vaintrob (dmitry-vaintrob) · 2023-08-23T21:12:58.599Z · comments (1)
The Economics of the Asteroid Deflection Problem (Dominant Assurance Contracts)
moyamo · 2023-08-29T18:28:54.015Z · comments (70)
The God of Humanity, and the God of the Robot Utilitarians
Raemon · 2023-08-24T08:27:57.396Z · comments (12)
An Interpretability Illusion for Activation Patching of Arbitrary Subspaces
Georg Lange (GeorgLange) · 2023-08-29T01:04:18.688Z · comments (4)
Computational Thread Art
CallumMcDougall (TheMcDouglas) · 2023-08-06T21:42:30.306Z · comments (2)
Digital brains beat biological ones because diffusion is too slow
GeneSmith · 2023-08-26T02:22:25.014Z · comments (21)
A plea for more funding shortfall transparency
porby · 2023-08-07T21:33:11.912Z · comments (4)
[link] A Proof of Löb's Theorem using Computability Theory
jessicata (jessica.liu.taylor) · 2023-08-16T18:57:41.048Z · comments (0)
3 levels of threat obfuscation
HoldenKarnofsky · 2023-08-02T14:58:32.506Z · comments (14)
[link] Barriers to Mechanistic Interpretability for AGI Safety
Connor Leahy (NPCollapse) · 2023-08-29T10:56:45.639Z · comments (13)
Modulating sycophancy in an RLHF model via activation steering
Nina Rimsky (NinaR) · 2023-08-09T07:06:50.859Z · comments (20)
Managing risks of our own work
Beth Barnes (beth-barnes) · 2023-08-18T00:41:30.832Z · comments (0)
State of Generally Available Self-Driving
jefftk (jkaufman) · 2023-08-22T18:50:01.166Z · comments (6)