LessWrong 2.0 Reader

How useful is mechanistic interpretability?
ryan_greenblatt · 2023-12-01T02:54:53.488Z · comments (53)
[question] Is OpenAI losing money on each request?
thenoviceoof · 2023-12-01T03:27:23.929Z · answers+comments (8)
Reinforcement Learning using Layered Morphology (RLLM)
MiguelDev (whitehatStoic) · 2023-12-01T05:18:58.162Z · comments (0)
Reality is whatever you can get away with.
sometimesperson (fake-name) · 2023-12-01T07:50:00.382Z · comments (0)
How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of "Scheming AIs")
Joe Carlsmith (joekc) · 2023-12-01T14:51:04.624Z · comments (1)
Worlds where I wouldn't worry about AI risk
adekcz (michal-keda) · 2023-12-01T16:06:54.199Z · comments (0)
[link] Why Did NEPA Peak in 2016?
Maxwell Tabarrok (maxwell-tabarrok) · 2023-12-01T16:18:35.435Z · comments (0)
Thoughts on “AI is easy to control” by Pope & Belrose
Steven Byrnes (steve2152) · 2023-12-01T17:30:52.720Z · comments (55)
Kolmogorov Complexity Lays Bare the Soul
jakej (jake-jenks) · 2023-12-01T18:29:57.379Z · comments (8)
[link] Researchers and writers can apply for proxy access to the GPT-3.5 base model (code-davinci-002)
ampdot · 2023-12-01T18:48:01.406Z · comments (0)
[link] Queuing theory: Benefits of operating at 60% capacity
ampdot · 2023-12-01T18:48:01.426Z · comments (4)
[link] Carving up problems at their joints
Jakub Smékal (jakub-smekal) · 2023-12-01T18:48:46.510Z · comments (0)
[link] Specification Gaming: How AI Can Turn Your Wishes Against You [RA Video]
Writer · 2023-12-01T19:30:58.304Z · comments (0)
Please Bet On My Quantified Self Decision Markets
niplav · 2023-12-01T20:07:38.284Z · comments (6)
Benchmarking Bowtie2 Threading
jefftk (jkaufman) · 2023-12-01T20:20:05.593Z · comments (0)
[question] Could there be "natural impact regularization" or "impact regularization by default"?
tailcalled · 2023-12-01T22:01:46.062Z · answers+comments (6)
Complex systems research as a field (and its relevance to AI Alignment)
Nora_Ammann · 2023-12-01T22:10:25.801Z · comments (9)
MATS Summer 2023 Retrospective
Rocket (utilistrutil) · 2023-12-01T23:29:47.958Z · comments (34)
South Bay Pre-Holiday Gathering
IS (is) · 2023-12-02T03:21:12.904Z · comments (2)
Protecting against sudden capability jumps during training
nikola (nikolaisalreadytaken) · 2023-12-02T04:22:21.315Z · comments (0)
2023 Unofficial LessWrong Census/Survey
Screwtape · 2023-12-02T04:41:51.418Z · comments (81)
[question] What is known about invariants in self-modifying systems?
mishka · 2023-12-02T05:04:19.299Z · answers+comments (2)
List of strategies for mitigating deceptive alignment
joshc (joshua-clymer) · 2023-12-02T05:56:50.867Z · comments (2)
After Alignment — Dialogue between RogerDearnaley and Seth Herd
RogerDearnaley (roger-d-1) · 2023-12-02T06:03:17.456Z · comments (2)
Out-of-distribution Bioattacks
jefftk (jkaufman) · 2023-12-02T12:20:05.626Z · comments (15)
Taking Into Account Sentient Non-Humans in AI Ambitious Value Learning: Sentientist Coherent Extrapolated Volition
Adrià Moret (Adrià R. Moret) · 2023-12-02T14:07:29.992Z · comments (31)
The Method of Loci: With some brief remarks, including transformers and evaluating AIs
Bill Benzon (bill-benzon) · 2023-12-02T14:36:47.077Z · comments (0)
The goal-guarding hypothesis (Section 2.3.1.1 of "Scheming AIs")
Joe Carlsmith (joekc) · 2023-12-02T15:20:28.152Z · comments (1)
Sherlockian Abduction Master List
Cole Wyeth (Amyr) · 2023-12-02T22:10:21.848Z · comments (31)
Quick takes on "AI is easy to control"
So8res · 2023-12-02T22:31:45.683Z · comments (49)
Book Review: 1948 by Benny Morris
Yair Halberstadt (yair-halberstadt) · 2023-12-03T10:29:16.696Z · comments (9)
The benefits and risks of optimism (about AI safety)
Karl von Wendt · 2023-12-03T12:45:12.269Z · comments (6)
[question] How do you do post mortems?
matto · 2023-12-03T14:46:03.521Z · answers+comments (2)
Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of "Scheming AIs")
Joe Carlsmith (joekc) · 2023-12-03T18:32:42.748Z · comments (0)
[link] The Witness
Richard_Ngo (ricraz) · 2023-12-03T22:27:16.248Z · comments (4)
[link] Meditations on Mot
Richard_Ngo (ricraz) · 2023-12-04T00:19:19.522Z · comments (11)
[link] Nietzsche's Morality in Plain English
Arjun Panickssery (arjun-panickssery) · 2023-12-04T00:57:42.839Z · comments (13)
[link] the micro-fulfillment cambrian explosion
bhauth · 2023-12-04T01:15:34.342Z · comments (5)
Disappointing Table Refinishing
jefftk (jkaufman) · 2023-12-04T02:50:07.914Z · comments (3)
FTL travel summary
Isaac King (KingSupernova) · 2023-12-04T05:17:21.422Z · comments (3)
A call for a quantitative report card for AI bioterrorism threat models
Juno (translunar) · 2023-12-04T06:35:14.489Z · comments (0)
[link] Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation
Paul Bricman (paulbricman) · 2023-12-04T07:31:48.726Z · comments (6)
South Bay Meetup 12/9
David Friedman (david-friedman) · 2023-12-04T07:32:26.619Z · comments (0)
[Valence series] 1. Introduction
Steven Byrnes (steve2152) · 2023-12-04T15:40:21.274Z · comments (14)
Updates to Open Phil’s career development and transition funding program
abergal · 2023-12-04T18:10:29.394Z · comments (0)
6. The Mutable Values Problem in Value Learning and CEV
RogerDearnaley (roger-d-1) · 2023-12-04T18:31:22.080Z · comments (0)
Non-classic stories about scheming (Section 2.3.2 of "Scheming AIs")
Joe Carlsmith (joekc) · 2023-12-04T18:44:32.825Z · comments (0)
Planning in LLMs: Insights from AlphaGo
jco · 2023-12-04T18:48:57.508Z · comments (10)
Agents which are EU-maximizing as a group are not EU-maximizing individually
Mlxa · 2023-12-04T18:49:08.708Z · comments (2)
Mechanistic interpretability through clustering
Alistair Fraser (alistair-fraser) · 2023-12-04T18:49:26.777Z · comments (0)
next page (older posts) →