LessWrong 2.0 Reader

next page (older posts) →

[link] New Tool: the Residual Stream Viewer
AdamYedidia (babybeluga) · 2023-10-01T00:49:51.965Z · comments (7)
"Absence of Evidence is Not Evidence of Absence" As a Limit
transhumanist_atom_understander · 2023-10-01T08:15:28.852Z · comments (1)
[link] AI Safety Impact Markets: Your Charity Evaluator for AI Safety
Dawn Drescher (Telofy) · 2023-10-01T10:47:06.952Z · comments (5)
[link] Fifty Flips
abstractapplic · 2023-10-01T15:30:43.268Z · comments (14)
[link] Join AISafety.info's Distillation Hackathon (Oct 6-9th)
smallsilo (monstrologies) · 2023-10-01T18:43:43.359Z · comments (0)
[question] Looking for study
Robert Feinstein (robert-feinstein) · 2023-10-01T19:52:25.481Z · answers+comments (0)
AI Alignment Breakthroughs this Week [new substack]
Logan Zoellner (logan-zoellner) · 2023-10-01T22:13:48.589Z · comments (8)
Revisiting the Manifold Hypothesis
Aidan Rocke (aidanrocke) · 2023-10-01T23:55:56.704Z · comments (19)
Instrumental Convergence and human extinction.
Spiritus Dei (spiritus-dei) · 2023-10-02T00:41:29.952Z · comments (3)
Why I got the smallpox vaccine in 2023
joec · 2023-10-02T05:11:41.249Z · comments (6)
A Mathematical Model for Simulators
lukemarks (marc/er) · 2023-10-02T06:46:31.702Z · comments (0)
[link] Linkpost: They Studied Dishonesty. Was Their Work a Lie?
Linch · 2023-10-02T08:10:51.857Z · comments (12)
The 99% principle for personal problems
Kaj_Sotala · 2023-10-02T08:20:07.379Z · comments (20)
Direction of Fit
NicholasKees (nick_kees) · 2023-10-02T12:34:24.385Z · comments (0)
[link] energy landscapes of experts
bhauth · 2023-10-02T14:08:32.370Z · comments (2)
Will early transformative AIs primarily use text? [Manifold question]
Fabien Roger (Fabien) · 2023-10-02T15:05:07.279Z · comments (0)
A counterexample for measurable factor spaces
Matthias G. Mayer (matthias-georg-mayer) · 2023-10-02T15:16:48.418Z · comments (0)
Expectations for Gemini: hopefully not a big deal
Maxime Riché (maxime-riche) · 2023-10-02T15:38:32.834Z · comments (5)
Population After a Catastrophe
Stan Pinsent (stan-pinsent) · 2023-10-02T16:06:56.614Z · comments (5)
Thomas Kwa's MIRI research experience
Thomas Kwa (thomas-kwa) · 2023-10-02T16:42:37.886Z · comments (52)
[link] Dall-E 3
p.b. · 2023-10-02T20:33:18.294Z · comments (9)
My Mid-Career Transition into Biosecurity
jefftk (jkaufman) · 2023-10-02T21:20:06.768Z · comments (4)
Some Quick Follow-Up Experiments to “Taken out of context: On measuring situational awareness in LLMs”
miles · 2023-10-03T02:22:00.199Z · comments (0)
Early Experiments in Reward Model Interpretation Using Sparse Autoencoders
lukemarks (marc/er) · 2023-10-03T07:45:15.228Z · comments (0)
Mech Interp Challenge: October - Deciphering the Sorted List Model
CallumMcDougall (TheMcDouglas) · 2023-10-03T10:57:29.598Z · comments (0)
Why We Use Money? - A Walrasian View
Savio Coelho (Will_Crowley) · 2023-10-03T12:02:37.312Z · comments (3)
Monthly Roundup #11: October 2023
Zvi · 2023-10-03T14:10:01.686Z · comments (12)
[question] Potential alignment targets for a sovereign superintelligent AI
Paul Colognese (paul-colognese) · 2023-10-03T15:09:59.529Z · answers+comments (4)
What would it mean to understand how a large language model (LLM) works? Some quick notes.
Bill Benzon (bill-benzon) · 2023-10-03T15:11:13.508Z · comments (4)
[link] Metaculus Announces Forecasting Tournament to Evaluate Focused Research Organizations, in Partnership With the Federation of American Scientists
ChristianWilliams · 2023-10-03T16:44:17.620Z · comments (0)
[link] Testing and Automation for Intelligent Systems.
Sai Kiran Kammari (sai-kiran-kammari) · 2023-10-03T17:51:08.796Z · comments (0)
[question] Current AI safety techniques?
Zach Stein-Perlman · 2023-10-03T19:30:54.481Z · answers+comments (2)
OpenAI-Microsoft partnership
Zach Stein-Perlman · 2023-10-03T20:01:44.795Z · comments (18)
When to Get the Booster?
jefftk (jkaufman) · 2023-10-03T21:00:12.813Z · comments (15)
AXRP Episode 25 - Cooperative AI with Caspar Oesterheld
DanielFilan · 2023-10-03T21:50:07.552Z · comments (0)
[question] Who determines whether an alignment proposal is the definitive alignment solution?
MiguelDev (whitehatStoic) · 2023-10-03T22:39:23.700Z · answers+comments (6)
[link] Bay Area Winter Solstice 2023
tcheasdfjkl · 2023-10-04T02:19:56.284Z · comments (3)
Graphical tensor notation for interpretability
Jordan Taylor (Nadroj) · 2023-10-04T08:04:33.341Z · comments (11)
[question] What are some examples of AIs instantiating the 'nearest unblocked strategy problem'?
EJT (ElliottThornley) · 2023-10-04T11:05:34.537Z · answers+comments (4)
[link] Why a Mars colony would lead to a first strike situation
Remmelt (remmelt-ellen) · 2023-10-04T11:29:53.679Z · comments (8)
Entanglement and intuition about words and meaning
Bill Benzon (bill-benzon) · 2023-10-04T14:16:29.713Z · comments (0)
[question] What evidence is there of LLM's containing world models?
Chris_Leong · 2023-10-04T14:33:19.178Z · answers+comments (17)
I don’t find the lie detection results that surprising (by an author of the paper)
JanB (JanBrauner) · 2023-10-04T17:10:51.262Z · comments (8)
[link] AISN #23: New OpenAI Models, News from Anthropic, and Representation Engineering
aogara (Aidan O'Gara) · 2023-10-04T17:37:19.564Z · comments (2)
rationalistic probability(litterally just throwing shit out there)
NotaSprayer ASprayer (notasprayer-asprayer) · 2023-10-04T17:46:24.500Z · comments (8)
[question] Using Reinforcement Learning to try to control the heating of a building (district heating)
Tony Karlsson (tony-karlsson) · 2023-10-04T17:47:17.294Z · answers+comments (5)
The 5 Pillars of Happiness
Gabi QUENE · 2023-10-04T17:50:40.633Z · comments (5)
Safeguarding Humanity: Ensuring AI Remains a Servant, Not a Master
kgldeshapriya · 2023-10-04T17:52:40.436Z · comments (2)
[link] Open Philanthropy is hiring for multiple roles across our Global Catastrophic Risks teams
aarongertler · 2023-10-04T18:04:25.388Z · comments (0)
PortAudio M1 Latency
jefftk (jkaufman) · 2023-10-04T19:10:13.021Z · comments (5)