LessWrong 2.0 Reader

Sequencing Swabs
jefftk (jkaufman) · 2024-02-01T01:50:17.938Z · comments (1)
Drone Wars Endgame
RussellThor · 2024-02-01T02:30:46.161Z · comments (71)
PIBBSS Speaker events coming up in February
DusanDNesic · 2024-02-01T03:28:24.971Z · comments (2)
Increasingly vague interpersonal welfare comparisons
MichaelStJules · 2024-02-01T06:45:30.160Z · comments (0)
[link] Some Notes on Ethics
Pareto Optimal · 2024-02-01T10:18:44.502Z · comments (0)
AI #49: Bioweapon Testing Begins
Zvi · 2024-02-01T15:30:04.690Z · comments (11)
Putting multimodal LLMs to the Tetris test
Lovre · 2024-02-01T16:02:12.367Z · comments (5)
Managing risks while trying to do good
Wei Dai (Wei_Dai) · 2024-02-01T18:08:46.506Z · comments (26)
The economy is mostly newbs (strat predictions)
lukehmiles (lcmgcd) · 2024-02-01T19:15:49.420Z · comments (6)
On Not Requiring Vaccination
jefftk (jkaufman) · 2024-02-01T19:20:12.657Z · comments (21)
Wrong answer bias
lukehmiles (lcmgcd) · 2024-02-01T20:05:38.573Z · comments (24)
[link] OpenAI report also finds no effect of current LLMs on viability of bioterrorism attacks
lberglund (brglnd) · 2024-02-01T20:18:19.418Z · comments (4)
Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis
RogerDearnaley (roger-d-1) · 2024-02-01T21:15:56.968Z · comments (15)
[link] Davidad's Provably Safe AI Architecture - ARIA's Programme Thesis
simeon_c (WayZ) · 2024-02-01T21:30:44.090Z · comments (17)
[link] Evaluating Stability of Unreflective Alignment
james.lucassen · 2024-02-01T22:15:40.902Z · comments (3)
[link] Running a Prediction Market Mafia Game
Arjun Panickssery (arjun-panickssery) · 2024-02-01T23:24:27.659Z · comments (5)
[link] Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities
porby · 2024-02-02T05:49:11.189Z · comments (1)
Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
Joseph Bloom (Jbloom) · 2024-02-02T06:54:53.392Z · comments (37)
Types of subjective welfare
MichaelStJules · 2024-02-02T09:56:34.284Z · comments (3)
[link] Manifold Markets
PeterMcCluskey · 2024-02-02T17:48:36.630Z · comments (9)
[link] Solving alignment isn't enough for a flourishing future
mic (michael-chen) · 2024-02-02T18:23:00.643Z · comments (0)
What Failure Looks Like is not an existential risk (and alignment is not the solution)
otto.barten (otto-barten) · 2024-02-02T18:59:38.346Z · comments (12)
[link] Most experts believe COVID-19 was probably not a lab leak
DanielFilan · 2024-02-02T19:28:00.319Z · comments (89)
On Dwarkesh’s 3rd Podcast With Tyler Cowen
Zvi · 2024-02-02T19:30:05.974Z · comments (9)
Voting Results for the 2022 Review
Ben Pace (Benito) · 2024-02-02T20:34:59.768Z · comments (3)
Survey for alignment researchers!
Cameron Berg (cameron-berg) · 2024-02-02T20:41:44.323Z · comments (11)
Announcing the London Initiative for Safe AI (LISA)
James Fox · 2024-02-02T23:17:47.011Z · comments (0)
Why do we need RLHF? Imitation, Inverse RL, and the role of reward
Ran W (ran-wei) · 2024-02-03T04:00:27.069Z · comments (0)
Attention SAEs Scale to GPT-2 Small
Connor Kissane (ckkissane) · 2024-02-03T06:50:22.583Z · comments (4)
Why I no longer identify as transhumanist
Kaj_Sotala · 2024-02-03T12:00:04.389Z · comments (33)
Finite Factored Sets to Bayes Nets Part 2
J Bostock (Jemist) · 2024-02-03T12:25:41.444Z · comments (0)
[link] Practicing my Handwriting in 1439
Maxwell Tabarrok (maxwell-tabarrok) · 2024-02-03T13:21:37.331Z · comments (0)
Attitudes about Applied Rationality
Camille Berger (Camille Berger) · 2024-02-03T14:42:22.770Z · comments (18)
[link] The Journal of Dangerous Ideas
rogersbacon · 2024-02-03T15:40:18.992Z · comments (4)
My thoughts on the Beff Jezos - Connor Leahy debate
Ariel Kwiatkowski (ariel-kwiatkowski) · 2024-02-03T19:47:08.326Z · comments (23)
Brute Force Manufactured Consensus is Hiding the Crime of the Century
Roko · 2024-02-03T20:36:59.806Z · comments (156)
A sketch of acausal trade in practice
Richard_Ngo (ricraz) · 2024-02-04T00:32:54.622Z · comments (4)
Personal predictions
Daniele De Nuntiis (daniele-de-nuntiis) · 2024-02-04T03:59:28.537Z · comments (2)
Vitalia Rationality Meetup
veronica (morerational) · 2024-02-04T19:46:12.701Z · comments (0)
EA/ACX/LW February Santa Cruz Meetup
madmail · 2024-02-04T23:26:00.688Z · comments (0)
Noticing Panic
Cole Wyeth (Amyr) · 2024-02-05T03:45:51.794Z · comments (8)
A thought experiment for comparing "biological" vs "digital" intelligence increase/explosion
Super AGI (super-agi) · 2024-02-05T04:57:18.211Z · comments (3)
[question] How has internalising a post-AGI world affected your current choices?
yanni kyriacos (yanni) · 2024-02-05T05:43:14.082Z · answers+comments (8)
Safe Stasis Fallacy
Davidmanheim · 2024-02-05T10:54:44.061Z · comments (2)
AI alignment as a translation problem
Roman Leventov · 2024-02-05T14:14:15.060Z · comments (2)
Implementing activation steering
Annah (annah) · 2024-02-05T17:51:55.851Z · comments (5)
Value learning in the absence of ground truth
Joel_Saarinen (joel_saarinen) · 2024-02-05T18:56:02.260Z · comments (8)
[link] Things You’re Allowed to Do: University Edition
Saul Munn (saul-munn) · 2024-02-06T00:36:11.690Z · comments (13)
Toy models of AI control for concentrated catastrophe prevention
Fabien Roger (Fabien) · 2024-02-06T01:38:19.865Z · comments (2)
Selfish AI Inevitable
Davey Morse (davey-morse) · 2024-02-06T04:29:07.874Z · comments (0)