LessWrong 2.0 Reader


A hundredth of a bit of extra entropy
Adam Scherlis (adam-scherlis) · 2022-12-24T21:12:41.517Z · comments (4)
Reflections on my 5-month alignment upskilling grant
Jay Bailey · 2022-12-27T10:51:49.872Z · comments (4)
Three reasons to cooperate
paulfchristiano · 2022-12-24T17:40:01.114Z · comments (14)
Results from a survey on tool use and workflows in alignment research
jacquesthibs (jacques-thibodeau) · 2022-12-19T15:19:52.560Z · comments (2)
An Open Agency Architecture for Safe Transformative AI
davidad · 2022-12-20T13:04:06.409Z · comments (22)
Probably good projects for the AI safety ecosystem
Ryan Kidd (ryankidd44) · 2022-12-05T02:26:41.623Z · comments (31)
MrBeast's Squid Game Tricked Me
lsusr · 2022-12-03T05:50:02.339Z · comments (1)
10 Years of LessWrong
JohnBuridan · 2022-12-30T17:15:17.498Z · comments (2)
«Boundaries», Part 3b: Alignment problems in terms of boundaries
Andrew_Critch · 2022-12-14T22:34:41.443Z · comments (7)
[question] Who are some prominent reasonable people who are confident that AI won't kill everyone?
Optimization Process · 2022-12-05T09:12:41.797Z · answers+comments (54)
AI Safety Seems Hard to Measure
HoldenKarnofsky · 2022-12-08T19:50:07.352Z · comments (6)
On sincerity
Joe Carlsmith (joekc) · 2022-12-23T17:13:09.478Z · comments (6)
The True Spirit of Solstice?
Raemon · 2022-12-19T08:00:30.273Z · comments (31)
[link] Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)
LawrenceC (LawChan) · 2022-12-16T22:12:54.461Z · comments (11)
Proper scoring rules don’t guarantee predicting fixed points
Johannes Treutlein (Johannes_Treutlein) · 2022-12-16T18:22:23.547Z · comments (8)
It's time to worry about online privacy again
Malmesbury (Elmer of Malmesbury) · 2022-12-25T21:05:30.977Z · comments (23)
AI Neorealism: a threat model & success criterion for existential safety
davidad · 2022-12-15T13:42:11.072Z · comments (1)
Can we efficiently explain model behaviors?
paulfchristiano · 2022-12-16T19:40:06.327Z · comments (3)
AGI Timelines in Governance: Different Strategies for Different Timeframes
simeon_c (WayZ) · 2022-12-19T21:31:25.746Z · comments (28)
Systems of Survival
Vaniver · 2022-12-09T05:13:53.064Z · comments (5)
Key Mostly Outward-Facing Facts From the Story of VaccinateCA
Zvi · 2022-12-14T13:30:00.831Z · comments (2)
[link] Summary of a new study on out-group hate (and how to fix it)
DirectedEvolution (AllAmericanBreakfast) · 2022-12-04T01:53:32.490Z · comments (30)
Update on Harvard AI Safety Team and MIT AI Alignment
Xander Davies (xanderdavies) · 2022-12-02T00:56:45.596Z · comments (4)
Verification Is Not Easier Than Generation In General
johnswentworth · 2022-12-06T05:20:48.744Z · comments (27)
[link] Predicting GPU performance
Marius Hobbhahn (marius-hobbhahn) · 2022-12-14T16:27:23.923Z · comments (26)
Notice when you stop reading right before you understand
just_browsing · 2022-12-20T05:09:43.224Z · comments (6)
The Meditation on Winter
Raemon · 2022-12-25T16:12:10.039Z · comments (3)
CIRL Corrigibility is Fragile
Rachel Freedman (rachelAF) · 2022-12-21T01:40:50.232Z · comments (9)
High-level hopes for AI alignment
HoldenKarnofsky · 2022-12-15T18:00:15.625Z · comments (3)
YCombinator fraud rates
Xodarap · 2022-12-25T19:21:52.829Z · comments (3)
[link] Concrete Steps to Get Started in Transformer Mechanistic Interpretability
Neel Nanda (neel-nanda-1) · 2022-12-25T22:21:49.686Z · comments (7)
MIRI's "Death with Dignity" in 60 seconds.
Cleo Nardo (strawberry calm) · 2022-12-06T17:18:58.387Z · comments (4)
My thoughts on OpenAI's alignment plan
Akash (akash-wasil) · 2022-12-30T19:33:15.019Z · comments (3)
[link] Formalization as suspension of intuition
adamShimi · 2022-12-11T15:16:44.319Z · comments (18)
In defense of probably wrong mechanistic models
evhub · 2022-12-06T23:24:20.707Z · comments (10)
Reframing inner alignment
davidad · 2022-12-11T13:53:23.195Z · comments (13)
Air-gapping evaluation and support
Ryan Kidd (ryankidd44) · 2022-12-26T22:52:29.881Z · comments (1)
Announcing: The Independent AI Safety Registry
Shoshannah Tekofsky (DarkSym) · 2022-12-26T21:22:18.381Z · comments (9)
The "Minimal Latents" Approach to Natural Abstractions
johnswentworth · 2022-12-20T01:22:25.101Z · comments (24)
Take 13: RLHF bad, conditioning good.
Charlie Steiner · 2022-12-22T10:44:06.359Z · comments (4)
Nook Nature
[DEACTIVATED] Duncan Sabien (Duncan_Sabien) · 2022-12-05T04:10:37.797Z · comments (18)
My AGI safety research—2022 review, ’23 plans
Steven Byrnes (steve2152) · 2022-12-14T15:15:52.473Z · comments (10)
Positive values seem more robust and lasting than prohibitions
TurnTrout · 2022-12-17T21:43:31.627Z · comments (13)
[link] My Reservations about Discovering Latent Knowledge (Burns, Ye, et al)
Robert_AIZI · 2022-12-27T17:27:02.225Z · comments (0)
Take 7: You should talk about "the human's utility function" less.
Charlie Steiner · 2022-12-08T08:14:17.275Z · comments (22)
Next Level Seinfeld
Zvi · 2022-12-19T13:30:00.538Z · comments (8)
China Covid #4
Zvi · 2022-12-22T16:30:00.919Z · comments (2)
[link] Think wider about the root causes of progress
jasoncrawford · 2022-12-21T20:05:46.986Z · comments (11)
Looking Back on Posts From 2022
Zvi · 2022-12-26T13:20:00.745Z · comments (8)
Applications open for AGI Safety Fundamentals: Alignment Course
Richard_Ngo (ricraz) · 2022-12-13T18:31:55.068Z · comments (0)