LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

The new ruling philosophy regarding AI
Mitchell_Porter · 2024-11-11T13:28:24.476Z · comments (0)

5. Open Corrigibility Questions
Max Harms (max-harms) · 2024-06-10T14:09:20.777Z · comments (0)

Big-endian is better than little-endian
Menotim · 2024-04-29T02:30:48.053Z · comments (17)

[link] Why Recursion Pharmaceuticals abandoned cell painting for brightfield imaging
Abhishaike Mahajan (abhishaike-mahajan) · 2024-11-05T14:51:41.310Z · comments (1)

[link] What fuels your ambition?
Cissy · 2024-01-31T18:30:53.274Z · comments (1)

[link] GDP per capita in 2050
Hauke Hillebrandt (hauke-hillebrandt) · 2024-05-06T15:14:30.934Z · comments (8)

[question] Where to find reliable reviews of AI products?
Elizabeth (pktechgirl) · 2024-09-17T23:48:25.899Z · answers+comments (6)

Scorable Functions: A Format for Algorithmic Forecasting
ozziegooen · 2024-05-21T04:14:11.749Z · comments (0)

[LDSL#4] Root cause analysis versus effect size estimation
tailcalled · 2024-08-11T16:12:14.604Z · comments (0)

[link] A new process for mapping discussions
Nathan Young · 2024-09-30T08:57:20.029Z · comments (8)

Aligning AI Safety Projects with a Republican Administration
Deric Cheng (deric-cheng) · 2024-11-21T22:12:27.502Z · comments (1)

DPO/PPO-RLHF on LLMs incentivizes sycophancy, exaggeration and deceptive hallucination, but not misaligned powerseeking
tailcalled · 2024-06-10T21:20:11.938Z · comments (13)

[question] What Other Lines of Work are Safe from AI Automation?
RogerDearnaley (roger-d-1) · 2024-07-11T10:01:12.616Z · answers+comments (35)

Examples of How I Use LLMs
jefftk (jkaufman) · 2024-10-14T17:10:04.597Z · comments (2)

[link] AI & wisdom 1: wisdom, amortised optimisation, and AI
L Rudolf L (LRudL) · 2024-10-28T21:02:51.215Z · comments (0)

NYC Congestion Pricing: Early Days
Zvi · 2025-01-14T14:00:07.445Z · comments (0)

Disagreement on AGI Suggests It’s Near
tangerine · 2025-01-07T20:42:43.456Z · comments (15)

You can validly be seen and validated by a chatbot
Kaj_Sotala · 2024-12-20T12:00:03.015Z · comments (3)

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces
Matthew A. Clarke (Antigone) · 2024-12-20T15:16:51.857Z · comments (0)

Mini Go: Gateway Game
jefftk (jkaufman) · 2025-01-14T03:30:02.020Z · comments (1)

Corrigibility's Desirability is Timing-Sensitive
RobertM (T3t) · 2024-12-26T22:24:17.435Z · comments (4)

Monthly Roundup #19: June 2024
Zvi · 2024-06-25T12:00:03.333Z · comments (9)

[question] Which things were you surprised to learn are metaphors?
Gordon Seidoh Worley (gworley) · 2024-11-22T03:46:02.845Z · answers+comments (18)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
Daniel Lee (daniel-lee) · 2024-09-06T02:28:41.954Z · comments (0)

[link] Arithmetic Models: Better Than You Think
kqr · 2024-10-26T09:42:07.185Z · comments (4)

Option control
Joe Carlsmith (joekc) · 2024-11-04T17:54:03.073Z · comments (0)

[link] Our Digital and Biological Children
Eneasz · 2024-10-24T18:36:38.719Z · comments (0)

[link] If-Then Commitments for AI Risk Reduction [by Holden Karnofsky]
habryka (habryka4) · 2024-09-13T19:38:53.194Z · comments (0)

Trading Candy
jefftk (jkaufman) · 2024-11-01T01:10:08.024Z · comments (4)

Reading More Each Day: A Simple $35 Tool
aysajan · 2024-07-24T13:54:04.290Z · comments (2)

Distinguishing ways AI can be "concentrated"
Matthew Barnett (matthew-barnett) · 2024-10-21T22:21:13.666Z · comments (2)

Childhood and Education Roundup #6: College Edition
Zvi · 2024-06-26T11:40:03.990Z · comments (8)

Tackling Moloch: How YouCongress Offers a Novel Coordination Mechanism
Hector Perez Arenas (hector-perez-arenas) · 2024-05-15T23:13:48.501Z · comments (9)

Concrete Methods for Heuristic Estimation on Neural Networks
Oliver Daniels (oliver-daniels-koch) · 2024-11-14T05:07:55.240Z · comments (0)

[link] ML Safety Research Advice - GabeM
Gabe M (gabe-mukobi) · 2024-07-23T01:45:42.288Z · comments (2)

DIY RLHF: A simple implementation for hands on experience
Mike Vaiana (mike-vaiana) · 2024-07-10T12:07:03.047Z · comments (0)

AI #64: Feel the Mundane Utility
Zvi · 2024-05-16T15:20:02.956Z · comments (11)

Can quantised autoencoders find and interpret circuits in language models?
charlieoneill (kingchucky211) · 2024-03-24T20:05:50.125Z · comments (4)

Auditing LMs with counterfactual search: a tool for control and ELK
Jacob Pfau (jacob-pfau) · 2024-02-20T00:02:09.575Z · comments (6)

{Book Summary} The Art of Gathering
Tristan Williams (tristan-williams) · 2024-04-16T10:48:41.528Z · comments (0)

[link] AI Safety at the Frontier: Paper Highlights, August '24
gasteigerjo · 2024-09-03T19:17:24.850Z · comments (0)

Cryonics p(success) estimates are only weakly associated with interest in pursuing cryonics in the LW 2023 Survey
Andy_McKenzie · 2024-02-29T14:47:28.613Z · comments (6)

An explanation of evil in an organized world
KatjaGrace · 2024-05-02T05:20:06.240Z · comments (9)

An Affordable CO2 Monitor
Pretentious Penguin (dylan-mahoney) · 2024-03-21T03:06:53.255Z · comments (1)

Collection (Part 6 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-14T21:37:00.160Z · comments (0)

Towards Quantitative AI Risk Management
Henry Papadatos (henry) · 2024-10-16T19:26:48.817Z · comments (1)

Aggregative principles approximate utilitarian principles
Cleo Nardo (strawberry calm) · 2024-06-12T16:27:22.179Z · comments (3)

Cicadas, Anthropic, and the bilateral alignment problem
kromem · 2024-05-22T11:09:56.469Z · comments (6)

AI #65: I Spy With My AI
Zvi · 2024-05-23T12:40:02.793Z · comments (7)

[link] Quick Thoughts on Scaling Monosemanticity
Joel Burget (joel-burget) · 2024-05-23T16:22:48.035Z · comments (1)

← previous page (newer posts) · next page (older posts) →

^{^}

Not that this is the only barrier, but it felt like the most fundamental one, the one I was so confused about that I had no idea where to start on designing something that surmounted it.

^{^}

My understanding seems to change every day (in the last.. 4 days?), these are today's notes. If you'd find a less drafty version useful, lmk

^{^}

Control can be used as an "extra line of defense" in case alignment fails since Control doesn't depend on alignment

^{^}

Typically we'd trust the weaker model simply because it's not smart enough to "scheme", but maybe this is only the plan to start off, before we have better reasons to trust the model (read on)

^{^}

Training against scheming/lying might result in a model that is more capable of avoiding the ways we noticed that behaviour in the first place

LessWrong 2.0 Reader

Archive

Recent comments