LessWrong 2.0 Reader

2022 Unofficial LessWrong General Census
Screwtape · 2023-01-30T18:36:30.616Z · comments (33)
Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS)
Scott Emmons · 2023-05-31T17:09:02.288Z · comments (1)
Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy
Buck · 2023-07-26T17:02:56.456Z · comments (19)
A summary of every "Highlights from the Sequences" post
Akash (akash-wasil) · 2022-07-15T23:01:04.392Z · comments (7)
DALL-E by OpenAI
Daniel Kokotajlo (daniel-kokotajlo) · 2021-01-05T20:05:46.718Z · comments (20)
Tessellating Hills: a toy model for demons in imperfect search
DaemonicSigil · 2020-02-20T00:12:50.125Z · comments (18)
Closing Notes on Nonlinear Investigation
Ben Pace (Benito) · 2023-09-15T22:44:58.488Z · comments (47)
[link] Seven lessons I didn't learn from election day
Eric Neyman (UnexpectedValues) · 2024-11-14T18:39:07.053Z · comments (33)
[link] CIV: a story
Richard_Ngo (ricraz) · 2024-06-15T22:36:50.415Z · comments (6)
Clarifying “What failure looks like”
Sam Clarke · 2020-09-20T20:40:48.295Z · comments (14)
The Road to Mazedom
Zvi · 2020-01-18T14:10:00.846Z · comments (26)
Given the Restrict Act, Don’t Ban TikTok
Zvi · 2023-04-04T14:40:03.162Z · comments (9)
Slack Has Positive Externalities For Groups
johnswentworth · 2021-07-29T15:03:25.929Z · comments (11)
Learn the mathematical structure, not the conceptual structure
Adam Shai (adam-shai) · 2023-03-01T22:24:19.451Z · comments (35)
Would we even want AI to solve all our problems?
So8res · 2023-04-21T18:04:11.636Z · comments (15)
The "Think It Faster" Exercise
Raemon · 2024-12-11T19:14:10.427Z · comments (13)
How To Make Prediction Markets Useful For Alignment Work
johnswentworth · 2022-10-18T19:01:01.292Z · comments (18)
[link] ARC paper: Formalizing the presumption of independence
Erik Jenner (ejenner) · 2022-11-20T01:22:55.110Z · comments (2)
[link] Announcing Epoch: A research organization investigating the road to Transformative AI
Jsevillamol · 2022-06-27T13:55:51.451Z · comments (2)
[link] the Giga Press was a mistake
bhauth · 2024-08-21T04:51:24.150Z · comments (26)
Luna Lovegood and the Chamber of Secrets - Part 2
lsusr · 2020-11-30T08:12:07.238Z · comments (5)
Being Productive With Chronic Health Conditions
lynettebye · 2020-11-05T00:22:50.871Z · comments (7)
Jitters No Evidence of Stupidity in RL
1a3orn · 2021-09-16T22:43:57.972Z · comments (18)
[question] Lying to chess players for alignment
Zane · 2023-10-25T17:47:15.033Z · answers+comments (54)
Shah (DeepMind) and Leahy (Conjecture) Discuss Alignment Cruxes
OliviaJ (olivia-jimenez-1) · 2023-05-01T16:47:41.655Z · comments (10)
Does SGD Produce Deceptive Alignment?
Mark Xu (mark-xu) · 2020-11-06T23:48:09.667Z · comments (9)
Omicron Post #8
Zvi · 2021-12-20T23:10:01.630Z · comments (33)
[link] Atoms to Agents Proto-Lectures
johnswentworth · 2023-09-22T06:22:05.456Z · comments (14)
Deceptive AI ≠ Deceptively-aligned AI
Steven Byrnes (steve2152) · 2024-01-07T16:55:13.761Z · comments (19)
The case for unlearning that removes information from LLM weights
Fabien Roger (Fabien) · 2024-10-14T14:08:04.775Z · comments (15)
[link] Compact Proofs of Model Performance via Mechanistic Interpretability
LawrenceC (LawChan) · 2024-06-24T19:27:21.214Z · comments (3)
Public Transit is not Infinitely Safe
jefftk (jkaufman) · 2023-06-20T18:40:02.011Z · comments (34)
OpenAI's Sora is an agent
CBiddulph (caleb-biddulph) · 2024-02-16T07:35:52.171Z · comments (25)
A circuit for Python docstrings in a 4-layer attention-only transformer
StefanHex (Stefan42) · 2023-02-20T19:35:14.027Z · comments (8)
[question] What are the strongest arguments for very short timelines?
Kaj_Sotala · 2024-12-23T09:38:56.905Z · answers+comments (74)
The purposeful drunkard
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-12T12:27:51.952Z · comments (9)
A breakdown of AI capability levels focused on AI R&D labor acceleration
ryan_greenblatt · 2024-12-22T20:56:00.298Z · comments (5)
The Telephone Theorem: Information At A Distance Is Mediated By Deterministic Constraints
johnswentworth · 2021-08-31T16:50:13.483Z · comments (26)
[link] New blog: Planned Obsolescence
Ajeya Cotra (ajeya-cotra) · 2023-03-27T19:46:25.429Z · comments (7)
Exercise: Taboo "Should"
johnswentworth · 2021-01-22T21:02:46.649Z · comments (28)
[link] Almost everyone I’ve met would be well-served thinking more about what to focus on
Henrik Karlsson (henrik-karlsson) · 2024-01-05T21:01:27.861Z · comments (8)
Fixed Point: a love story
Richard_Ngo (ricraz) · 2023-07-08T13:56:54.807Z · comments (2)
[link] Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Olli Järviniemi (jarviniemi) · 2024-05-06T07:07:05.019Z · comments (13)
When Someone Tells You They're Lying, Believe Them
ymeskhout · 2023-07-14T00:31:48.168Z · comments (3)
Counting arguments provide no evidence for AI doom
Nora Belrose (nora-belrose) · 2024-02-27T23:03:49.296Z · comments (188)
I am the Golden Gate Bridge
Zvi · 2024-05-27T14:40:03.216Z · comments (6)
Here's the exit.
Valentine · 2022-11-21T18:07:23.607Z · comments (178)
Announcing FAR Labs, an AI safety coworking space
Ben Goldhaber (bgold) · 2023-09-29T16:52:37.753Z · comments (0)
Apply to the ML for Alignment Bootcamp (MLAB) in Berkeley [Jan 3 - Jan 22]
habryka (habryka4) · 2021-11-03T18:22:58.879Z · comments (4)
A shot at the diamond-alignment problem
TurnTrout · 2022-10-06T18:29:10.586Z · comments (67)