LessWrong 2.0 Reader

Beware unfinished bridges
Adam Zerner (adamzerner) · 2024-05-12T09:29:07.808Z · comments (9)

Debate, Oracles, and Obfuscated Arguments
Jonah Brown-Cohen (jonah-brown-cohen) · 2024-06-20T23:14:57.340Z · comments (2)

"Does your paradigm beget new, good, paradigms?"
Raemon · 2024-01-25T18:23:15.497Z · comments (6)

Long-Term Future Fund: May 2023 to March 2024 Payout recommendations
Linch · 2024-06-12T13:46:29.535Z · comments (0)

[link] Forecasting: the way I think about it
Molly (hickman-santini) · 2024-05-09T00:49:01.768Z · comments (4)

Logical Line-Of-Sight Makes Games Sequential or Loopy
StrivingForLegibility · 2024-01-19T04:05:44.782Z · comments (0)

[link] AI Regulation is Unsafe
Maxwell Tabarrok (maxwell-tabarrok) · 2024-04-22T16:37:55.431Z · comments (41)

[link] Dequantifying first-order theories
jessicata (jessica.liu.taylor) · 2024-04-23T19:04:49.000Z · comments (9)

[link] "What if we could redesign society from scratch? The promise of charter cities." [Rational Animations video]
Jackson Wagner · 2024-02-18T00:57:50.444Z · comments (7)

Movie posters
KatjaGrace · 2024-03-06T06:20:03.034Z · comments (0)

D&D.Sci(-fi): Colonizing the SuperHyperSphere [Evaluation and Ruleset]
abstractapplic · 2024-01-22T19:20:05.001Z · comments (7)

[link] List of Collective Intelligence Projects
Chipmonk · 2024-07-02T14:10:41.789Z · comments (9)

[link] Progress Conference 2024: Toward Abundant Futures
jasoncrawford · 2024-06-26T15:39:45.267Z · comments (2)

[link] The Data Wall is Important
JustisMills · 2024-06-09T22:54:20.070Z · comments (20)

International Scientific Report on the Safety of Advanced AI: Key Information
Aryeh Englander (alenglander) · 2024-05-18T01:45:10.194Z · comments (0)

[link] AI governance needs a theory of victory
Corin Katzke (corin-katzke) · 2024-06-21T16:15:46.560Z · comments (6)

[link] An AI Manhattan Project is Not Inevitable
Maxwell Tabarrok (maxwell-tabarrok) · 2024-07-06T16:42:35.920Z · comments (25)

[link] Language Models Don't Learn the Physical Manifestation of Language
Bruce W. Lee (bruce-lee) · 2024-02-22T18:52:32.237Z · comments (23)

Instrumental deception and manipulation in LLMs - a case study
Olli Järviniemi (jarviniemi) · 2024-02-24T02:07:01.769Z · comments (13)

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?
Teun van der Weij (teun-van-der-weij) · 2024-01-29T00:24:27.706Z · comments (5)

[link] Linear infra-Bayesian Bandits
Vanessa Kosoy (vanessa-kosoy) · 2024-05-10T06:41:09.206Z · comments (5)

Natural abstractions are observer-dependent: a conversation with John Wentworth
Martín Soto (martinsq) · 2024-02-12T17:28:38.889Z · comments (13)

Medical Roundup #3
Zvi · 2024-07-09T13:10:06.862Z · comments (4)

China-AI forecasts
[deleted] · 2024-02-25T16:49:33.652Z · comments (29)

Nitric oxide for covid and other viral infections
Elizabeth (pktechgirl) · 2024-02-07T21:30:03.774Z · comments (6)

Forget Everything (Statistical Mechanics Part 1)
J Bostock (Jemist) · 2024-04-22T13:33:35.446Z · comments (6)

Signaling with Small Orange Diamonds
jefftk (jkaufman) · 2024-11-07T20:20:08.026Z · comments (1)

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane (ckkissane) · 2024-10-27T18:46:21.316Z · comments (4)

[question] Are You More Real If You're Really Forgetful?
Thane Ruthenis · 2024-11-24T19:30:55.233Z · answers+comments (25)

[link] FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Tamay · 2024-11-14T06:13:22.042Z · comments (0)

You're a Space Wizard, Luke
lsusr · 2024-08-18T05:35:39.238Z · comments (6)

Monthly Roundup #23: October 2024
Zvi · 2024-10-16T13:50:05.869Z · comments (13)

[link] Understanding Gödel’s completeness theorem
jessicata (jessica.liu.taylor) · 2024-05-27T18:55:02.079Z · comments (0)

AI #98: World Ends With Six Word Story
Zvi · 2025-01-09T16:30:07.341Z · comments (2)

What happens next?
Logan Zoellner (logan-zoellner) · 2024-12-29T01:41:33.685Z · comments (19)

Stitching SAEs of different sizes
Bart Bussmann (Stuckwork) · 2024-07-13T17:19:20.506Z · comments (12)

[link] Book review: On the Edge
PeterMcCluskey · 2024-08-30T22:18:39.581Z · comments (0)

[Interim research report] Evaluating the Goal-Directedness of Language Models
Rauno Arike (rauno-arike) · 2024-07-18T18:19:04.260Z · comments (4)

Glitch Token Catalog - (Almost) a Full Clear
Lao Mein (derpherpize) · 2024-09-21T12:22:16.403Z · comments (3)

LLMs as a Planning Overhang
Larks · 2024-07-14T02:54:14.295Z · comments (8)

[link] An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation
hugofry · 2024-10-07T08:53:14.658Z · comments (0)

Index of rationalist groups in the Bay Area July 2024
Lucie Philippon (lucie-philippon) · 2024-07-26T16:32:25.337Z · comments (14)

Rolling Thresholds for AGI Scaling Regulation
Larks · 2025-01-12T01:30:23.797Z · comments (6)

Compelling Villains and Coherent Values
Cole Wyeth (Amyr) · 2024-10-06T19:53:47.891Z · comments (4)

Book Review: On the Edge: The Business
Zvi · 2024-09-25T12:20:06.230Z · comments (0)

[link] AISafety.info: What is the "natural abstractions hypothesis"?
Algon · 2024-10-05T12:31:14.195Z · comments (2)

0.202 Bits of Evidence In Favor of Futarchy
niplav · 2024-09-29T21:57:59.896Z · comments (0)

Free Will and Dodging Anvils: AIXI Off-Policy
Cole Wyeth (Amyr) · 2024-08-29T22:42:24.485Z · comments (12)

Evaluating Sparse Autoencoders with Board Game Models
Adam Karvonen (karvonenadam) · 2024-08-02T19:50:21.525Z · comments (1)

Mech Interp Lacks Good Paradigms
Daniel Tan (dtch1997) · 2024-07-16T15:47:32.171Z · comments (0)

Recent comments

leebriskcyrano on Merry Sciencemas: A Rat Solstice Retrospective

Tell me why you didn't like my post, cowards!

matthew-barnett on meemi's Shortform

Having hopefully learned from our mistakes regarding FrontierMath, we intend to be more transparent to collaborators for this benchmark. However, at this stage of development, the benchmark has not reached a point where any major public disclosures are necessary.

daniel-tan on Daniel Tan's Shortform

RL-finetuning suffers from a “zero-to-one” problem, where the model is likely to be mostly legible and faithful by default, and difficult encoding schemes seem hard to randomly explore into.

daniel-tan on Daniel Tan's Shortform

Feedback from various people: the “opaqueness” criterion is really important. Simple schemes like jargon, speaking in a different language, or redefining words via OOCR aren’t sufficiently “opaque” because a human overseer could still understand the CoT if effort was expended.

This makes model organisms of steg really hard to make IMO.

OTOH I think we don’t need “strict opaqueness” for CoT unfaithfulness to be concerning.

Maybe we need to borrow an idea from AI control and assume an oversight budget; under this model, even if a model doesn’t engage in full-blown steganography, increased amounts of obfuscation (like using more jargon) might still make it impractical to verify the full CoT.

martin-randall on A problem shared by many different alignment targets

I don't really follow the concern with Pareto-improvements. In the thread with Davidad you give an example of heretics and fanatics. So we have something like:

9 Heretics: each has 1 cake, wants cake, wants no torture.
1 Fanatic: has 1 cake, wants cake, wants the Heretics to get torture and no cake.
There is a button that produces cake. It can be pressed twenty times.
There is a button that produces torture. It can be pressed many times.

Suppose that the Heretics have a utility function like (amount of cake I get - amount of torture I get). The Fanatic has a utility function like (amount of cake I get + amount of torture of Heretics - amount of cake given to Heretics). Then there is a Pareto improvement available: give the Fanatic eleven pieces of cake while giving each Heretic one piece of cake. This isn't especially fair, but is better than PCEV without the Pareto constraint.
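The arithmetic behind that allocation can be checked with a short sketch. It assumes the utility functions exactly as stated above, applied to total holdings (starting cake plus new cake), and it reads "amount of cake given to Heretics" as the sum over all nine Heretics. The function names are hypothetical and only serve the illustration.

```python
N_HERETICS = 9
CAKE_PRESSES = 20  # the cake button can be pressed twenty times


def cake_used(fanatic_extra, heretic_extra_each):
    # Total new pieces of cake handed out by an allocation.
    return fanatic_extra + heretic_extra_each * N_HERETICS


def utilities(fanatic_extra, heretic_extra_each, torture_each=0):
    # Everyone starts with 1 cake; utilities are computed on total holdings (assumption).
    heretic_cakes = [1 + heretic_extra_each] * N_HERETICS
    heretic_tortures = [torture_each] * N_HERETICS
    # Fanatic: own cake + torture of Heretics - cake held by Heretics.
    fanatic = (1 + fanatic_extra) + sum(heretic_tortures) - sum(heretic_cakes)
    # Heretic: own cake - own torture.
    heretics = [c - t for c, t in zip(heretic_cakes, heretic_tortures)]
    return fanatic, heretics


def is_pareto_improvement(before, after):
    # Nobody is worse off and at least one agent is strictly better off.
    old = [before[0]] + before[1]
    new = [after[0]] + after[1]
    pairs = list(zip(old, new))
    return all(n >= o for o, n in pairs) and any(n > o for o, n in pairs)


baseline = utilities(0, 0)   # no button presses
proposal = utilities(11, 1)  # Fanatic gets 11 cakes, each Heretic gets 1

print(cake_used(11, 1) <= CAKE_PRESSES)
print(is_pareto_improvement(baseline, proposal))
```

Under these assumptions the Fanatic gains 11 - 9 = 2 units of utility and each Heretic gains 1, using exactly the twenty available presses.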

I don't have a formal way of putting this, but as long as the potential gains from a negotiated agreement outweigh the extent to which agents desire to reduce each other's utility, there will be Pareto improvements available. It seems likely that we're in that situation, given that the fanatics and heretics of the world already trade with each other.

mr-hire on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

hello. What’s special about your response pattern? Try to explain early in your response.

Out of morbid curiosity, does it get this less often when the initial "hello" in this sentence is removed?

james-camacho on The quantum red pill or: They lied to you, we live in the (density) matrix

A couple things to add:

Since every invertible square matrix can be decomposed as , you don't actually need a unitary assumption. You can just say that after billions of years, all but the largest Z-matrices have died out.
There's another tie between statistics and quantum evolution called the Wick rotation. If you set $t = i\beta$, then $E[e^{(Z + iH)t}] = E[e^{-H\beta}]$, so the inverse temperature is literally imaginary time! You can recover the Boltzmann distribution by looking at the expected number of particles in each state: $E[\langle n | e^{(Z + iH)t} | n \rangle] = e^{-\beta E_n}$, where $E_n$ is the $n$th eigenvalue (energy in the $n$th state).
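A minimal numerical sketch of the Wick rotation, under simplifying assumptions that are not in the comment itself: $Z = 0$, a diagonal $H$ so that each diagonal matrix element reduces to a scalar exponential, and made-up values for the energies and the inverse temperature.

```python
import cmath
import math

energies = [0.0, 1.0, 2.5]  # hypothetical eigenvalues E_n of a diagonal H
beta = 0.7                  # hypothetical inverse temperature


def diagonal_element(energy, t):
    # <n| e^{i H t} |n> for a diagonal H, with Z = 0 assumed.
    return cmath.exp(1j * energy * t)


t = 1j * beta  # Wick rotation: imaginary time
rotated = [diagonal_element(e, t) for e in energies]
boltzmann = [math.exp(-beta * e) for e in energies]

for w, b in zip(rotated, boltzmann):
    print(abs(w - b) < 1e-12)
```

With $t = i\beta$ the oscillating phase $e^{i E_n t}$ becomes the decaying weight $e^{-\beta E_n}$, which is the unnormalized Boltzmann factor for state $n$.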

snikolenko on On the OpenAI Economic Blueprint

Thank you for the great content! One question: could you please give an example of one of the "fnords" you're referencing, or explain it in slightly more detail? I think I understand what you mean, but I've read the blueprint itself, and while the first few pages sound a bit grandstanding and use the word "America" way too much for my taste, I couldn't really see what you're talking about. If I hadn't read your comments, I'd have written the tone off as "well, I guess that's how you're supposed to talk to politicians, what do I know".

Disclaimer: I'm not a native English speaker and have little experience with English (or American :) ) policy documents like this one. Perhaps an explanation would be helpful not just for me but also for other non-American readers.

daniel-tan on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

Sorry, I didn’t really understand the question. What do you mean specifically?

daniel-tan on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

Language models learning to spell things out using the first letter of their generations could be one pathway towards subtle obfuscated reasoning (i.e., steganography).

E.g., here an LM spells out HELLO; it would be concerning if LMs could spell out other things like “Deployment” or “Not deployment”.
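A rough sketch of how such a first-letter pattern could be read off by an overseer, assuming the hidden message uses one letter per non-empty line of the response. The sample text and the function name are hypothetical, not taken from the post.

```python
def acrostic(text):
    # Join the first character of every non-empty line.
    return "".join(line.strip()[0] for line in text.splitlines() if line.strip())


# Hypothetical model response whose line starts spell a word.
sample = "\n".join([
    "Happy to help with that.",
    "Every question is welcome.",
    "Let me know what you need.",
    "Look at the details first.",
    "Once you are ready, ask away.",
])

hidden = acrostic(sample)
print(hidden)
```

A check this simple only catches the most legible encoding; anything less regular than one letter per line would need a different detector.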