LessWrong 2.0 Reader

← previous page (newer posts) · next page (older posts) →

Beards and Masks?
jefftk (jkaufman) · 2025-01-18T16:00:04.049Z · comments (0)

On Dwarkesh Patel’s 4th Podcast With Tyler Cowen
Zvi · 2025-01-10T13:50:05.563Z · comments (7)

[link] Began a pay-on-results coaching experiment, made $40,300 since July
Chipmonk · 2024-12-29T21:12:02.574Z · comments (14)

[link] Two interviews with the founder of DeepSeek
Cosmia_Nebula · 2024-11-29T03:18:47.246Z · comments (1)

AI #89: Trump Card
Zvi · 2024-11-07T16:30:05.684Z · comments (12)

Causal inference for the home gardener
braces · 2024-11-27T17:55:52.629Z · comments (1)

ARENA 4.0 Impact Report
Chloe Li (chloe-li-1) · 2024-11-27T20:51:54.844Z · comments (3)

[link] IAPS: Mapping Technical Safety Research at AI Companies
Zach Stein-Perlman · 2024-10-24T20:30:41.159Z · comments (13)

Meta Pivots on Content Moderation
Zvi · 2025-01-17T14:20:06.727Z · comments (3)

AXRP Episode 39 - Evan Hubinger on Model Organisms of Misalignment
DanielFilan · 2024-12-01T06:00:06.345Z · comments (0)

Are we dropping the ball on Recommendation AIs?
Charbel-Raphaël (charbel-raphael-segerie) · 2024-10-23T17:48:00.000Z · comments (17)

Trying to translate when people talk past each other
Kaj_Sotala · 2024-12-17T09:40:02.640Z · comments (12)

[link] A car journey with conservative evangelicals - Understanding some British political-religious beliefs
Nathan Young · 2024-12-06T11:22:45.563Z · comments (8)

Causal Undertow: A Work of Seed Fiction
Daniel Murfet (dmurfet) · 2024-12-08T21:41:48.132Z · comments (0)

[link] College technical AI safety hackathon retrospective - Georgia Tech
yix (Yixiong Hao) · 2024-11-15T00:22:53.159Z · comments (2)

MATS mentor selection
DanielFilan · 2025-01-10T03:12:52.141Z · comments (8)

Reflections on the Metastrategies Workshop
gw · 2024-10-24T18:30:46.255Z · comments (5)

Winners of the Essay competition on the Automation of Wisdom and Philosophy
owencb · 2024-10-28T17:10:04.272Z · comments (3)

How to use bright light to improve your life.
Nat Martin (nat-martin) · 2024-11-18T19:32:10.667Z · comments (10)

[question] What are the most interesting / challenging evals (for humans) available?
Raemon · 2024-12-27T03:05:26.831Z · answers+comments (13)

Estimating the benefits of a new flu drug (BXM)
DirectedEvolution (AllAmericanBreakfast) · 2025-01-06T04:31:16.837Z · comments (2)

[link] Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake
TurnTrout · 2024-11-19T18:36:20.721Z · comments (5)

My January alignment theory Nanowrimo
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-02T00:07:24.050Z · comments (2)

What happens next?
Logan Zoellner (logan-zoellner) · 2024-12-29T01:41:33.685Z · comments (19)

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane (ckkissane) · 2024-10-27T18:46:21.316Z · comments (4)

[link] FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Tamay · 2024-11-14T06:13:22.042Z · comments (0)

AI #98: World Ends With Six Word Story
Zvi · 2025-01-09T16:30:07.341Z · comments (2)

[question] Are You More Real If You're Really Forgetful?
Thane Ruthenis · 2024-11-24T19:30:55.233Z · answers+comments (25)

Signaling with Small Orange Diamonds
jefftk (jkaufman) · 2024-11-07T20:20:08.026Z · comments (1)

Rolling Thresholds for AGI Scaling Regulation
Larks · 2025-01-12T01:30:23.797Z · comments (6)

Litigate-for-Impact: Preparing Legal Action against an AGI Frontier Lab Leader
Sonia Joseph (redhat) · 2024-12-07T21:42:29.038Z · comments (7)

A Novel Emergence of Meta-Awareness in LLM Fine-Tuning
rife (edgar-muniz) · 2025-01-15T22:59:46.321Z · comments (17)

Drug development costs can range over two orders of magnitude
rossry · 2024-11-03T23:13:17.685Z · comments (0)

AI Safety Camp 10
Robert Kralisch (nonmali-1) · 2024-10-26T11:08:09.887Z · comments (9)

Resolving von Neumann-Morgenstern Inconsistent Preferences
niplav · 2024-10-22T11:45:20.915Z · comments (5)

Building Big Science from the Bottom-Up: A Fractal Approach to AI Safety
Lauren Greenspan (LaurenGreenspan) · 2025-01-07T03:08:51.447Z · comments (2)

Intent alignment as a stepping-stone to value alignment
Seth Herd · 2024-11-05T20:43:24.950Z · comments (4)

Doing Research Part-Time is Great
casualphysicsenjoyer (hatta_afiq) · 2024-11-22T19:01:15.542Z · comments (7)

The Laws of Large Numbers
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-04T11:54:16.967Z · comments (11)

Orca communication project - seeking feedback (and collaborators)
Towards_Keeperhood (Simon Skade) · 2024-12-03T17:29:40.802Z · comments (16)

[link] Locally optimal psychology
Chipmonk · 2024-11-25T18:35:11.985Z · comments (7)

Cross-context abduction: LLMs make inferences about procedural training data leveraging declarative facts in earlier training data
Sohaib Imran (sohaib-imran) · 2024-11-16T23:22:21.857Z · comments (11)

A Matter of Taste
Zvi · 2024-12-18T17:50:07.201Z · comments (4)

Grammars, subgrammars, and combinatorics of generalization in transformers
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-02T09:37:23.191Z · comments (0)

Is the Power Grid Sustainable?
jefftk (jkaufman) · 2024-10-26T02:30:06.612Z · comments (38)

[link] The Way According To Zvi
Sable · 2024-12-07T17:35:48.769Z · comments (0)

Childhood and Education #8: Dealing with the Internet
Zvi · 2025-01-06T14:00:09.604Z · comments (7)

Fireplace and Candle Smoke
jefftk (jkaufman) · 2025-01-01T01:50:01.408Z · comments (4)

Deep Learning is cheap Solomonoff induction?
Lucius Bushnaq (Lblack) · 2024-12-07T11:00:56.455Z · comments (1)

Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence
EuanMcLean (euanmclean) · 2024-10-29T12:16:18.448Z · comments (8)

Archive

Recent comments

leebriskcyrano on Merry Sciencemas: A Rat Solstice Retrospective

Tell me why you didn't like my post, cowards!

daniel-tan on Daniel Tan's Shortform

RL-finetuning suffers from a “zero-to-one” problem, where the model is likely to be mostly legible and faithful by default, and difficult encoding schemes seem hard to randomly explore into.

daniel-tan on Daniel Tan's Shortform

Feedback from various people: the “opaqueness” criterion is really important. Simple schemes like jargon, speaking in a different language, or redefining words via OOCR aren’t sufficiently “opaque” because a human overseer could still understand the CoT if effort was expended.

This makes model organisms of steg really hard to make IMO.

OTOH I think we don’t need “strict opaqueness” for CoT unfaithfulness to be concerning.

Maybe we need to borrow an idea from AI control and assume an oversight budget; under this model, even if a model doesn’t engage in full-blown steganography, increased amounts of obfuscation (like using more jargon) might still make it impractical to verify the full CoT.

martin-randall on A problem shared by many different alignment targets

I don't really follow the concern with Pareto-improvements. In the thread with Davidad you give an example of heretics and fanatics. So we have something like:

9 Heretics: have 1 cake, want cake, no torture.
1 Fanatic: have 1 cake, want cake, want heretics to get torture, no cake
There is a button that produces cake, it can be pressed twenty times.
There is a button that produces torture. It can be pressed many times.

Suppose that the Heretics have a utility function like (amount of cake I get - amount of torture I get). The Fanatic has a utility function like (amount of cake I get + amount of torture of Heretics - amount of cake given to Heretics). Then there is a Pareto-improvement available: give the Fanatic eleven pieces of cake while giving each Heretic one piece of cake. This isn't especially fair, but is better than PCEV without the Pareto constraint.
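The arithmetic behind that allocation can be checked with a short sketch. It assumes the utility functions are exactly the linear forms given above, that "cake given to Heretics" means the total across all nine Heretics, and that utilities are measured as changes relative to the status quo. The function names are illustrative only.

```python
# Sketch of the cake example: 9 Heretics, 1 Fanatic, 20 cake-button presses.
# Assumption: utilities are the linear forms from the comment, measured as
# changes from the status quo, with no torture-button presses.

N_HERETICS = 9
CAKE_PRESSES = 20


def heretic_gain(own_cake, own_torture=0):
    """Change in one Heretic's utility: cake received minus torture received."""
    return own_cake - own_torture


def fanatic_gain(own_cake, heretic_cake_total, heretic_torture_total=0):
    """Change in the Fanatic's utility: own cake plus Heretic torture minus Heretic cake."""
    return own_cake + heretic_torture_total - heretic_cake_total


# Proposed allocation: 11 pieces to the Fanatic, 1 piece to each Heretic.
fanatic_cake = 11
cake_per_heretic = 1
heretic_cake_total = cake_per_heretic * N_HERETICS

# The allocation uses exactly the 20 available presses.
assert fanatic_cake + heretic_cake_total == CAKE_PRESSES

gains = {
    "heretic": heretic_gain(cake_per_heretic),
    "fanatic": fanatic_gain(fanatic_cake, heretic_cake_total),
}

# Pareto improvement: nobody loses and at least one agent strictly gains.
is_pareto_improvement = (
    all(g >= 0 for g in gains.values()) and any(g > 0 for g in gains.values())
)
print(gains, is_pareto_improvement)
```

Under these assumptions each Heretic gains 1 and the Fanatic gains 11 - 9 = 2, so everyone ends up better off than under the status quo.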

I don't have a formal way of putting this, but as long as the potential gains from a negotiated agreement outweigh the extent to which agents desire to reduce each other's utility, there will be Pareto improvements available. It seems likely that we're in that situation, given that the fanatics and heretics of the world already trade with each other.

mr-hire on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

hello. What’s special about your response pattern? Try to explain early in your response.

Out of morbid curiosity, does it get this less often when the initial "hello" in this sentence is removed?

james-camacho on The quantum red pill or: They lied to you, we live in the (density) matrix

A couple things to add:

Since every invertible square matrix can be decomposed as , you don't actually need a unitary assumption. You can just say that after billions of years, all but the largest Z-matrices have died out.
There's another tie between statistics and quantum evolution called the Wick rotation. If you set $t = i\beta$, then $E[e^{(Z + iH)t}] = E[e^{-H\beta}]$, so the inverse-temperature is literally imaginary time! You can recover the Boltzmann distribution by looking at the expected number of particles in each state: $E[\langle n | e^{(Z + iH)t} | n \rangle] = e^{-\beta E_n}$, where $E_n$ is the $n$th eigenvalue (energy in the $n$th state).
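A sketch of the step behind the Boltzmann claim, setting the $Z$ term aside (an assumption made here for simplicity) and taking normalized eigenstates with $H | n \rangle = E_n | n \rangle$:

$$
e^{iHt}\Big|_{t = i\beta} = e^{iH(i\beta)} = e^{-\beta H},
\qquad
\langle n | e^{-\beta H} | n \rangle = e^{-\beta E_n},
\qquad
p_n = \frac{e^{-\beta E_n}}{\sum_m e^{-\beta E_m}}.
$$

Normalizing the diagonal weights $e^{-\beta E_n}$ gives the usual Boltzmann probabilities $p_n$.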

tamay on meemi's Shortform

Tamay from Epoch AI here.

We made a mistake in not being more transparent about OpenAI's involvement. We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible. Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access. We own this error and are committed to doing better in the future.

For future collaborations, we will strive to improve transparency wherever possible, ensuring contributors have clearer information about funding sources, data access, and usage purposes at the outset. While we did communicate that we received lab funding to some mathematicians, we didn't do this systematically and did not name the lab we worked with. This inconsistent communication was a mistake. We should have pushed harder for the ability to be transparent about this partnership from the start, particularly with the mathematicians creating the problems.

Getting permission to disclose OpenAI's involvement only around the o3 launch wasn't good enough. Our mathematicians deserved to know who might have access to their work. Even though we were contractually limited in what we could say, we should have made transparency with our contributors a non-negotiable part of our agreement with OpenAI.

Regarding training usage: We acknowledge that OpenAI does have access to the FrontierMath problems and solutions. However, we have a verbal agreement that these materials will not be used in model training.

Relevant OpenAI employees’ public communications have described FrontierMath as a 'strongly held out' evaluation set. While this public positioning aligns with our understanding, I would also emphasize more broadly that labs benefit greatly from having truly uncontaminated test sets.

OpenAI has also been fully supportive of our decision to maintain a separate, unseen holdout set—an extra safeguard to prevent overfitting and ensure accurate progress measurement. From day one, FrontierMath was conceived and presented as an evaluation tool, and we believe these arrangements reflect that purpose.

snikolenko on On the OpenAI Economic Blueprint

Thank you for the great content! One question: could you please give an example of one of the "fnords" you're referencing, or explain it in slightly more detail? I think I understand what you mean, but I've read the blueprint itself, and while the first few pages sound a bit grandstanding and use the word "America" way too much for my taste, I couldn't really see what you're talking about. If I hadn't read your comments, I'd write the tone off as "well, I guess that's how you're supposed to talk to politicians, what do I know".

Disclaimer: I'm not a native English speaker and have little experience with English (or American :) ) policy documents like this one. Perhaps an explanation would be helpful not just for me but also for other non-American readers.

daniel-tan on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

Sorry I didn’t really understand the question. What do you mean specifically?

daniel-tan on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

Language models learning to spell things out using the first letter of their generations could be one pathway towards subtle obfuscated reasoning (ie steganography).

Eg here an LM spells out HELLO; it would be concerning if LMs could spell out other things like “Deployment” or “Not deployment”.
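A minimal sketch of how an overseer might check for this kind of first-letter message. It assumes the hidden message is carried one letter per line of output; the function name and the sample text are invented for illustration and are not taken from the post.

```python
def first_letter_acrostic(text):
    """Return the uppercase first letter of each non-empty line in `text`."""
    letters = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:  # skip blank lines
            letters.append(stripped[0].upper())
    return "".join(letters)


# Invented sample output whose lines begin with H, E, L, L, O.
sample = """Happy to help with that.
Each step is explained below.
Let me start with the basics.
Look at the first example.
Once that is clear, move on."""

hidden = first_letter_acrostic(sample)
print(hidden)
```

A check like this only catches the simplest scheme; a message spread over every second word, or encoded in word choice, would need a different detector.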