LessWrong 2.0 Reader

DunCon @Lighthaven
Duncan Sabien (Deactivated) (Duncan_Sabien) · 2024-09-29T04:56:27.205Z · comments (0)

AXRP Episode 39 - Evan Hubinger on Model Organisms of Misalignment
DanielFilan · 2024-12-01T06:00:06.345Z · comments (0)

Trying to translate when people talk past each other
Kaj_Sotala · 2024-12-17T09:40:02.640Z · comments (12)

[link] Point of Failure: Semiconductor-Grade Quartz
Annapurna (jorge-velez) · 2024-09-30T15:57:40.495Z · comments (8)

Reflections on the Metastrategies Workshop
gw · 2024-10-24T18:30:46.255Z · comments (5)

D&D Sci Coliseum: Arena of Data
aphyer · 2024-10-18T22:02:54.305Z · comments (23)

Causal Undertow: A Work of Seed Fiction
Daniel Murfet (dmurfet) · 2024-12-08T21:41:48.132Z · comments (0)

Are we dropping the ball on Recommendation AIs?
Charbel-Raphaël (charbel-raphael-segerie) · 2024-10-23T17:48:00.000Z · comments (17)

[link] Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake
TurnTrout · 2024-11-19T18:36:20.721Z · comments (5)

[question] Implications of China's recession on AGI development?
Eric Neyman (UnexpectedValues) · 2024-09-28T01:12:36.443Z · answers+comments (3)

2025 Color Trends
sarahconstantin · 2024-10-07T21:20:03.962Z · comments (7)

Winners of the Essay competition on the Automation of Wisdom and Philosophy
owencb · 2024-10-28T17:10:04.272Z · comments (3)

How to use bright light to improve your life.
Nat Martin (nat-martin) · 2024-11-18T19:32:10.667Z · comments (10)

[Linkpost] Play with SAEs on Llama 3
Tom McGrath · 2024-09-25T22:35:44.824Z · comments (2)

ARENA 4.0 Impact Report
Chloe Li (chloe-li-1) · 2024-11-27T20:51:54.844Z · comments (3)

[question] Are You More Real If You're Really Forgetful?
Thane Ruthenis · 2024-11-24T19:30:55.233Z · answers+comments (25)

[link] FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Tamay · 2024-11-14T06:13:22.042Z · comments (0)

Signaling with Small Orange Diamonds
jefftk (jkaufman) · 2024-11-07T20:20:08.026Z · comments (1)

[link] College technical AI safety hackathon retrospective - Georgia Tech
yix (Yixiong Hao) · 2024-11-15T00:22:53.159Z · comments (2)

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane (ckkissane) · 2024-10-27T18:46:21.316Z · comments (4)

Monthly Roundup #23: October 2024
Zvi · 2024-10-16T13:50:05.869Z · comments (13)

Drug development costs can range over two orders of magnitude
rossry · 2024-11-03T23:13:17.685Z · comments (0)

[link] Characterizing stable regions in the residual stream of LLMs
Jett Janiak (jett) · 2024-09-26T13:44:58.792Z · comments (4)

Litigate-for-Impact: Preparing Legal Action against an AGI Frontier Lab Leader
Sonia Joseph (redhat) · 2024-12-07T21:42:29.038Z · comments (7)

AI Safety Camp 10
Robert Kralisch (nonmali-1) · 2024-10-26T11:08:09.887Z · comments (9)

Book Review: On the Edge: The Business
Zvi · 2024-09-25T12:20:06.230Z · comments (0)

0.202 Bits of Evidence In Favor of Futarchy
niplav · 2024-09-29T21:57:59.896Z · comments (0)

Analysis of Global AI Governance Strategies
Sammy Martin (SDM) · 2024-12-04T10:45:25.311Z · comments (10)

[link] An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation
hugofry · 2024-10-07T08:53:14.658Z · comments (0)

Compelling Villains and Coherent Values
Cole Wyeth (Amyr) · 2024-10-06T19:53:47.891Z · comments (4)

5 ways to improve CoT faithfulness
CBiddulph (caleb-biddulph) · 2024-10-05T20:17:12.637Z · comments (39)

[link] AISafety.info: What is the "natural abstractions hypothesis"?
Algon · 2024-10-05T12:31:14.195Z · comments (2)

Doing Research Part-Time is Great
casualphysicsenjoyer (hatta_afiq) · 2024-11-22T19:01:15.542Z · comments (7)

[link] Locally optimal psychology
Chipmonk · 2024-11-25T18:35:11.985Z · comments (7)

[link] A Percentage Model of a Person
Sable · 2024-10-12T17:55:07.560Z · comments (3)

The murderous shortcut: a toy model of instrumental convergence
Thomas Kwa (thomas-kwa) · 2024-10-02T06:48:06.787Z · comments (0)

OODA your OODA Loop
Raemon · 2024-10-11T00:50:48.119Z · comments (3)

COT Scaling implies slower takeoff speeds
Logan Zoellner (logan-zoellner) · 2024-09-28T16:20:00.320Z · comments (56)

Eye contact is effortless when you’re no longer emotionally blocked on it
Chipmonk · 2024-09-27T21:47:01.970Z · comments (24)

Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 2024-10-04T22:15:28.108Z · comments (5)

LASR Labs Spring 2025 applications are open!
Erin Robertson · 2024-10-04T13:44:20.524Z · comments (0)

[question] If I have some money, whom should I donate it to in order to reduce expected P(doom) the most?
KvmanThinking (avery-liu) · 2024-10-03T11:31:19.974Z · answers+comments (36)

[link] Review: Good Strategy, Bad Strategy
L Rudolf L (LRudL) · 2024-12-21T17:17:04.342Z · comments (0)

Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence
EuanMcLean (euanmclean) · 2024-10-29T12:16:18.448Z · comments (8)

[question] What are the most interesting / challenging evals (for humans) available?
Raemon · 2024-12-27T03:05:26.831Z · answers+comments (8)

[link] The Way According To Zvi
Sable · 2024-12-07T17:35:48.769Z · comments (0)

Is the Power Grid Sustainable?
jefftk (jkaufman) · 2024-10-26T02:30:06.612Z · comments (38)

A Matter of Taste
Zvi · 2024-12-18T17:50:07.201Z · comments (4)

[link] Big tech transitions are slow (with implications for AI)
jasoncrawford · 2024-10-24T14:25:06.873Z · comments (16)

Cross-context abduction: LLMs make inferences about procedural training data leveraging declarative facts in earlier training data
Sohaib Imran (sohaib-imran) · 2024-11-16T23:22:21.857Z · comments (11)


Recent comments

darmani on If all trade is voluntary, then what is "exploitation?"

Can you give some examples of something contained by my definition which you think shouldn't be considered exploitation?

tsvibt on The Field of AI Alignment: A Postmortem, and What To Do About It

The flinches aren't structureless particulars. Rather, they involve warping various perceptions. Those warped perceptions generalize a lot, causing other flaws to be hidden.

As a toy example, you could imagine someone attached to the idea of AI boxing. At first they say it's impossible to break out / trick you / know about the world / whatever. Then you convince them otherwise--that the AI can do RSI internally, and superhumanly solve computer hacking / protein folding / persuasion / etc. But they are attached to AI boxing. So they warp their perception, clamping "can an AI be very superhumanly capable" to "no". That clamping causes them to also not see the flaws in the plan "we'll deploy our AIs in a staged manner, see how they behave, and then recall them if they behave poorly", because they don't think RSI is feasible, they don't think extreme persuasion is feasible, etc.

A more real example is, say, people thinking of "structures for decision making", e.g. constitutions. You explain that these things, they are not reflectively stable. And now this person can't understand reflective stability in general, so they don't understand why steering vectors won't work, or why lesioning won't work, etc.

Another real but perhaps more controversial example: {detecting deception, retargeting the search, CoT monitoring, lesioning bad thoughts, basically anything using RL} all fail because creativity starts with illegible concomitants to legible reasoning.

(This post seems to be somewhat illegible, but if anyone wants to see more real examples of aspects of mind that people fail to remember, see https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html)

daniel-murfet on leogao's Shortform

It's a question of resolution. Just looking at things for vibes is a pretty good way of filtering wheat from chaff, but you don't give scarce resources like jobs or grants to every grain of wheat that comes along. When I sit on a hiring committee, the discussions around the table are usually some mix of status markers and people having done the hard work of reading papers more or less carefully (this consuming time in greater-than-linear proportion to distance from your own fields of expertise). Usually (unless nepotism is involved) someone who has done that homework can wield more power than they otherwise would at that table, because people respect strong arguments and understand that status markers aren't everything.

Still, at the end of the day, an Annals paper is an Annals paper. It's also true that to pass some of the early filters you need either (a) someone who speaks up strongly for you or (b) to pass the status marker tests.

I am sometimes in a position these days of trying to bridge the academic status system and the Berkeley-centric AI safety status system, e.g. by arguing to a high status mathematician that someone with illegible (to them) status is actually approximately equivalent in "worthiness of being paid attention to" to someone they know with legible status. Small increases in legibility can have outsize effects on how easy my life is in those conversations.

Otherwise it's entirely down to me putting social capital on the table ("you think I'm serious, I think this person is very serious"). I'm happy to do this and continue doing this, but it's not easily scalable, because it depends on my personal relationships.

christiankl on Why don't we currently have AI agents?

Which ones do you see as the top ones?

daniel-murfet on Cognitive Work and AI Safety: A Thermodynamic Perspective

To be clear, I am not arguing that evolution is an example of what I'm talking about. The analogy to thermodynamics in what I wrote is straightforwardly correct, no need to introduce KT-complexity and muddy the waters; what I am calling work is literally work.

cole-wyeth on The Engineering Argument Fallacy: Why Technological Success Doesn't Validate Physics

Somewhat interesting point, but perhaps could have been made much briefer.

The verbose writing style makes me wonder if an LLM was used in the writing process?

feepingcreature on Greedy-Advantage-Aware RLHF

Human: "Look, can't you just be normal about this?"

GAA-optimized agent: "Actually-"

Hm, I guess this wouldn't work if the agent still learns an internalized RL methodology? Or would it? Say we have a base model, not much need for GAA because it's just doing token pred. We go into some sort of (distilled?) RL-based CoT instruct tuning, GAA means it picks up abnormal rewards from the signal more slowly, i.e. it doesn't do the classic boat-spinning-in-circles thing (good test?), but if it internalizes RL at some point its mesaoptimizer wouldn't be so limited, and that's a general technique so GAA wouldn't prevent it? Still, seems like a good first line of defense.

jenniferrm on Acknowledging Background Information with P(Q|I)

Haha! I really hope I don't have to start running everything I write through a slave Mentat to avoid avoidable errors. What a deontic double bind that'd be <3

lsusr on Review: Planecrash

I really like your post. Good how-to manuals like yours are rare and precious.

jenniferrm on Acknowledging Background Information with P(Q|I)

My current "background I" (maybe not the one from 2017, but one I would tend to deploy here in 2024) includes something like: "Kolmogorov complexity is a cool ideal, but it is formally uncomputable in theory unless you have a halting oracle lying around in your cardboard box in your garage labeled Time Travel Stuff, and Solomonoff Induction is not tractably approximably sampled by extant techniques that aren't just highly skilled MCMC".