LessWrong 2.0 Reader

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane (ckkissane) · 2024-10-27T18:46:21.316Z · comments (1)

0.202 Bits of Evidence In Favor of Futarchy
niplav · 2024-09-29T21:57:59.896Z · comments (0)

Book Review: On the Edge: The Business
Zvi · 2024-09-25T12:20:06.230Z · comments (0)

[link] An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation
hugofry · 2024-10-07T08:53:14.658Z · comments (0)

[link] Characterizing stable regions in the residual stream of LLMs
Jett Janiak (jett) · 2024-09-26T13:44:58.792Z · comments (4)

[link] An AI Manhattan Project is Not Inevitable
Maxwell Tabarrok (maxwell-tabarrok) · 2024-07-06T16:42:35.920Z · comments (25)

[question] What progress have we made on automated auditing?
LawrenceC (LawChan) · 2024-07-06T01:49:43.714Z · answers+comments (1)

D&D.Sci: Whom Shall You Call?
abstractapplic · 2024-07-05T20:53:37.010Z · comments (6)

[link] FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Tamay · 2024-11-14T06:13:22.042Z · comments (0)

Dialogue on What It Means For Something to Have A Function/Purpose
johnswentworth · 2024-07-15T16:28:56.609Z · comments (5)

Compelling Villains and Coherent Values
Cole Wyeth (Amyr) · 2024-10-06T19:53:47.891Z · comments (4)

A New Class of Glitch Tokens - BPE Subtoken Artifacts (BSA)
Lao Mein (derpherpize) · 2024-09-20T13:13:26.181Z · comments (7)

Free Will and Dodging Anvils: AIXI Off-Policy
Cole Wyeth (Amyr) · 2024-08-29T22:42:24.485Z · comments (12)

The murderous shortcut: a toy model of instrumental convergence
Thomas Kwa (thomas-kwa) · 2024-10-02T06:48:06.787Z · comments (0)

[link] Twitter thread on AI takeover scenarios
Richard_Ngo (ricraz) · 2024-07-31T00:24:33.866Z · comments (0)

[link] I didn't have to avoid you; I was just insecure
Chipmonk · 2024-08-17T16:41:50.237Z · comments (7)

[link] A Percentage Model of a Person
Sable · 2024-10-12T17:55:07.560Z · comments (3)

[link] Turning 22 in the Pre-Apocalypse
testingthewaters · 2024-08-22T20:28:25.794Z · comments (14)

I'm creating a deep dive podcast episode about the original Leverage Research - would you like to take part?
spencerg · 2024-09-22T14:03:22.164Z · comments (2)

Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 2024-10-04T22:15:28.108Z · comments (5)

LASR Labs Spring 2025 applications are open!
Erin Robertson · 2024-10-04T13:44:20.524Z · comments (0)

OODA your OODA Loop
Raemon · 2024-10-11T00:50:48.119Z · comments (3)

Turning Your Back On Traffic
jefftk (jkaufman) · 2024-07-17T01:00:08.627Z · comments (7)

Glitch Token Catalog - (Almost) a Full Clear
Lao Mein (derpherpize) · 2024-09-21T12:22:16.403Z · comments (3)

AI Safety Camp 10
Robert Kralisch (nonmali-1) · 2024-10-26T11:08:09.887Z · comments (7)

COT Scaling implies slower takeoff speeds
Logan Zoellner (logan-zoellner) · 2024-09-28T16:20:00.320Z · comments (56)

Distinguish worst-case analysis from instrumental training-gaming
Olli Järviniemi (jarviniemi) · 2024-09-05T19:13:34.443Z · comments (0)

But Where do the Variables of my Causal Model come from?
Dalcy (Darcy) · 2024-08-09T22:07:57.395Z · comments (1)

Eye contact is effortless when you’re no longer emotionally blocked on it
Chipmonk · 2024-09-27T21:47:01.970Z · comments (24)

An anti-inductive sequence
Viliam · 2024-08-14T12:28:54.226Z · comments (10)

[question] What are your cruxes for imprecise probabilities / decision rules?
Anthony DiGiovanni (antimonyanthony) · 2024-07-31T15:42:27.057Z · answers+comments (29)

Debate: Is it ethical to work at AI capabilities companies?
Ben Pace (Benito) · 2024-08-14T00:18:38.846Z · comments (21)

[link] Shifting Headspaces - Transitional Beast-Mode
Jonathan Moregård (JonathanMoregard) · 2024-08-12T13:02:06.120Z · comments (9)

[link] UC Berkeley course on LLMs and ML Safety
Dan H (dan-hendrycks) · 2024-07-09T15:40:00.920Z · comments (1)

[link] Big tech transitions are slow (with implications for AI)
jasoncrawford · 2024-10-24T14:25:06.873Z · comments (16)

Is the Power Grid Sustainable?
jefftk (jkaufman) · 2024-10-26T02:30:06.612Z · comments (37)

AI #89: Trump Card
Zvi · 2024-11-07T16:30:05.684Z · comments (12)

Finding the Wisdom to Build Safe AI
Gordon Seidoh Worley (gworley) · 2024-07-04T19:04:16.089Z · comments (10)

We’re not as 3-Dimensional as We Think
silentbob · 2024-08-04T14:39:16.799Z · comments (16)

An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs
Jan Wehner · 2024-07-14T10:37:21.544Z · comments (5)

[link] List of Collective Intelligence Projects
Chipmonk · 2024-07-02T14:10:41.789Z · comments (9)

[link] Twitter thread on politics of AI safety
Richard_Ngo (ricraz) · 2024-07-31T00:00:34.298Z · comments (2)

Economics Roundup #2
Zvi · 2024-07-02T12:40:05.908Z · comments (5)

[link] On Fables and Nuanced Charts
Niko_McCarty (niko-2) · 2024-09-08T17:09:07.503Z · comments (2)

Book Review: On the Edge: The Gamblers
Zvi · 2024-09-24T11:50:06.065Z · comments (1)

Live Machinery: An Interface Design Philosophy for Wholesome AI Futures (Workshop @ EA Hotel!)
Sahil · 2024-11-01T17:24:09.957Z · comments (2)

Index of rationalist groups in the Bay Area July 2024
Lucie Philippon (lucie-philippon) · 2024-07-26T16:32:25.337Z · comments (10)

Video and transcript of presentation on Otherness and control in the age of AGI
Joe Carlsmith (joekc) · 2024-10-08T22:30:38.054Z · comments (1)

Empirical vs. Mathematical Joints of Nature
Elizabeth (pktechgirl) · 2024-06-26T01:55:22.858Z · comments (1)

Open Problems in AIXI Agent Foundations
Cole Wyeth (Amyr) · 2024-09-12T15:38:59.007Z · comments (2)

Recent comments

cubefox on Seven lessons I didn't learn from election day

Cities are very heavily Democratic, while rural areas are only moderately Republican.

I think this isn't compatible with both getting about equally many votes, because many more Americans live in cities than in rural areas:

In 2020, about 82.66 percent of the total population in the United States lived in cities and urban areas.

https://www.statista.com/statistics/269967/urbanization-in-the-united-states/
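
To make the arithmetic concrete, here is a rough sketch. The vote shares (70% Democratic in cities, 60% Republican in rural areas) are hypothetical stand-ins for "very heavily" and "only moderately"; only the 82.66% urban figure comes from the statistic quoted above. It also assumes a two-party vote, equal turnout everywhere, and that population share equals voter share.

```python
URBAN_SHARE = 0.8266  # from the quoted 2020 statistic
RURAL_SHARE = 1.0 - URBAN_SHARE


def national_dem_share(dem_in_cities: float, rep_in_rural: float) -> float:
    """National Democratic share under a two-party, equal-turnout assumption."""
    return URBAN_SHARE * dem_in_cities + RURAL_SHARE * (1.0 - rep_in_rural)


def city_dem_share_for_tie(rep_in_rural: float) -> float:
    """Democratic share in cities that would produce an exact 50/50 national split."""
    return (0.5 - RURAL_SHARE * (1.0 - rep_in_rural)) / URBAN_SHARE


# Hypothetical shares, chosen only for illustration.
dem_share = national_dem_share(dem_in_cities=0.70, rep_in_rural=0.60)
tie_share = city_dem_share_for_tie(rep_in_rural=0.60)

print(f"National Democratic share: {dem_share:.3f}")
print(f"City Democratic share needed for a tie: {tie_share:.3f}")
```

Under these assumptions the national split is far from even, and cities would only need to lean slightly Democratic to produce a tie.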

jbash on There Is a Solution to AI’s Existential Risk Problem

Fortunately, Nobel Laureate Geoffrey Hinton, Turing Award winner Yoshua Bengio, and many others have provided a piece of the solution. In a policy paper published in Science earlier this year, they recommended “if-then commitments”: commitments to be activated if and when red-line capabilities are found in frontier AI systems.

So race to the brink and hope you can actually stop when you get there?

Once the most powerful nations have signed this treaty, it is in their interest to verify each others’ compliance, and to make sure uncontrollable AI is not built elsewhere, either.

How, exactly?

viliam on The Early Christian Strategy

I suspect that to solve this puzzle, we would need more precise data. For example, the thing about martyrdom. Naively, it makes it sound like the early Christians were quite suicidal, which is amazing in itself, and also makes you wonder how they survived as a group.

But let's try to use numbers. What fraction of early Christians was actually willing to die for their faith? I have no idea, so just for the sake of a thought experiment, I propose a number... 1%. (No idea whether it is correct.)

Suddenly the fact that a religion which promises you an awesome afterlife can make 1% of its members die voluntarily does not feel so surprising. There are all kinds of crazy and otherwise vulnerable people out there. With enough peer pressure, you could probably start a cult where 1% of your members commit some kind of suicide even today. Only, the moment you actually did it, the media would describe you as a crazy murderous cult, and you would probably end up in jail. It would be difficult to keep recruiting members. I suppose Rome could have been different; for example, it may not have cared so much about the suicides of slaves. Also, "suicide by a (Roman) cop" is a non-central form of suicide; it does not make your group look like villains. And if you are actively gaining new members, losing 1% does not make much of a difference.

Also, I wonder how hard the Romans actually tried to eliminate Christians. I imagine that if someone had tried the same way Hitler tried to get rid of Jews, it would have been game over for Christianity. But if the level of persecution is more like "once in a while, we will take a high-status member, try to make them deny Jesus, and kill them if they refuse", that won't stop a group that meanwhile recruits a hundred new members. Also, this was ancient Rome: life was probably cheap, you could get killed for many different things or die of many different diseases, so perhaps the chance of being killed for your religion didn't increase the overall risk significantly if you were an average member.

avturchin on If I care about measure, choices have additional burden (+AI generated LW-comments)

But if I use a quantum coin to make a life choice, there will be splitting, right?

cubefox on Leon Lang's Shortform

It's not that "they" should be more precise, but that "we" would like to have more precise information.

We know pretty conclusively now from The Information and Bloomberg that for OpenAI, Google and Anthropic, new frontier base LLMs have yielded disappointing performance gains. The question is which of your possibilities caused this.

They do mention that the availability of high quality training data (text) is an issue, which suggests it's probably not your first bullet point.

tag on If I care about measure, choices have additional burden (+AI generated LW-comments)

Every quantum event splits the multiverse, so my measure should decline by 20 orders of magnitude every second.

There isn't the slightest evidence that irrevocable splitting (splitting into decoherent branches) occurs on that scale, and there is plenty of evidence, e.g. the existence of quantum computing, that it doesn't.

See

https://www.lesswrong.com/posts/wvGqjZEZoYnsS5xfn/any-evidence-or-reason-to-expect-a-multiverse-everett?commentId=o6RzrFRCiE5kr3xD4

adam_scholl on Untrusted smart models and trusted dumb models

I'm curious if "trusted" in this sense basically just means "aligned"—or like, the superset of that which also includes "unaligned yet too dumb to cause harm" and "unaligned yet prevented from causing harm"—or whether you mean something more specific? E.g., are you imagining that some powerful unconstrained systems are trusted yet unaligned, or vice versa?

jeremy-gillen on Thoughts after the Wolfram and Yudkowsky discussion

I get the feeling that I’m still missing the point somehow and that Yudkowsky would say we still have a big chance of doom if our algorithms were created by hand with programmers whose algorithms always did exactly what they intended even when combined with their other algorithms.

I would bet against Eliezer being pessimistic about this, if we are assuming the algorithms are deeply-understood enough that we are confident that we can iterate on building AGI. I think there's maybe a problem with the way Eliezer communicates that gives people the impression that he's a rock with "DOOM" written on it.

I think the pessimism comes from there being several currently-unsolved problems that get in the way of "deeply-understood enough". In principle it's possible to understand these problems and hand-build a safe and stable AGI; it just looks a lot easier to hand-build an AGI without understanding them all, and even easier than that to train an AGI without even thinking about them.

I call most of these "instability" problems: the AI might, for example, learn more, think more, or self-modify, and each of these can shift the context in a way that causes an imperfectly designed AI to pursue unintended goals.

Here are some descriptions of problems in that cluster: optimization daemons, ontology shifts, translating between our ontology and the AI's internal ontology in a way that generalizes, Pascal's mugging, reflectively stable preferences & decision algorithms, reflectively stable corrigibility, and correctly estimating future competence under different circumstances.

Some may be resolved by default along the way to understanding how to build AGI by hand, but it isn't clear. Some are kinda solved already in some contexts.

avturchin on If I care about measure, choices have additional burden (+AI generated LW-comments)

Wei · 3h

This post touches on several issues I've been thinking about since my early work on anthropic decision theory and UDT. Let me break this down:

1. The measure-decline problem is actually more general than just quantum mechanics. It appears in any situation where your decision algorithm gets instantiated multiple times, including classical copying, simulation, or indexical uncertainty. See my old posts on anthropic probabilities and probability-as-preference.

2. The "functional identity" argument being used here to dismiss certain types of splitting is problematic. What counts as "functionally identical" depends on your decision theory's level of grain. UDT1.1 would treat seemingly identical copies differently if they're in different computational states, while CDT might lump them together.

Some relevant questions that aren't addressed:

- How do we handle preference aggregation across different versions of yourself with different measures?
- Should we treat quantum branching differently from other forms of splitting? (I lean towards "no" these days)
- How does this interact with questions of personal identity continuity?
- What happens when we consider infinite branches? (This relates to my work on infinite ethics)

The real issue here isn't about measure per se, but about how to aggregate preferences across different instances of your decision algorithm. This connects to some open problems in decision theory:

1. The problem of preference aggregation across copies
2. How to handle logical uncertainty in the context of anthropics
3. Whether "caring about measure" can be coherently formalized

I explored some of these issues in my paper on UDT, but I now think the framework needs significant revision to handle these cases properly.

Stuart · 2h
> The problem of preference aggregation across copies

This seems key. Have you made any progress on formalizing this since your 2019 posts?

Wei · 2h
Some progress on the math, but still hitting fundamental issues with infinity. Might post about this soon.

Abram · 1h
Curious about your current thoughts on treating decision-theoretic identical copies differently. Seems related to logical causation?

Wei · 45m
Yes - this connects to some ideas about logical coordination I've been developing. The key insight is that even "identical" copies might have different logical roles...

[Edit: For those interested in following up, I recommend starting with my sequence on decision theory and anthropics, then moving to the more recent work on logical uncertainty.]

avturchin on If I care about measure, choices have additional burden (+AI generated LW-comments)

Vladimir_N · 3h

(This is a rather technical comment that attempts to clarify some decision-theoretic confusions.)

Your treatment of measure requires more formal specification. Let's be precise about what we mean by "caring about measure" in decision-theoretic terms.

Consider a formalization where we have:
1. A space of possible outcomes Ω
2. A measure μ on this space
3. A utility function U: Ω → ℝ
4. A decision function D that maps available choices to distributions over Ω

The issue isn't about "spending" measure, but about how we aggregate utility across branches. The standard formulation already handles this correctly through expected utility:

E[U] = ∫_Ω U(ω)dμ(ω)
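
As a minimal sketch of the formula above, assuming a finite outcome space so that the integral becomes a sum: the branch names, measures, and utilities below are hypothetical and chosen only for illustration. The sketch also checks that splitting a branch into finer sub-branches with the same utility leaves E[U] unchanged.

```python
from fractions import Fraction


def expected_utility(measure, utility):
    """Discrete analogue of E[U]: sum of U(w) * mu(w) over all outcomes w."""
    if sum(measure.values()) != 1:
        raise ValueError("measure must sum to 1")
    return sum(utility[w] * p for w, p in measure.items())


# Hypothetical two-branch world.
mu_coarse = {"a": Fraction(1, 4), "b": Fraction(3, 4)}
u_coarse = {"a": 8, "b": 4}

# The same world after branch "b" splits into two equal sub-branches
# with the same utility as before.
mu_fine = {"a": Fraction(1, 4), "b1": Fraction(3, 8), "b2": Fraction(3, 8)}
u_fine = {"a": 8, "b1": 4, "b2": 4}

eu_coarse = expected_utility(mu_coarse, u_coarse)
eu_fine = expected_utility(mu_fine, u_fine)
print(eu_coarse, eu_fine)
```

Each sub-branch carries less measure than its parent, yet the aggregate is the same, which is one way to read the point that μ and U play different roles.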

Your concern about "measure decline" seems to conflate the measure μ with the utility U. These are fundamentally different mathematical objects serving different purposes in the formalism.

If we try to modify this to "care about measure directly," we'd need something like:

U'(ω) = U(ω) * f(μ(ω))

But this leads to problematic decision-theoretic behavior, violating basic consistency requirements like dynamic consistency. It's not clear how to specify f in a way that doesn't lead to contradictions.
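
One simple way to see trouble with U', as a sketch only: this illustrates sensitivity to how finely outcomes are described, which is related to but not the same as the dynamic-consistency violation mentioned above. The choice f(μ) = μ, the finite outcome space, and all numbers are assumptions made for illustration.

```python
from fractions import Fraction


def modified_value(measure, utility, f):
    """Expectation of U'(w) = U(w) * f(mu(w)) under mu, on a finite outcome space."""
    return sum(utility[w] * f(p) * p for w, p in measure.items())


# Hypothetical world described coarsely, then with branch "b" split in two.
mu_coarse = {"a": Fraction(1, 4), "b": Fraction(3, 4)}
u_coarse = {"a": 8, "b": 4}
mu_fine = {"a": Fraction(1, 4), "b1": Fraction(3, 8), "b2": Fraction(3, 8)}
u_fine = {"a": 8, "b1": 4, "b2": 4}


def identity(p):
    return p      # assumed f: weight each outcome by its own measure


def constant_one(p):
    return 1      # f = 1 recovers ordinary expected utility


plain_coarse = modified_value(mu_coarse, u_coarse, constant_one)
plain_fine = modified_value(mu_fine, u_fine, constant_one)
weighted_coarse = modified_value(mu_coarse, u_coarse, identity)
weighted_fine = modified_value(mu_fine, u_fine, identity)
print(plain_coarse, plain_fine, weighted_coarse, weighted_fine)
```

With f = 1 the two descriptions agree, while with f(μ) = μ the value changes merely because the same situation was described with finer branches.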

The apparent paradox dissolves when we properly separate:
1. Measure as probability measure (μ)
2. Utility as preference ordering over outcomes (U)
3. Decision-theoretic aggregation (E[U])

[Technical note: This relates to my work on logical uncertainty and reflection principles. See my 2011 paper on decision theory in anthropic contexts.]

orthonormal · 2h
> U'(ω) = U(ω) * f(μ(ω))

This is a very clean way of showing why "caring about measure" leads to problems.

Vladimir_N · 2h
Yes, though there are even deeper issues with updateless treatment of anthropic measure that I haven't addressed here for brevity.

Wei_D · 1h
Interesting formalization. How would this handle cases where the agent's preferences include preferences over the measure itself?

Vladimir_N · 45m
That would require extending the outcome space Ω to include descriptions of measures, which brings additional technical complications...

[Note: This comment assumes familiarity with measure theory and decision theory fundamentals.]