LessWrong 2.0 Reader

← previous page (newer posts) · next page (older posts) →

Intent alignment as a stepping-stone to value alignment
Seth Herd · 2024-11-05T20:43:24.950Z · comments (4)

An argument that consequentialism is incomplete
cousin_it · 2024-10-07T09:45:12.754Z · comments (27)

RLHF is the worst possible thing done when facing the alignment problem
tailcalled · 2024-09-19T18:56:27.676Z · comments (10)

DunCon @Lighthaven
Duncan Sabien (Deactivated) (Duncan_Sabien) · 2024-09-29T04:56:27.205Z · comments (0)

[link] Concrete benefits of making predictions
Jonny Spicer (jonnyspicer) · 2024-10-17T14:23:17.613Z · comments (5)

Investigating the Ability of LLMs to Recognize Their Own Writing
Christopher Ackerman (christopher-ackerman) · 2024-07-30T15:41:44.017Z · comments (0)

[link] End Single Family Zoning by Overturning Euclid V Ambler
Maxwell Tabarrok (maxwell-tabarrok) · 2024-07-26T14:08:45.046Z · comments (1)

[question] How unusual is the fact that there is no AI monopoly?
Viliam · 2024-08-16T20:21:51.012Z · answers+comments (15)

[link] NAO Updates, Fall 2024
jefftk (jkaufman) · 2024-10-18T00:00:04.142Z · comments (2)

[link] What is it like to be psychologically healthy? Podcast ft. DaystarEld
Chipmonk · 2024-10-05T19:14:04.743Z · comments (8)

Apply to MATS 7.0!
Ryan Kidd (ryankidd44) · 2024-09-21T00:23:49.778Z · comments (0)

A more systematic case for inner misalignment
Richard_Ngo (ricraz) · 2024-07-20T05:03:03.500Z · comments (4)

Music in the AI World
Martin Sustrik (sustrik) · 2024-08-16T04:20:01.706Z · comments (8)

Incentive design and capability elicitation
Joe Carlsmith (joekc) · 2024-11-12T20:56:05.088Z · comments (0)

Open Thread Fall 2024
habryka (habryka4) · 2024-10-05T22:28:50.398Z · comments (110)

[question] When is reward ever the optimization target?
Noosphere89 (sharmake-farah) · 2024-10-15T15:09:20.912Z · answers+comments (12)

Meme Talking Points
ymeskhout · 2024-11-06T15:27:54.024Z · comments (0)

SAE Probing: What is it good for? Absolutely something!
Subhash Kantamneni (subhashk) · 2024-11-01T19:23:55.418Z · comments (0)

The slingshot helps with learning
Wilson Wu (wilson-wu) · 2024-10-31T23:18:16.762Z · comments (0)

Extracting SAE task features for in-context learning
Dmitrii Kharlapenko (dmitrii-kharlapenko) · 2024-08-12T20:34:13.747Z · comments (1)

Inference-Only Debate Experiments Using Math Problems
Arjun Panickssery (arjun-panickssery) · 2024-08-06T17:44:27.293Z · comments (0)

[link] Stone Age Herbalist's notes on ant warfare and slavery
trevor (TrevorWiesinger) · 2024-11-09T02:40:01.128Z · comments (0)

AI labs can boost external safety research
Zach Stein-Perlman · 2024-07-31T19:30:16.207Z · comments (1)

Book Review: What Even Is Gender?
Joey Marcellino · 2024-09-01T16:09:27.773Z · comments (14)

Context-dependent consequentialism
Jeremy Gillen (jeremy-gillen) · 2024-11-04T09:29:24.310Z · comments (6)

Bay Winter Solstice 2024: Speech Auditions
ozymandias · 2024-11-04T22:31:38.680Z · comments (0)

[LDSL#6] When is quantification needed, and when is it hard?
tailcalled · 2024-08-13T20:39:45.481Z · comments (0)

Balancing Label Quantity and Quality for Scalable Elicitation
Alex Mallen (alex-mallen) · 2024-10-24T16:49:00.939Z · comments (1)

[question] What's the Deal with Logical Uncertainty?
Ape in the coat · 2024-09-16T08:11:43.588Z · answers+comments (23)

Attention Output SAEs Improve Circuit Analysis
Connor Kissane (ckkissane) · 2024-06-21T12:56:07.969Z · comments (0)

[LDSL#1] Performance optimization as a metaphor for life
tailcalled · 2024-08-08T16:16:27.349Z · comments (4)

Resolving von Neumann-Morgenstern Inconsistent Preferences
niplav · 2024-10-22T11:45:20.915Z · comments (5)

[link] Epistemic states as a potential benign prior
Tamsin Leake (carado-1) · 2024-08-31T18:26:14.093Z · comments (2)

5 ways to improve CoT faithfulness
CBiddulph (caleb-biddulph) · 2024-10-05T20:17:12.637Z · comments (8)

AIS terminology proposal: standardize terms for probability ranges
eggsyntax · 2024-08-30T15:43:39.857Z · comments (12)

AI #85: AI Wins the Nobel Prize
Zvi · 2024-10-10T13:40:07.286Z · comments (6)

[link] Baking vs Patissing vs Cooking, the HPS explanation
adamShimi · 2024-07-17T20:29:09.645Z · comments (16)

[link] Safety tax functions
owencb · 2024-10-20T14:08:38.099Z · comments (0)

[question] What are things you're allowed to do as a startup?
Elizabeth (pktechgirl) · 2024-06-20T00:01:59.257Z · answers+comments (9)

"Full Automation" is a Slippery Metric
ozziegooen · 2024-06-11T19:56:49.855Z · comments (1)

Some comments on intelligence
Viliam · 2024-08-01T15:17:07.215Z · comments (5)

Fun With CellxGene
sarahconstantin · 2024-09-06T22:00:03.461Z · comments (2)

[link] [Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs
Yohan Mathew (ymath) · 2024-09-25T14:52:48.263Z · comments (2)

AI #74: GPT-4o Mini Me and Llama 3
Zvi · 2024-07-25T13:50:06.528Z · comments (6)

AI Constitutions are a tool to reduce societal scale risk
Sammy Martin (SDM) · 2024-07-25T11:18:17.826Z · comments (2)

Examples of How I Use LLMs
jefftk (jkaufman) · 2024-10-14T17:10:04.597Z · comments (2)

DPO/PPO-RLHF on LLMs incentivizes sycophancy, exaggeration and deceptive hallucination, but not misaligned powerseeking
tailcalled · 2024-06-10T21:20:11.938Z · comments (13)

Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence
EuanMcLean (euanmclean) · 2024-10-29T12:16:18.448Z · comments (7)

Paper Summary: Princes and Merchants: European City Growth Before the Industrial Revolution
Jeffrey Heninger (jeffrey-heninger) · 2024-07-15T21:30:04.043Z · comments (1)

[LDSL#4] Root cause analysis versus effect size estimation
tailcalled · 2024-08-11T16:12:14.604Z · comments (0)

Recent comments

cubefox on Seven lessons I didn't learn from election day

Cities are very heavily Democratic, while rural areas are only moderately Republican.

I don't think this is compatible with both parties getting about equally many votes, because far more Americans live in cities than in rural areas:

In 2020, about 82.66 percent of the total population in the United States lived in cities and urban areas.

https://www.statista.com/statistics/269967/urbanization-in-the-united-states/
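The arithmetic behind this objection can be sketched as follows. This is only a rough illustration: it assumes equal per-capita turnout in urban and rural areas, a pure two-party vote, and that the Statista "cities and urban areas" figure matches what the post means by "cities". The function name `required_city_share` and the sample rural shares are hypothetical, chosen for illustration.

```python
def required_city_share(urban_fraction, rural_dem_share, overall_dem_share=0.5):
    """Democratic share needed in cities for the overall share to hit the target.

    Solves urban_fraction * city + (1 - urban_fraction) * rural = overall
    for `city`. Assumes equal turnout per capita in both kinds of area.
    """
    rural_fraction = 1.0 - urban_fraction
    return (overall_dem_share - rural_fraction * rural_dem_share) / urban_fraction


urban_fraction = 0.8266  # the Statista figure quoted above

# Hypothetical rural Democratic shares, from mildly to strongly Republican.
for rural_dem_share in (0.45, 0.40, 0.30):
    city = required_city_share(urban_fraction, rural_dem_share)
    print(f"rural Dem share {rural_dem_share:.0%} -> "
          f"required city Dem share {city:.1%}")
```

Under these assumptions, an even national split requires cities to be only slightly Democratic (roughly 51% to 54% across the sample values), which is the tension with "very heavily Democratic" that the comment points at.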

jbash on There Is a Solution to AI’s Existential Risk Problem

Fortunately, Nobel Laureate Geoffrey Hinton, Turing Award winner Yoshua Bengio, and many others have provided a piece of the solution. In a policy paper published in Science earlier this year, they recommended “if-then commitments”: commitments to be activated if and when red-line capabilities are found in frontier AI systems.

So race to the brink and hope you can actually stop when you get there?

Once the most powerful nations have signed this treaty, it is in their interest to verify each others’ compliance, and to make sure uncontrollable AI is not built elsewhere, either.

How, exactly?

viliam on The Early Christian Strategy

I suspect that to solve this puzzle, we would need more precise data. For example, the thing about martyrdom. Naively, it makes it sound like the early Christians were quite suicidal, which is amazing in itself, and also makes you wonder how they survived as a group.

But let's try to use numbers. What fraction of early Christians was actually willing to die for their faith? I have no idea, so just for the sake of a thought experiment, I propose a number... 1%. (No idea whether it is correct.)

Suddenly, the fact that a religion which promises an awesome afterlife can make 1% of its members die voluntarily does not feel so surprising. There are all kinds of crazy and otherwise vulnerable people out there. With enough peer pressure, you could probably start a cult where 1% of your members commit some kind of suicide even today. Only, the moment you actually did it, the media would describe you as a crazy murderous cult, and you would probably end up in jail. It would be difficult to keep recruiting members. I suppose Rome could have been different; for example, it may not have cared so much about the suicides of slaves. Also, "suicide by (Roman) cop" is a non-central form of suicide; it does not make your group look like villains. And if you are actively gaining new members, losing 1% does not make much of a difference.

Also, I wonder how hard the Romans actually tried to eliminate Christians. I imagine that if someone had tried the way Hitler tried to get rid of Jews, it would have been game over for Christianity. But if the level of persecution is more like "once in a while, we will take a high-status member, try to make them deny Jesus, and kill them if they refuse", that won't stop a group that meanwhile recruits a hundred new members. Also, this was ancient Rome: life was probably cheap, and you could get killed for many different things or die of many different diseases, so perhaps the chance of being killed for your religion didn't significantly increase the overall risk for an average member.

avturchin on If I care about measure, choices have additional burden (+AI generated LW-comments)

But if I use a quantum coin to make a life choice, there will be splitting, right?

cubefox on Leon Lang's Shortform

It's not that "they" should be more precise, but that "we" would like to have more precise information.

We know pretty conclusively now from The Information and Bloomberg that for OpenAI, Google and Anthropic, new frontier base LLMs have yielded disappointing performance gains. The question is which of your possibilities caused this.

They do mention that the availability of high quality training data (text) is an issue, which suggests it's probably not your first bullet point.

tag on If I care about measure, choices have additional burden (+AI generated LW-comments)

Every quantum event splits the multiverse, so my measure should decline by 20 orders of magnitude every second.

There isn't the slightest evidence that irrevocable splitting, i.e. splitting into decoherent branches, occurs on that scale, and there is plenty of evidence (e.g. the existence of quantum computing) that it doesn't.

See

https://www.lesswrong.com/posts/wvGqjZEZoYnsS5xfn/any-evidence-or-reason-to-expect-a-multiverse-everett?commentId=o6RzrFRCiE5kr3xD4

adam_scholl on Untrusted smart models and trusted dumb models

I'm curious if "trusted" in this sense basically just means "aligned"—or like, the superset of that which also includes "unaligned yet too dumb to cause harm" and "unaligned yet prevented from causing harm"—or whether you mean something more specific? E.g., are you imagining that some powerful unconstrained systems are trusted yet unaligned, or vice versa?

jeremy-gillen on Thoughts after the Wolfram and Yudkowsky discussion

I get the feeling that I’m still missing the point somehow and that Yudkowsky would say we still have a big chance of doom if our algorithms were created by hand with programmers whose algorithms always did exactly what they intended even when combined with their other algorithms.

I would bet against Eliezer being pessimistic about this, if we are assuming the algorithms are deeply-understood enough that we are confident that we can iterate on building AGI. I think there's maybe a problem with the way Eliezer communicates that gives people the impression that he's a rock with "DOOM" written on it.

I think the pessimism comes from there being several currently-unsolved problems that get in the way of "deeply-understood enough". In principle it's possible to understand these problems and hand-build a safe and stable AGI; it just looks a lot easier to hand-build an AGI without understanding them all, and even easier than that to train an AGI without even thinking about them.

I call most of these "instability" problems: the AI might, for example, learn more, think more, or self-modify, and each of these can shift the context in a way that causes an imperfectly designed AI to pursue unintended goals.

Here are some descriptions of problems in that cluster: optimization daemons, ontology shifts, translating between our ontology and the AI's internal ontology in a way that generalizes, Pascal's mugging, reflectively stable preferences & decision algorithms, reflectively stable corrigibility, and correctly estimating future competence under different circumstances.

Some may be resolved by default along the way to understanding how to build AGI by hand, but it isn't clear. Some are kinda solved already in some contexts.

avturchin on If I care about measure, choices have additional burden (+AI generated LW-comments)

Wei · 3h

This post touches on several issues I've been thinking about since my early work on anthropic decision theory and UDT. Let me break this down:

1. The measure-decline problem is actually more general than just quantum mechanics. It appears in any situation where your decision algorithm gets instantiated multiple times, including classical copying, simulation, or indexical uncertainty. See my old posts on anthropic probabilities and probability-as-preference.

2. The "functional identity" argument being used here to dismiss certain types of splitting is problematic. What counts as "functionally identical" depends on your decision theory's level of grain. UDT1.1 would treat seemingly identical copies differently if they're in different computational states, while CDT might lump them together.

Some relevant questions that aren't addressed:

- How do we handle preference aggregation across different versions of yourself with different measures?
- Should we treat quantum branching differently from other forms of splitting? (I lean towards "no" these days)
- How does this interact with questions of personal identity continuity?
- What happens when we consider infinite branches? (This relates to my work on infinite ethics)

The real issue here isn't about measure per se, but about how to aggregate preferences across different instances of your decision algorithm. This connects to some open problems in decision theory:

1. The problem of preference aggregation across copies
2. How to handle logical uncertainty in the context of anthropics
3. Whether "caring about measure" can be coherently formalized

I explored some of these issues in my paper on UDT, but I now think the framework needs significant revision to handle these cases properly.

Stuart · 2h
> The problem of preference aggregation across copies

This seems key. Have you made any progress on formalizing this since your 2019 posts?

Wei · 2h
Some progress on the math, but still hitting fundamental issues with infinity. Might post about this soon.

Abram · 1h
Curious about your current thoughts on treating decision-theoretically identical copies differently. Seems related to logical causation?

Wei · 45m
Yes - this connects to some ideas about logical coordination I've been developing. The key insight is that even "identical" copies might have different logical roles...

[Edit: For those interested in following up, I recommend starting with my sequence on decision theory and anthropics, then moving to the more recent work on logical uncertainty.]

avturchin on If I care about measure, choices have additional burden (+AI generated LW-comments)

Vladimir_N · 3h

(This is a rather technical comment that attempts to clarify some decision-theoretic confusions.)

Your treatment of measure requires more formal specification. Let's be precise about what we mean by "caring about measure" in decision-theoretic terms.

Consider a formalization where we have:
1. A space of possible outcomes Ω
2. A measure μ on this space
3. A utility function U: Ω → ℝ
4. A decision function D that maps available choices to distributions over Ω

The issue isn't about "spending" measure, but about how we aggregate utility across branches. The standard formulation already handles this correctly through expected utility:

E[U] = ∫_Ω U(ω)dμ(ω)
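For a finite outcome space the integral above reduces to a weighted sum, which can be sketched as follows. This is a minimal illustration and not part of the original comment: the outcome labels, the numbers, and the helper names `expected_utility` and `best_choice` are all hypothetical, and the decision function D is modeled simply as a dictionary from choices to discrete measures over Ω.

```python
def expected_utility(measure, utility):
    """Discrete analogue of E[U] = integral of U(w) dmu(w) over Omega.

    `measure` maps outcomes to probabilities (must sum to 1);
    `utility` maps outcomes to real numbers.
    """
    total = sum(measure.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"measure must sum to 1, got {total}")
    return sum(p * utility[outcome] for outcome, p in measure.items())


def best_choice(decision_function, utility):
    """Pick the choice whose induced distribution maximizes expected utility."""
    return max(decision_function,
               key=lambda choice: expected_utility(decision_function[choice], utility))


# Hypothetical outcome space Omega = {"a", "b", "c"} with utility function U.
U = {"a": 1.0, "b": 0.0, "c": 5.0}

# Hypothetical decision function D: each choice induces a measure over Omega.
D = {
    "safe": {"a": 1.0},
    "risky": {"b": 0.7, "c": 0.3},
}

for choice, mu in D.items():
    print(choice, expected_utility(mu, U))
print("chosen:", best_choice(D, U))
```

Note that the measure only ever appears as a weight on U here; nothing in the aggregation treats the measure itself as something to be "spent", which is the separation the comment argues for.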

Your concern about "measure decline" seems to conflate the measure μ with the utility U. These are fundamentally different mathematical objects serving different purposes in the formalism.

If we try to modify this to "care about measure directly," we'd need something like:

U'(ω) = U(ω) * f(μ(ω))

But this leads to problematic decision-theoretic behavior, violating basic consistency requirements like dynamic consistency. It's not clear how to specify f in a way that doesn't lead to contradictions.

The apparent paradox dissolves when we properly separate:
1. Measure as probability measure (μ)
2. Utility as preference ordering over outcomes (U)
3. Decision-theoretic aggregation (E[U])

[Technical note: This relates to my work on logical uncertainty and reflection principles. See my 2011 paper on decision theory in anthropic contexts.]

orthonormal · 2h
> U'(ω) = U(ω) * f(μ(ω))

This is a very clean way of showing why "caring about measure" leads to problems.

Vladimir_N · 2h
Yes, though there are even deeper issues with updateless treatment of anthropic measure that I haven't addressed here for brevity.

Wei_D · 1h
Interesting formalization. How would this handle cases where the agent's preferences include preferences over the measure itself?

Vladimir_N · 45m
That would require extending the outcome space Ω to include descriptions of measures, which brings additional technical complications...

[Note: This comment assumes familiarity with measure theory and decision theory fundamentals.]