LessWrong 2.0 Reader

Wrong answer bias
lukehmiles (lcmgcd) · 2024-02-01T20:05:38.573Z · comments (24)
[link] On scalable oversight with weak LLMs judging strong LLMs
zac_kenton (zkenton) · 2024-07-08T08:59:58.523Z · comments (18)
Low Probability Estimation in Language Models
Gabriel Wu (gabriel-wu) · 2024-10-18T15:50:05.947Z · comments (0)
[link] The Evals Gap
Marius Hobbhahn (marius-hobbhahn) · 2024-11-11T16:42:46.287Z · comments (7)
[link] Evaluating Stability of Unreflective Alignment
james.lucassen · 2024-02-01T22:15:40.902Z · comments (10)
[link] in defense of Linus Pauling
bhauth · 2024-06-03T21:27:43.962Z · comments (8)
An issue with training schemers with supervised fine-tuning
Fabien Roger (Fabien) · 2024-06-27T15:37:56.020Z · comments (12)
Notes on control evaluations for safety cases
ryan_greenblatt · 2024-02-28T16:15:17.799Z · comments (0)
[LDSL#0] Some epistemological conundrums
tailcalled · 2024-08-07T19:52:55.688Z · comments (10)
Interoperable High Level Structures: Early Thoughts on Adjectives
johnswentworth · 2024-08-22T21:12:38.223Z · comments (1)
The Broken Screwdriver and other parables
bhauth · 2024-03-04T03:34:38.807Z · comments (1)
AI #67: Brief Strange Trip
Zvi · 2024-06-06T18:50:03.514Z · comments (6)
So you want to work on technical AI safety
gw · 2024-06-24T14:29:57.481Z · comments (3)
Bounty: Diverse hard tasks for LLM agents
Beth Barnes (beth-barnes) · 2023-12-17T01:04:05.460Z · comments (31)
[link] Anthropic announces interpretability advances. How much does this advance alignment?
Seth Herd · 2024-05-21T22:30:52.638Z · comments (4)
[link] DM Parenting
Shoshannah Tekofsky (DarkSym) · 2024-07-16T08:50:08.144Z · comments (4)
Why the Best Writers Endure Isolation
Declan Molony (declan-molony) · 2024-07-16T05:58:25.032Z · comments (6)
Interested in Cognitive Bootcamp?
Raemon · 2024-09-19T22:12:13.348Z · comments (0)
The Mom Test: Summary and Thoughts
Adam Zerner (adamzerner) · 2024-04-18T03:34:21.020Z · comments (3)
Evaluating the truth of statements in a world of ambiguous language.
Hastings (hastings-greer) · 2024-10-07T18:08:09.920Z · comments (19)
“Why can’t you just turn it off?”
Roko · 2023-11-19T14:46:18.427Z · comments (25)
[link] Active Recall and Spaced Repetition are Different Things
Saul Munn (saul-munn) · 2024-11-08T20:14:56.092Z · comments (2)
SRE's review of Democracy
Martin Sustrik (sustrik) · 2024-08-03T07:20:01.483Z · comments (2)
Highlights from Lex Fridman’s interview of Yann LeCun
Joel Burget (joel-burget) · 2024-03-13T20:58:13.052Z · comments (15)
[link] Book review: Xenosystems
jessicata (jessica.liu.taylor) · 2024-09-16T20:17:56.670Z · comments (18)
AISC 2024 - Project Summaries
NickyP (Nicky) · 2023-11-27T22:32:23.555Z · comments (3)
[question] If I wanted to spend WAY more on AI, what would I spend it on?
Logan Zoellner (logan-zoellner) · 2024-09-15T21:24:46.742Z · answers+comments (16)
AI and the Technological Richter Scale
Zvi · 2024-09-04T14:00:08.625Z · comments (8)
An alternative approach to superbabies
Towards_Keeperhood (Simon Skade) · 2024-11-05T22:56:15.740Z · comments (19)
On the lethality of biased human reward ratings
Eli Tyre (elityre) · 2023-11-17T18:59:02.303Z · comments (10)
[link] Designing for a single purpose
Itay Dreyfus (itay-dreyfus) · 2024-05-07T14:11:22.242Z · comments (12)
Exercise: Planmaking, Surprise Anticipation, and "Baba is You"
Raemon · 2024-02-24T20:33:49.574Z · comments (19)
D&D.Sci(-fi): Colonizing the SuperHyperSphere
abstractapplic · 2024-01-12T23:36:54.248Z · comments (23)
[link] Every Mention of EA in "Going Infinite"
KirstenH · 2023-10-07T14:42:32.217Z · comments (0)
What is the next level of rationality?
lsusr · 2023-12-12T08:14:14.846Z · comments (24)
Safety First: safety before full alignment. The deontic sufficiency hypothesis.
Chipmonk · 2024-01-03T17:55:19.825Z · comments (3)
Experiments as a Third Alternative
Adam Zerner (adamzerner) · 2023-10-29T00:39:31.399Z · comments (21)
The Handbook of Rationality (2021, MIT press) is now open access
romeostevensit · 2023-10-10T00:30:05.589Z · comments (4)
Competitive, Cooperative, and Cohabitive
Screwtape · 2023-09-28T23:25:52.723Z · comments (12)
[link] Contra Acemoglu on AI
Maxwell Tabarrok (maxwell-tabarrok) · 2024-06-28T13:13:15.796Z · comments (0)
[link] Spaced repetition for teaching two-year olds how to read (Interview)
Chipmonk · 2023-11-26T16:52:58.412Z · comments (9)
Making Bad Decisions On Purpose
Screwtape · 2023-11-09T03:36:59.611Z · comments (8)
D&D.Sci: The Mad Tyrant's Pet Turtles [Evaluation and Ruleset]
abstractapplic · 2024-04-09T14:01:34.426Z · comments (6)
[link] Urging an International AI Treaty: An Open Letter
Olli Järviniemi (jarviniemi) · 2023-10-31T11:26:25.864Z · comments (2)
Translations Should Invert
abramdemski · 2023-10-05T17:44:23.262Z · comments (19)
Philosophers wrestling with evil, as a social media feed
David Gross (David_Gross) · 2024-06-03T22:25:22.507Z · comments (2)
Mechanistic Interpretability Workshop Happening at ICML 2024!
Neel Nanda (neel-nanda-1) · 2024-05-03T01:18:26.936Z · comments (6)
Misnaming and Other Issues with OpenAI's “Human Level” Superintelligence Hierarchy
Davidmanheim · 2024-07-15T05:50:17.770Z · comments (2)
[link] JumpReLU SAEs + Early Access to Gemma 2 SAEs
Senthooran Rajamanoharan (SenR) · 2024-07-19T16:10:54.664Z · comments (10)
[link] Web-surfing tips for strange times
eukaryote · 2024-05-31T07:10:25.805Z · comments (19)