LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

AI #40: A Vision from Vitalik
Zvi · 2023-11-30T17:30:08.350Z · comments (12)

AI #37: Moving Too Fast
Zvi · 2023-11-09T17:50:04.324Z · comments (5)

[link] Breaking Circuit Breakers
mikes · 2024-07-14T18:57:20.251Z · comments (13)

AMA: Earning to Give
jefftk (jkaufman) · 2023-11-07T16:20:10.972Z · comments (8)

A D&D.Sci Dodecalogue
abstractapplic · 2024-04-12T01:10:01.625Z · comments (0)

Reflections on my first year of AI safety research
Jay Bailey · 2024-01-08T07:49:08.147Z · comments (3)

The case for stopping AI safety research
catubc (cat-1) · 2024-05-23T15:55:18.713Z · comments (38)

Can we build a better Public Doublecrux?
Raemon · 2024-05-11T19:21:53.326Z · comments (6)

[link] OpenAI Staff (including Sutskever) Threaten to Quit Unless Board Resigns
Seth Herd · 2023-11-20T14:20:33.539Z · comments (28)

BatchTopK: A Simple Improvement for TopK-SAEs
Bart Bussmann (Stuckwork) · 2024-07-20T02:20:51.848Z · comments (0)

Pseudonymity and Accusations
jefftk (jkaufman) · 2023-12-21T19:20:19.944Z · comments (20)

AI #43: Functional Discoveries
Zvi · 2023-12-21T15:50:04.442Z · comments (26)

AI #86: Just Think of the Potential
Zvi · 2024-10-17T15:10:06.552Z · comments (6)

AI #45: To Be Determined
Zvi · 2024-01-04T15:00:05.936Z · comments (4)

The Geometry of Feelings and Nonsense in Large Language Models
7vik (satvik-golechha) · 2024-09-27T17:49:27.420Z · comments (10)

Anthropical Paradoxes are Paradoxes of Probability Theory
Ape in the coat · 2023-12-06T08:16:26.846Z · comments (18)

Schelling points in the AGI policy space
mesaoptimizer · 2024-06-26T13:19:25.186Z · comments (2)

Announcing the Double Crux Bot
sanyer (santeri-koivula) · 2024-01-09T18:54:15.361Z · comments (8)

Was Releasing Claude-3 Net-Negative?
Logan Riggs (elriggs) · 2024-03-27T17:41:56.245Z · comments (5)

Aligned AI is dual use technology
lc · 2024-01-27T06:50:10.435Z · comments (31)

Gradient Descent on the Human Brain
Jozdien · 2024-04-01T22:39:24.862Z · comments (5)

The Stochastic Parrot Hypothesis is debatable for the last generation of LLMs
Quentin FEUILLADE--MONTIXI (quentin-feuillade-montixi) · 2023-11-07T16:12:20.031Z · comments (20)

[link] Bed Time Quests & Dinner Games for 3-5 year olds
Gunnar_Zarncke · 2024-06-22T07:53:38.989Z · comments (0)

How to Give in to Threats (without incentivizing them)
Mikhail Samin (mikhail-samin) · 2024-09-12T15:55:50.384Z · comments (25)

Will 2024 be very hot? Should we be worried?
A.H. (AlfredHarwood) · 2023-12-29T11:22:50.200Z · comments (12)

Spatial attention as a “tell” for empathetic simulation?
Steven Byrnes (steve2152) · 2024-04-26T15:10:58.040Z · comments (12)

Reformative Hypocrisy, and Paying Close Enough Attention to Selectively Reward It.
Andrew_Critch · 2024-09-11T04:41:24.872Z · comments (7)

The Assumed Intent Bias
silentbob · 2023-11-05T16:28:03.282Z · comments (13)

Cooperating with aliens and AGIs: An ECL explainer
Chi Nguyen · 2024-02-24T22:58:47.345Z · comments (8)

On OpenAI’s Preparedness Framework
Zvi · 2023-12-21T14:00:05.144Z · comments (4)

Two LessWrong speed friending experiments
mikko (morrel) · 2024-06-15T10:52:26.081Z · comments (3)

[link] Anthropic's updated Responsible Scaling Policy
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-10-15T16:46:48.727Z · comments (3)

[link] Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Gunnar_Zarncke · 2024-05-16T13:09:39.265Z · comments (20)

The Shutdown Problem: Incomplete Preferences as a Solution
EJT (ElliottThornley) · 2024-02-23T16:01:16.378Z · comments (22)

Rewilding the Gut VS the Autoimmune Epidemic
GGD · 2024-08-16T18:00:46.239Z · comments (0)

Parental Writing Selection Bias
jefftk (jkaufman) · 2024-10-13T14:00:03.225Z · comments (3)

Provably Safe AI: Worldview and Projects
bgold · 2024-08-09T23:21:02.763Z · comments (43)

Llama Llama-3-405B?
Zvi · 2024-07-24T19:40:07.565Z · comments (9)

Polysemantic Attention Head in a 4-Layer Transformer
Jett Janiak (jett) · 2023-11-09T16:16:35.132Z · comments (0)

D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues Evaluation & Ruleset
aphyer · 2024-06-17T21:29:08.778Z · comments (11)

Does literacy remove your ability to be a bard as good as Homer?
Adrià Garriga-alonso (rhaps0dy) · 2024-01-18T03:43:14.994Z · comments (19)

Model evals for dangerous capabilities
Zach Stein-Perlman · 2024-09-23T11:00:00.866Z · comments (9)

[link] The Good Balsamic Vinegar
jenn (pixx) · 2024-01-26T19:30:57.435Z · comments (4)

[link] how birds sense magnetic fields
bhauth · 2024-06-27T18:59:35.075Z · comments (4)

On Complexity Science
Garrett Baker (D0TheMath) · 2024-04-05T02:24:32.039Z · comments (19)

Paper in Science: Managing extreme AI risks amid rapid progress
JanB (JanBrauner) · 2024-05-23T08:40:40.678Z · comments (2)

On Overhangs and Technological Change
Roko · 2023-11-05T22:58:51.306Z · comments (19)

AI #82: The Governor Ponders
Zvi · 2024-09-19T13:30:04.863Z · comments (8)

Transfer learning and generalization-qua-capability in Babbage and Davinci (or, why division is better than Spanish)
RP (Complex Bubble Tea) · 2024-02-09T07:00:45.825Z · comments (6)

Vipassana Meditation and Active Inference: A Framework for Understanding Suffering and its Cessation
Benjamin Sturgeon (benjamin-sturgeon) · 2024-03-21T12:32:22.475Z · comments (8)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

philgoetz on Eliezer Yudkowsky Is Frequently, Confidently, Egregiously Wrong

But there is no possible world with a perfect predictor. The Newcomb paradox hinges on a hypothetical situation with zero probability, so all reason based on it is inadmissible.

yams on yams's Shortform

Many MATS scholars go to Anthropic (source: I work there).

Redwood I’m really not sure, but that could be right.

Sam now works at Anthropic.

Palisade: I’ve done some work for them, I love them, I don’t know that their projects so far inhibit Anthropic (BadLlama, which I’m decently confident was part of the cause for funding them, was pretty squarely targeted at Meta, and is their most impactful work to date by several OOM). In fact, the softer versions of Palisade’s proposal (highlighting misuse risk, their core mission), likely empower Anthropic as seemingly the most transparent lab re misuse risks.

I take the thrust of your comment to be “OP funds safety, do your research”. I work in safety; I know they fund safety.

I also know most safety projects differentially benefit Anthropic (this fact is independent of whether you think differentially benefiting Anthropic is good or bad).

If you can make a stronger case for any of the other of the dozens of orgs on your list than exists for the few above, I’d love to hear it. I’ve thought about most of them and don’t see it, hence why I asked the question.

Further: the goalpost is not ‘net positive with respect to TAI x-risk.’ It is ‘not plausibly a component of a meta-strategy targeting the development of TAI at Anthropic before other labs.’

Edit: use of the soldier mindset flag above is pretty uncharitable here; I am asking for counter-examples to a hypothesis I’m entertaining. This is the actual opposite of soldier mindset.

matthew-barnett on The Hidden Complexity of Wishes

While the term "outer alignment" wasn’t coined until later to describe the exact issue that I'm talking about, I was using that term purely as a descriptive label for the problem this post clearly highlights, rather than implying that you were using or aware of the term in 2007.

Because I was simply using "outer alignment" in this descriptive sense, I reject the notion that my comment was anachronistic. I used that term as shorthand for the thing I was talking about, which is clearly and obviously portrayed by your post, that's all.

To be very clear: the exact problem I am talking about is the inherent challenge of precisely defining what you want or intend, especially (though not exclusively) in the context of designing a utility function. The difficulty arises because, when the desired outcome is complex, it becomes nearly impossible to perfectly delineate between all potential 'good' scenarios and all possible 'bad' scenarios. This challenge has been a recurring theme in discussions of alignment, as it's considered hard to capture every nuance of what you want in your specification without missing an edge case.

It is frankly frustrating to me that, from my perspective, you seem to have reliably missed the point of what I am trying to convey here.

I only brought up Christiano-style proposals because I thought you were changing the topic to a broader discussion, specifically to ask me what methodologies I had in mind when I made particular points. If you had not asked me "So would you care to spell out what clever methodology you think invalidates what you take to be the larger point of this post -- though of course it has no bearing on the actual point that this post makes?" then I would not have mentioned those things. In any case, none of the things I said about Christiano-style proposals were intended to critique this post's narrow point. I was responding to that particular part of your comment instead.

As far as the actual content of this post, I do not dispute its exact thesis. The post seems to be a parable, not a detailed argument with a clear conclusion. The parable seems interesting to me. It also doesn't seem wrong, in any strict sense. However, I do think that some of the broader conclusions that many people have drawn from the parable seem false, in context. I was responding to the specific way that this post had been applied and interpreted in broader arguments about AI alignment.

My central thesis in regards to this post is simply: the post clearly portrays a specific problem that was later called the "outer alignment" problem by other people. This post portrays this problem as being difficult in a particular way. And I think this portrayal is misleading, even if the literal parable holds up in pure isolation.

lsusr on What's a good book for a technically-minded 11-year old?

Besides abstractapplic's excellent answer [LW(p) · GW(p)],

A Brief History of Time and The Universe in a Nutshell by Stephen Hawking
Ender's Game by Orson Scott Card
Foundation by Isaac Asimov
The Martian by Andy Weir
Paleontology: A Brief History of Life by Ian Tattersall
Richard Feynmann's books

radford-neal-1 on Change My Mind: Thirders in "Sleeping Beauty" are Just Doing Epistemology Wrong

Sure. By tweaking your "weights" or other fudge factors, you can get the right answer using any probability you please. But you're not using a generally-applicable method, that actually tells you what the right answer is. So it's a pointless exercise that sheds no light on how to correctly use probability in real problems.

To see that the probability of Heads is not "either 1/2 or 1/3, depending on what reference class you choose, or how you happen to feel about the problem today", but is instead definitely, no doubt about it, 1/3, consider the following possibility:

Upon wakening, Beauty see that there is a plate of fresh muffins beside her bed. She recognizes them as coming from a nearby cafe. She knows that they are quite delicious. She also knows that, unfortunately, the person who makes them on Mondays puts in an ingredient that she is allergic to, which causes a bad tummy ache. Muffins made on Tuesday taste the same, but don't cause a tummy ache. She needs to decide whether to eat a muffin, weighing the pleasure of their taste against the possibility of a subsequent tummy ache.

If Beauty thinks the probability of Heads is 1/2, she presumably thinks the probability that it is Monday is (1/2)+(1/2)*(1/2)=3/4, whereas if she thinks the probability of Heads is 1/3, she will think the probability that it is Monday is (1/3)+(1/2)*(2/3)=2/3. Since 3/4 is not equal to 2/3, she may come to a different decision about whether to eat a muffin if she thinks the probability of Heads is 1/2 than if she thinks it is 1/3 (depending on how she weighs the pleasure versus the pain). Her decision should not depend on some arbitrary "reference class", or on what bets she happens to be deciding whether to make at the same time. She needs a real probability. And on various grounds, that probability is 1/3.

hector-perez-arenas on Tackling Moloch: How YouCongress Offers a Novel Coordination Mechanism

Users can register now with email/password.

cubefox on Concrete benefits of making predictions

Assigning a low probability that I will do a task in time is a self-fulfilling prophecy. Because the expected utility (probability times utility) is low, the motivation to do the task decreases. Ideally I would never assign probabilities to acts when choosing what to do, and only compare their utilities.

matthew-barnett on The Hidden Complexity of Wishes

Matthew is not disputing this point, as far as I can tell.
Instead, he is trying to critique some version of^[1] the "larger argument" (mentioned in the May 2024 update to this post) in which this point plays a role.

I'll confirm that I'm not saying this post's exact thesis is false. This post seems to be largely a parable about a fictional device, rather than an explicit argument with premises and clear conclusions. I'm not saying the parable is wrong. Parables are rarely "wrong" in a strict sense, and I am not disputing this parable's conclusion.

However, I am saying: this parable presumably played some role in the "larger" argument that MIRI has made made in the past. What role did it play? Well, I think a good guess is that it portrayed the difficulty of precisely specifying what you want or intend, for example when explicitly designing a utility function. This problem was often alleged to be difficult because, when you want something complex, it's difficult to perfectly delineate potential "good" scenarios and distinguish them from all potential "bad" scenarios.

While the term "outer alignment" was not invented to describe this exact problem until much later, I was using that term purely as descriptive terminology for the problem this post clearly describes, rather than claiming that Eliezer in 2007 was deliberately describing something that he called "outer alignment" at the time. Because my usage of "outer alignment" was merely descriptive in this sense, I reject the idea that my comment was anachronistic.

And again: I am not claiming that this post is inaccurate in isolation. In both my above comment, and in my 2023 post, I merely cited this post as portraying an aspect of the problem that I was talking about, rather than saying something like "this particular post's conclusion is wrong". I think the fact that the post doesn't really have a clear thesis in the first place means that it can't be wrong in a strong sense at all. However, the post was definitely interpreted as explaining some part of why alignment is hard — for a long time by many people — and I was critiquing the particular application of the post to this argument, rather than the post itself in isolation.

cubefox on Darklight's Shortform

What's correlation space, as opposed to probability space?

irenictruth on Why I’m not a Bayesian

I shy away from fuzzy logic because I used it as a formalism to justify my religious beliefs. (In particular, "Possibilistic Logic" allowed me to appear honest to myself—and I'm not sure how much of it was self-deception and how much was just being wrong.)

The critical moment in my deconversion came when I realized that if I was looking for truth, I should reason according to the probabilities of the statements I was evaluating. Thirty minutes later, I had gone from a convinced Christian speaking to others, leading in my local church, and basing my life and career on my beliefs to an atheist who was primarily uncertain about atheism because of self-distrust.

Grounding my beliefs in falsifiable statements and probabilistic-ish models has been a beneficial discipline that forces me to recognize my limits and helps predict the outcomes of my actions. I don't know if I could do the same with fuzzy logic and "reasoning by model."