LessWrong 2.0 Reader

Apply to be a Safety Engineer at Lockheed Martin!
yanni kyriacos (yanni) · 2024-03-31T21:02:08.499Z · comments (3)

OpenAI: Helen Toner Speaks
Zvi · 2024-05-30T21:10:02.938Z · comments (8)

There is a globe in your LLM
jacob_drori (jacobcd52) · 2024-10-08T00:43:40.300Z · comments (4)

Scalable oversight as a quantitative rather than qualitative problem
Buck · 2024-07-06T17:42:41.325Z · comments (11)

[link] Anxiety vs. Depression
Sable · 2024-03-17T00:15:08.255Z · comments (35)

Reflections on Less Online
Error · 2024-07-07T03:49:44.534Z · comments (15)

Rejecting Television
Declan Molony (declan-molony) · 2024-04-23T04:59:50.253Z · comments (10)

[link] Linkpost: Rishi Sunak's Speech on AI (26th October)
bideup · 2023-10-27T11:57:46.575Z · comments (8)

[link] Dario Amodei’s prepared remarks from the UK AI Safety Summit, on Anthropic’s Responsible Scaling Policy
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2023-11-01T18:10:31.110Z · comments (1)

GPT-o1
Zvi · 2024-09-16T13:40:06.236Z · comments (34)

[Valence series] 2. Valence & Normativity
Steven Byrnes (steve2152) · 2023-12-07T16:43:49.919Z · comments (5)

[link] Environmentalism in the United States Is Unusually Partisan
Jeffrey Heninger (jeffrey-heninger) · 2024-05-13T21:23:10.755Z · comments (26)

Natural Latents: The Concepts
johnswentworth · 2024-03-20T18:21:19.878Z · comments (18)

Addressing Feature Suppression in SAEs
Benjamin Wright (Benw8888) · 2024-02-16T18:32:51.927Z · comments (3)

[link] [Paper] Stress-testing capability elicitation with password-locked models
Fabien Roger (Fabien) · 2024-06-04T14:52:50.204Z · comments (10)

[link] "AI Safety for Fleshy Humans" an AI Safety explainer by Nicky Case
habryka (habryka4) · 2024-05-03T18:10:12.478Z · comments (10)

[link] Nietzsche's Morality in Plain English
Arjun Panickssery (arjun-panickssery) · 2023-12-04T00:57:42.839Z · comments (13)

[link] Hardshipification
Jonathan Moregård (JonathanMoregard) · 2024-05-28T20:02:29.709Z · comments (17)

A simple case for extreme inner misalignment
Richard_Ngo (ricraz) · 2024-07-13T15:40:37.518Z · comments (41)

[link] A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Alexandre Variengien (alexandre-variengien) · 2023-12-19T11:52:27.354Z · comments (3)

Fluent, Cruxy Predictions
Raemon · 2024-07-10T18:00:06.424Z · comments (11)

[link] [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij (teun-van-der-weij) · 2024-06-13T10:04:49.556Z · comments (10)

The case for unlearning that removes information from LLM weights
Fabien Roger (Fabien) · 2024-10-14T14:08:04.775Z · comments (3)

MATS Winter 2023-24 Retrospective
utilistrutil · 2024-05-11T00:09:17.059Z · comments (28)

Some for-profit AI alignment org ideas
Eric Ho (eh42) · 2023-12-14T14:23:20.654Z · comments (19)

Newsom Vetoes SB 1047
Zvi · 2024-10-01T12:20:06.127Z · comments (6)

Sparse Autoencoders Work on Attention Layer Outputs
Connor Kissane (ckkissane) · 2024-01-16T00:26:14.767Z · comments (9)

A Crisper Explanation of Simulacrum Levels
Thane Ruthenis · 2023-12-23T22:13:52.286Z · comments (13)

Actually, Power Plants May Be an AI Training Bottleneck.
Lao Mein (derpherpize) · 2024-06-20T04:41:33.567Z · comments (13)

Retirement Accounts and Short Timelines
jefftk (jkaufman) · 2024-02-19T18:50:05.231Z · comments (35)

[Intuitive self-models] 1. Preliminaries
Steven Byrnes (steve2152) · 2024-09-19T13:45:27.976Z · comments (18)

Untrusted smart models and trusted dumb models
Buck · 2023-11-04T03:06:38.001Z · comments (12)

AI #51: Altman’s Ambition
Zvi · 2024-02-20T19:50:07.439Z · comments (5)

[link] What are you getting paid in?
Austin Chen (austin-chen) · 2024-07-17T19:23:04.219Z · comments (14)

[link] What Depression Is Like
Sable · 2024-08-27T17:43:22.549Z · comments (23)

Release: Optimal Weave (P1): A Prototype Cohabitive Game
mako yass (MakoYass) · 2024-08-17T14:08:18.947Z · comments (21)

Agent Boundaries Aren't Markov Blankets. [Unless they're non-causal; see comments.]
abramdemski · 2023-11-20T18:23:40.443Z · comments (11)

OpenAI o1, Llama 4, and AlphaZero of LLMs
Vladimir_Nesov · 2024-09-14T21:27:41.241Z · comments (24)

[link] Essay competition on the Automation of Wisdom and Philosophy — $25k in prizes
owencb · 2024-04-16T10:10:13.338Z · comments (12)

Coup probes: Catching catastrophes with probes trained off-policy
Fabien Roger (Fabien) · 2023-11-17T17:58:28.687Z · comments (7)

AI #83: The Mask Comes Off
Zvi · 2024-09-26T12:00:08.689Z · comments (19)

Saying the quiet part out loud: trading off x-risk for personal immortality
disturbance · 2023-11-02T17:43:34.155Z · comments (89)

Why you should be using a retinoid
GeneSmith · 2024-08-19T03:07:41.722Z · comments (57)

Live Theory Part 0: Taking Intelligence Seriously
Sahil · 2024-06-26T21:37:10.479Z · comments (3)

Muddling Along Is More Likely Than Dystopia
Jeffrey Heninger (jeffrey-heninger) · 2023-10-20T21:25:15.459Z · comments (10)

Some Vacation Photos
johnswentworth · 2024-01-04T17:15:01.187Z · comments (0)

Refusal mechanisms: initial experiments with Llama-2-7b-chat
Andy Arditi (andy-arditi) · 2023-12-08T17:08:01.250Z · comments (7)

My Criticism of Singular Learning Theory
Joar Skalse (Logical_Lunatic) · 2023-11-19T15:19:16.874Z · comments (56)

AISafety.com – Resources for AI Safety
Søren Elverlin (soren-elverlin-1) · 2024-05-17T15:57:11.712Z · comments (3)

[link] New voluntary commitments (AI Seoul Summit)
Zach Stein-Perlman · 2024-05-21T11:00:41.794Z · comments (17)

Recent comments

yams on yams's Shortform

Many MATS scholars go to Anthropic (source: I work there).

Redwood: I’m really not sure, but that could be right.

Sam now works at Anthropic.

Palisade: I’ve done some work for them, I love them, I don’t know that their projects so far inhibit Anthropic (BadLlama, which I’m decently confident was part of the cause for funding them, was pretty squarely targeted at Meta, and is their most impactful work to date by several OOM). In fact, the softer versions of Palisade’s proposal (highlighting misuse risk, their core mission), likely empower Anthropic as seemingly the most transparent lab re misuse risks.

I take the thrust of your comment to be “OP funds safety, do your research”. I work in safety; I know they fund safety.

I also know most safety projects differentially benefit Anthropic (this fact is independent of whether you think differentially benefiting Anthropic is good or bad).

If you can make a stronger case for any of the other dozens of orgs on your list than exists for the few above, I’d love to hear it. I’ve thought about most of them and don’t see it, which is why I asked the question.

Further: the goalpost is not ‘net positive with respect to TAI x-risk.’ It is ‘not plausibly a component of a meta-strategy targeting the development of TAI at Anthropic before other labs.’

Edit: use of the soldier mindset flag above is pretty uncharitable here; I am asking for counter-examples to a hypothesis I’m entertaining. This is the actual opposite of soldier mindset.

matthew-barnett on The Hidden Complexity of Wishes

While the term "outer alignment" wasn’t coined until later to describe the exact issue that I'm talking about, I was using that term purely as a descriptive label for the problem this post clearly highlights, rather than implying that you were using or aware of the term in 2007.

Because I was simply using "outer alignment" in this descriptive sense, I reject the notion that my comment was anachronistic. I used that term as shorthand for the thing I was talking about, which is clearly and obviously portrayed by your post, that's all.

To be very clear: the exact problem I am talking about is the inherent challenge of precisely defining what you want or intend, especially (though not exclusively) in the context of designing a utility function. The difficulty arises because, when the desired outcome is complex, it becomes nearly impossible to perfectly delineate between all potential 'good' scenarios and all possible 'bad' scenarios. This challenge has been a recurring theme in discussions of alignment, as it's considered hard to capture every nuance of what you want in your specification without missing an edge case.

It is frankly frustrating to me that, from my perspective, you seem to have reliably missed the point of what I am trying to convey here.

I only brought up Christiano-style proposals because I thought you were changing the topic to a broader discussion, specifically to ask me what methodologies I had in mind when I made particular points. If you had not asked me "So would you care to spell out what clever methodology you think invalidates what you take to be the larger point of this post -- though of course it has no bearing on the actual point that this post makes?" then I would not have mentioned those things. In any case, none of the things I said about Christiano-style proposals were intended to critique this post's narrow point. I was responding to that particular part of your comment instead.

As far as the actual content of this post, I do not dispute its exact thesis. The post seems to be a parable, not a detailed argument with a clear conclusion. The parable seems interesting to me. It also doesn't seem wrong, in any strict sense. However, I do think that some of the broader conclusions that many people have drawn from the parable seem false, in context. I was responding to the specific way that this post had been applied and interpreted in broader arguments about AI alignment.

My central thesis with regard to this post is simply: the post clearly portrays a specific problem that was later called the "outer alignment" problem by other people. This post portrays this problem as being difficult in a particular way. And I think this portrayal is misleading, even if the literal parable holds up in pure isolation.

lsusr on What's a good book for a technically-minded 11-year old?

Besides abstractapplic's excellent answer:

A Brief History of Time and The Universe in a Nutshell by Stephen Hawking
Ender's Game by Orson Scott Card
Foundation by Isaac Asimov
The Martian by Andy Weir
Paleontology: A Brief History of Life by Ian Tattersall
Richard Feynman's books

radford-neal-1 on Change My Mind: Thirders in "Sleeping Beauty" are Just Doing Epistemology Wrong

Sure. By tweaking your "weights" or other fudge factors, you can get the right answer using any probability you please. But you're not using a generally applicable method that actually tells you what the right answer is. So it's a pointless exercise that sheds no light on how to correctly use probability in real problems.

To see that the probability of Heads is not "either 1/2 or 1/3, depending on what reference class you choose, or how you happen to feel about the problem today", but is instead definitely, no doubt about it, 1/3, consider the following possibility:

Upon awakening, Beauty sees that there is a plate of fresh muffins beside her bed. She recognizes them as coming from a nearby cafe. She knows that they are quite delicious. She also knows that, unfortunately, the person who makes them on Mondays puts in an ingredient that she is allergic to, which causes a bad tummy ache. Muffins made on Tuesday taste the same, but don't cause a tummy ache. She needs to decide whether to eat a muffin, weighing the pleasure of their taste against the possibility of a subsequent tummy ache.

If Beauty thinks the probability of Heads is 1/2, she presumably thinks the probability that it is Monday is (1/2)+(1/2)*(1/2)=3/4, whereas if she thinks the probability of Heads is 1/3, she will think the probability that it is Monday is (1/3)+(1/2)*(2/3)=2/3. Since 3/4 is not equal to 2/3, she may come to a different decision about whether to eat a muffin if she thinks the probability of Heads is 1/2 than if she thinks it is 1/3 (depending on how she weighs the pleasure versus the pain). Her decision should not depend on some arbitrary "reference class", or on what bets she happens to be deciding whether to make at the same time. She needs a real probability. And on various grounds, that probability is 1/3.
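The arithmetic above can be checked with a short sketch. The function names and the utility numbers (a pleasure of 7 and a tummy-ache cost of 10) are hypothetical, chosen only to show a case where the two credences lead to different decisions; the sketch also assumes that, given Tails, Monday and Tuesday are equally likely from Beauty's point of view.

```python
from fractions import Fraction


def p_monday(p_heads: Fraction) -> Fraction:
    """Probability that today is Monday, given Beauty's credence in Heads.

    Heads: she is woken only on Monday.
    Tails: she is woken on Monday and Tuesday, assumed equally likely.
    """
    p_tails = 1 - p_heads
    return p_heads + Fraction(1, 2) * p_tails


def should_eat(p_heads: Fraction, pleasure: Fraction, ache_cost: Fraction) -> bool:
    """Eat the muffin when expected utility is positive (illustrative rule)."""
    expected_utility = pleasure - p_monday(p_heads) * ache_cost
    return expected_utility > 0


halfer_monday = p_monday(Fraction(1, 2))   # 1/2 + 1/2 * 1/2 = 3/4
thirder_monday = p_monday(Fraction(1, 3))  # 1/3 + 1/2 * 2/3 = 2/3

# Hypothetical utilities for which the two credences disagree.
halfer_eats = should_eat(Fraction(1, 2), Fraction(7), Fraction(10))
thirder_eats = should_eat(Fraction(1, 3), Fraction(7), Fraction(10))

print(halfer_monday, thirder_monday, halfer_eats, thirder_eats)
```

With these made-up utilities, the expected value of eating is 7 - 7.5 under a credence of 1/2 and 7 - 20/3 under a credence of 1/3, so the choice flips between the two.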

hector-perez-arenas on Tackling Moloch: How YouCongress Offers a Novel Coordination Mechanism

Users can register now with email/password.

cubefox on Concrete benefits of making predictions

Assigning a low probability that I will do a task in time is a self-fulfilling prophecy. Because the expected utility (probability times utility) is low, the motivation to do the task decreases. Ideally I would never assign probabilities to acts when choosing what to do, and only compare their utilities.

matthew-barnett on The Hidden Complexity of Wishes

Matthew is not disputing this point, as far as I can tell.
Instead, he is trying to critique some version of the "larger argument" (mentioned in the May 2024 update to this post) in which this point plays a role.

I'll confirm that I'm not saying this post's exact thesis is false. This post seems to be largely a parable about a fictional device, rather than an explicit argument with premises and clear conclusions. I'm not saying the parable is wrong. Parables are rarely "wrong" in a strict sense, and I am not disputing this parable's conclusion.

However, I am saying: this parable presumably played some role in the "larger" argument that MIRI has made in the past. What role did it play? Well, I think a good guess is that it portrayed the difficulty of precisely specifying what you want or intend, for example when explicitly designing a utility function. This problem was often alleged to be difficult because, when you want something complex, it's difficult to perfectly delineate potential "good" scenarios and distinguish them from all potential "bad" scenarios.

While the term "outer alignment" was not invented to describe this exact problem until much later, I was using that term purely as descriptive terminology for the problem this post clearly describes, rather than claiming that Eliezer in 2007 was deliberately describing something that he called "outer alignment" at the time. Because my usage of "outer alignment" was merely descriptive in this sense, I reject the idea that my comment was anachronistic.

And again: I am not claiming that this post is inaccurate in isolation. In both my above comment, and in my 2023 post, I merely cited this post as portraying an aspect of the problem that I was talking about, rather than saying something like "this particular post's conclusion is wrong". I think the fact that the post doesn't really have a clear thesis in the first place means that it can't be wrong in a strong sense at all. However, the post was definitely interpreted as explaining some part of why alignment is hard — for a long time by many people — and I was critiquing the particular application of the post to this argument, rather than the post itself in isolation.

cubefox on Darklight's Shortform

What's correlation space, as opposed to probability space?

irenictruth on Why I’m not a Bayesian

I shy away from fuzzy logic because I used it as a formalism to justify my religious beliefs. (In particular, "Possibilistic Logic" allowed me to appear honest to myself—and I'm not sure how much of it was self-deception and how much was just being wrong.)

The critical moment in my deconversion came when I realized that if I was looking for truth, I should reason according to the probabilities of the statements I was evaluating. Thirty minutes later, I had gone from a convinced Christian speaking to others, leading in my local church, and basing my life and career on my beliefs to an atheist who was primarily uncertain about atheism because of self-distrust.

Grounding my beliefs in falsifiable statements and probabilistic-ish models has been a beneficial discipline that forces me to recognize my limits and helps predict the outcomes of my actions. I don't know if I could do the same with fuzzy logic and "reasoning by model."

raemon on The Hidden Complexity of Wishes

I do think such a system would be really valuable, and is the sort of thing the LW team should try to build. (I'm mostly not going to respond to this idea right now, but I've filed it away as something to revisit more seriously with Lightcone. Seems straightforwardly good.)

But it feels slightly orthogonal to what I was trying to say. Let me try again.

(this is now officially a tangent from the original point, but feels important to me)

It would be good if the world could (deservedly) trust that the best x-risk thinkers have a good group epistemic process for resolving disagreements.

At least two steps that seem helpful for that process are:

Articulating clear lists of the best arguments, such that people can prioritize refuting them (or updating on them).
But, before that, there is a messier process of "people articulating half formed versions of those arguments, struggling to communicate through different ontologies, being slightly confused." And there is some back-and-forth process typically needed to make progress.

It is that "before" step where it feels like things seem to be going wrong, to me. (I haven't re-read Matthew's post or your response comment from a year ago in enough detail to have a clear sense of what, if anything, went wrong. But to illustrate the ontology: I think that instance was roughly in the liminal space between the two steps.)

Half-formed confused arguments in different ontologies are probably "wrong", but that isn't necessarily because they are completely stupid, it can be because they are half-formed. And maybe the final version of the argument is good, or maybe not, but it's at least a less stupid version of that argument. And if Alice rejects a confused, stupid argument in a loud way, without understanding the generator that Bob was trying to pursue, Bob's often rightly annoyed that Alice didn't really hear them and didn't really engage.

Dealing with confused half-formed arguments is expensive, and I'm not sure it's worth people's time, especially given that confused half-formed arguments are hard to distinguish from "just wrong" ones.

But, I think we can reduce wasted-motion on the margin.

A hopefully cheap-enough TAP that might help, if more people did it, is something like:

<TAP> When responding to a wrong argument (which might be completely stupid, or might be a half-formed thing going in an eventually interesting direction)

<ACTION> Preface response with something like: "I think you're saying X. Assuming so, I think this is wrong because [insert argument]." End the argument with "If this seemed to be missing the point, can you try saying your thing in different words, or clarify?"

(if it feels too expensive to articulate what X is, instead one could start with something more like "It looks at first glance like this is wrong because [insert argument]" and then still end with the "check if missing the point?" closing note)

I think more-of-that-on-the-margin from a bunch of people would save a lot of time spent in aggro-y escalation spirals.

re: top level posts

This doesn't quite help with when, instead of replying to someone, you're writing a top-level post responding to an abstracted argument (i.e. The Sun is big, but superintelligences will not spare Earth a little sunlight).

I'd have to think more about what to do for that case, but, the sort of thing I'm imagining is a bit more scaffolding that builds towards "having a well indexed list of the best arguments." Maybe briefly noting early on "This essay is arguing for [this particular item in List of Lethalities]" or "This argument is adding a new item to List of Lethalities" (and then maybe update that post, since it's nice to have a comprehensive list).

This doesn't feel like a complete solution, but the sort of things I'd be looking for are cheap things you can add to posts that help bootstrap towards a clearer-list-of-the-best-arguments existing.