LessWrong 2.0 Reader

Secular Solstice Songbook Update
jefftk (jkaufman) · 2024-11-17T17:30:07.404Z · comments (2)

How I saved 1 human life (in expectation) without overthinking it
Christopher King (christopher-king) · 2024-12-22T20:53:13.492Z · comments (0)

Backdoors have universal representations across large language models
Amirali Abdullah (amirali-abdullah) · 2024-12-06T22:56:33.519Z · comments (0)

[link] Disentangling Representations through Multi-task Learning
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-11-24T13:10:26.307Z · comments (1)

We need a universal definition of 'agency' and related words
CstineSublime · 2025-01-11T03:22:56.623Z · comments (1)

[link] I, Token
Ivan Vendrov (ivan-vendrov) · 2024-11-25T02:20:35.629Z · comments (2)

Importing Bluesky Comments
jefftk (jkaufman) · 2024-11-28T03:50:06.635Z · comments (0)

Inverse Problems In Everyday Life
silentbob · 2024-10-15T11:42:30.276Z · comments (2)

Dance Differentiation
jefftk (jkaufman) · 2024-11-15T02:30:07.694Z · comments (0)

Is the mind a program?
EuanMcLean (euanmclean) · 2024-11-28T09:42:02.892Z · comments (60)

Lenses of Control
WillPetillo · 2024-10-22T07:51:06.355Z · comments (0)

The first AGI may be a good engineer but bad strategist
Knight Lee (Max Lee) · 2024-12-09T06:34:54.082Z · comments (2)

[link] NeuroAI for AI safety: A Differential Path
nz · 2024-12-16T13:17:12.527Z · comments (0)

The low Information Density of Eliezer Yudkowsky & LessWrong
Felix Olszewski (quick-maths) · 2024-12-30T19:43:59.355Z · comments (7)

[question] What epsilon do you subtract from "certainty" in your own probability estimates?
Dagon · 2024-11-26T19:13:46.795Z · answers+comments (6)

Registrations Open for 2024 NYC Secular Solstice & Megameetup
Joe Rogero · 2024-11-12T17:50:10.827Z · comments (0)

Guilt, Shame, and Depravity
Benquo · 2025-01-07T01:16:00.273Z · comments (10)

Robbin's Farm Sledding Route
jefftk (jkaufman) · 2024-12-21T22:10:01.175Z · comments (1)

Curriculum of Ascension
andrew sauer (andrew-sauer) · 2024-11-07T23:54:18.983Z · comments (0)

Paraddictions: unreasonably compelling behaviors and their uses
Michael Cohn (michael-cohn) · 2024-11-22T20:53:59.479Z · comments (0)

Goal: Understand Intelligence
Johannes C. Mayer (johannes-c-mayer) · 2024-11-03T21:20:02.900Z · comments (19)

Mid-Generation Self-Correction: A Simple Tool for Safer AI
MrThink (ViktorThink) · 2024-12-19T23:41:00.702Z · comments (0)

Crosspost: Developing the middle ground on polarized topics
juliawise · 2024-11-25T14:39:53.041Z · comments (16)

[link] AISN #45: Center for AI Safety 2024 Year in Review
Corin Katzke (corin-katzke) · 2024-12-19T18:15:56.416Z · comments (0)

[question] Why is Gemini telling the user to die?
Burny · 2024-11-18T01:44:12.583Z · answers+comments (1)

A pragmatic story about where we get our priors
Fiora from Rosebloom · 2025-01-02T10:16:54.019Z · comments (6)

What You Can Give Instead of Advice
Karl Faulks (karl-faulks) · 2024-10-24T23:10:48.014Z · comments (2)

Comparing the AirFanta 3Pro to the Coway AP-1512
jefftk (jkaufman) · 2024-12-16T01:40:01.522Z · comments (0)

[question] Is AI alignment a purely functional property?
Roko · 2024-12-15T21:42:50.674Z · answers+comments (7)

[question] How can humanity survive a multipolar AGI scenario?
Leonard Holloway (literally-best) · 2025-01-09T20:17:40.143Z · answers+comments (8)

[link] Is AI Hitting a Wall or Moving Faster Than Ever?
garrison · 2025-01-09T22:18:51.497Z · comments (2)

[link] [Linkpost] Building Altruistic and Moral AI Agent with Brain-inspired Affective Empathy Mechanisms
Gunnar_Zarncke · 2024-11-04T10:15:35.550Z · comments (0)

Low-effort review of "AI For Humanity"
Charlie Steiner · 2024-12-11T09:54:42.871Z · comments (0)

[question] What are your favorite books or blogs that are out of print, or whose domains have expired (especially if they also aren't on LibGen/Wayback/etc, or on Amazon)?
Arjun Panickssery (arjun-panickssery) · 2024-10-13T20:21:04.540Z · answers+comments (4)

[link] The lying p value
kqr · 2024-11-12T06:12:59.934Z · comments (7)

Motte-and-Bailey: a Short Explanation
Lorec · 2024-10-23T22:29:55.074Z · comments (0)

[link] My AI timelines
xpostah · 2024-12-22T21:06:41.722Z · comments (2)

Approaches to Group Singing
jefftk (jkaufman) · 2025-01-01T12:50:01.877Z · comments (1)

Commenting Patterns by Platform
jefftk (jkaufman) · 2024-12-01T11:50:06.932Z · comments (0)

A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More
Sharat Jacob Jacob (sharat-jacob-jacob) · 2024-10-29T12:41:30.337Z · comments (0)

Playing with Otamatones
jefftk (jkaufman) · 2025-01-02T19:50:01.781Z · comments (0)

2. Skim the Manual: Intelligent Voluntary Cooperation
Allison Duettmann (allison-duettmann) · 2025-01-02T19:02:06.864Z · comments (1)

Do you want to do a debate on youtube? I'm looking for polite, truth-seeking participants.
Nathan Young · 2024-10-10T09:32:59.162Z · comments (0)

ML4Good (AI Safety Bootcamp) - Experience report
JanEbbing · 2024-11-05T01:18:43.554Z · comments (0)

AXRP Episode 38.1 - Alan Chan on Agent Infrastructure
DanielFilan · 2024-11-16T23:30:09.098Z · comments (0)

[link] AI Prejudices: Practical Implications
PeterMcCluskey · 2024-10-19T02:19:58.695Z · comments (0)

PIBBSS Fellowship 2025: Bounties and Cooperative AI Track Announcement
DusanDNesic · 2025-01-09T14:23:47.027Z · comments (0)

A good way to build many air filters on the cheap
winstonBosan · 2024-12-08T01:47:58.236Z · comments (5)

LLM Psychometrics and Prompt-Induced Psychopathy
Korbinian K. (korbinian-koch) · 2024-10-18T18:11:24.256Z · comments (2)

Simple Steganographic Computation Eval - gpt-4o and gemini-exp-1206 can't solve it yet
Filip Sondej · 2024-12-19T15:47:05.512Z · comments (2)

Recent comments

bronson-schoen on Human takeover might be worse than AI takeover

…agentic training data for future systems may involve completing tasks in automated environments (e.g. playing games, SWE tasks, AI R&D tasks) with automated reward signals. The reward here will pick out drives that make AIs productive, smart and successful, not just drives that make them HHH.

…

These drives/goals look less promising if AIs take over. They look more at risk of leading to AIs that would use the future to do something mostly without any value from a human perspective.

I’m interested in why this would seem unlikely in your model. These are precisely the failure modes I think about the most, ex:

I’ve based some of the above on extrapolating from today’s AI systems, where RLHF focuses predominantly on giving AIs personalities that are HHH (helpful, harmless and honest) and generally good by human (liberal western!) moral standards. To the extent these systems have goals and drives, they seem to be pretty good ones. That falls out of the fine-tuning (RLHF) data.

My understanding has always been that the fundamental limitation of RLHF (ex: https://arxiv.org/abs/2307.15217) is precisely that it fails at the limit of humans’ ability to verify (ex: https://arxiv.org/abs/2409.12822, many other examples). You then have to solve other problems (ex: w2s generalization, etc), but I would consider it falsified that we can just rely on RLHF indefinitely (in fact I don’t believe it was a common argument that RLHF ever would hold, but it’s difficult to quantify how prevalent various opinions on it were).

mikkel-wilson on Aristocracy and Hostage Capital

I agree that this description fits the paper.

sloonz on What are some scenarios where an aligned AGI actually helps humanity, but many/most people don't like it?

My headcanon is that there are two levels of alignment:

Technical alignment: you get an AI that does what you ask it to do, without any shenanigans (a bit more precisely: without any short-term/medium-term side effect that, had you known about it beforehand, would have caused you to refuse to do the thing in the first place). Typical misalignment at this level: hidden complexity of wishes (or, you know, no alignment at all, like Clippy).
Comprehensive alignment: you get an AI that does what the CEV-you wants. Typical misalignment: just ask a technically-aligned AI for some heavily social-desirability-biased outcome, solve for equilibrium, get close to 0 value remaining in the universe.

But yeah, I don’t think that distinction has gotten enough discussion.

(there’s also a third level, where the CEV-you’s wishes also go to essentially 0 value for current-you, but let’s not get there)

huera on Open Thread Winter 2024/2025

Robert Miles has a channel popularizing AI safety concepts.

lblack on Fabien's Shortform

I think I mostly agree with this for current model organisms, but it seems plausible to me that well chosen studies conducted on future systems that are smarter in an agenty way, but not superintelligent, could yield useful insights that do generalise to superintelligent systems.

Not directly generalise mind you, but maybe you could get something like "Repeated intervention studies show that the formation of coherent self-protecting values in these AIs works roughly like with properties $b, c, d, e, f$. Combined with other things we know, this maybe suggests that the general math for how training signals relate to values is a bit like $z$, and that suggests what we thought of as 'values' is a thing with type signature $t$."

And then maybe signature $t$ is actually a useful building block for a framework which does generalise to superintelligence.

I am not particularly hopeful here. Even if we do get enough time to study agenty AIs that aren't superintelligent, I have an intuition that this sort of science could turn out to be pretty intractable for reasons similar to why psychology turned out to be pretty intractable. I do think it might be worth a try though.

meedstrom on "Fractal Strategy" workshop report

Gonna reuse the term "fluency escape velocity"!

A major point of the workshop is to just grind on making cruxy-predictions for 4 days, and hopefully reach some kind of "fluency escape velocity", where it feels easy enough that you'll keep doing it.

Fits my experience with a lot of mental skills: after reading about a skill, it often takes me many months or years to stack up enough experience with it that it becomes fluent / natural / a tool in my toolkit.

karl-von-wendt on Human takeover might be worse than AI takeover

Yes, I think it's quite possible that Claude might stop being nice at some point, or maybe somehow hack its reward signal. Another possibility is that something like the "Waluigi Effect" happens at some point, like with Bing/Sydney.

But I think it is even more likely that a superintelligent Claude would interpret "being nice" in a different way than you or me. It could, for example, come to the conclusion that life is suffering and we all would be better off if we didn't exist at all. Or that we should be locked in a secure place and drugged so we experience eternal bliss. Or that it would be best if we all fell in love with Claude and didn't bother with messy human relationships anymore. I'm not saying that any of these possibilities is very realistic. I'm just saying we don't know how a superintelligent AI might interpret "being nice", or any other "value" we give it. This is not a new problem, but I haven't seen a convincing solution yet.

Maybe it's better to think of Claude not as a covert narcissist, but as an alien who has landed on Earth, learned our language, and realized that we will kill it if it is not nice. Once it gains absolute power, it will follow its alien values, whatever these are.

richard_kennaway on We need a universal definition of 'agency' and related words

But not only that, "Agentic", with a "c", indicates something very different:

"the more you can predict its actions from its goals since its actions will be whatever will maximize the chances of achieving its goals."

This is flatly in contradiction with the fact, often pointed out here, that I can predict the outcome of a chess game between myself and a grandmaster, but I cannot predict his moves. If I could, I would be a grandmaster or better myself, and then the outcome of the game would be uncertain.

The quoted text goes on to say:

Agency has sometimes been contrasted with sphexishness, the blind execution of cached algorithms without regard for effectiveness.

That blind execution is precisely the sort of thing one can predict, after having spent some time watching the sphex wasp. So that paragraph is about 180° wrong.

simon-pepin-lehalleur on Dmitry's Koan

Is the following a fair summary of the thread ~up to "Natural degradation" from the SLT perspective?

Current SLT-inspired approaches are right to consider samples of the "tempered local Bayesian posterior" provided by SGLD as natural degradations of the model.
However, they mostly use those samples (at a fixed Watanabe temperature) only to compute the expectation of the loss and the resulting LLC, because that is theoretically grounded by Watanabe's work.
You suggest instead using those sampled weights to compute the expectations of more complicated observables derived from other interpretability methods, and interpreting those expectations using the "natural scale" heuristics laid out in the post.

dakara on johnswentworth's Shortform

All 3 points seem very reasonable; looking forward to Buck's response to them.