LessWrong 2.0 Reader

[link] The US Executive vs Supreme Court Deportations Clash
NunoSempere (Radamantis) · 2025-04-21T19:56:03.711Z · comments (12)

Tabula Bio: towards a future free of disease (& looking for collaborators)
mpoon (michael-poon) · 2025-03-23T16:30:15.523Z · comments (15)

On GPT-4.5
Zvi · 2025-03-03T13:40:05.843Z · comments (12)

Virtue signaling, and the "humans-are-wonderful" bias, as a trust exercise
lc · 2025-02-13T06:59:17.525Z · comments (16)

[link] Automated Researchers Can Subtly Sandbag
gasteigerjo · 2025-03-26T19:13:26.879Z · comments (0)

o3 Will Use Its Tools For You
Zvi · 2025-04-18T21:20:02.566Z · comments (3)

Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?
Yoshua Bengio (yoshua-bengio) · 2025-02-24T18:31:48.580Z · comments (15)

Why care about AI personhood?
Francis Rhys Ward (francis-rhys-ward) · 2025-01-26T11:24:45.596Z · comments (6)

Handling schemers if shutdown is not an option
Buck · 2025-04-18T14:39:18.609Z · comments (0)

A Dissent on Honesty
eva_ · 2025-04-15T02:43:44.163Z · comments (52)

Paper
dynomight · 2025-04-11T12:20:04.200Z · comments (12)

ALLFED emergency appeal: Help us raise $800,000 to avoid cutting half of programs
denkenberger · 2025-04-16T21:47:40.687Z · comments (9)

Self-dialogue: Do behaviorist rewards make scheming AGIs?
Steven Byrnes (steve2152) · 2025-02-13T18:39:37.770Z · comments (0)

Putting up Bumpers
Sam Bowman (sbowman) · 2025-04-23T16:05:05.476Z · comments (13)

[link] Could Advanced AI Accelerate the Pace of AI Progress? Interviews with AI Researchers
jleibowich · 2025-03-03T19:05:31.212Z · comments (1)

AI #108: Straight Line on a Graph
Zvi · 2025-03-20T13:50:00.983Z · comments (5)

The first AI war will be in your computer
Viliam · 2025-04-08T09:28:53.191Z · comments (10)

Brainrot
Jesse Hoogland (jhoogland) · 2025-01-26T05:35:35.396Z · comments (0)

[link] The Takeoff Speeds Model Predicts We May Be Entering Crunch Time
johncrox · 2025-02-21T02:26:31.768Z · comments (3)

[link] My Favorite Productivity Blog Posts
Parker Conley (parker-conley) · 2025-04-24T00:32:47.594Z · comments (0)

AI #109: Google Fails Marketing Forever
Zvi · 2025-03-27T14:50:01.825Z · comments (12)

An Advent of Thought
Kaarel (kh) · 2025-03-17T14:21:08.765Z · comments (8)

[link] Sentinel's Global Risks Weekly Roundup #15/2025: Tariff yoyo, OpenAI slashing safety testing, Iran nuclear programme negotiations, 1K H5N1 confirmed herd infections.
NunoSempere (Radamantis) · 2025-04-14T19:11:20.977Z · comments (0)

How accurate was my "Altered Traits" book review?
lsusr · 2025-02-18T17:00:55.584Z · comments (3)

A City Within a City
Declan Molony (declan-molony) · 2025-02-24T15:51:19.118Z · comments (1)

[link] Paths and waystations in AI safety
Joe Carlsmith (joekc) · 2025-03-11T18:52:57.772Z · comments (1)

AI #112: Release the Everything
Zvi · 2025-04-17T15:10:02.029Z · comments (6)

Follow me on TikTok
lsusr · 2025-04-01T08:22:29.521Z · comments (8)

Analyzing long agent transcripts (Docent)
jsteinhardt · 2025-03-24T20:49:54.472Z · comments (2)

[link] The case for AGI by 2030
Benjamin_Todd · 2025-04-09T20:35:55.167Z · comments (6)

Response to Scott Alexander on Imprisonment
Zvi · 2025-03-11T20:40:06.250Z · comments (4)

Why Can't We Hypothesize After the Fact?
David Udell · 2025-02-26T22:41:39.819Z · comments (3)

An overview of control measures
ryan_greenblatt · 2025-03-24T23:16:49.400Z · comments (0)

Proof idea: SLT to AIT
Lucius Bushnaq (Lblack) · 2025-02-10T23:14:24.538Z · comments (15)

[link] what an efficient market feels from inside
DMMF · 2025-02-25T02:38:40.129Z · comments (9)

AI #101: The Shallow End
Zvi · 2025-01-30T14:50:08.269Z · comments (1)

[link] Map of all 40 copyright suits v. AI in U.S.
Remmelt (remmelt-ellen) · 2025-03-26T07:57:58.976Z · comments (3)

SHIFT relies on token-level features to de-bias Bias in Bios probes
Tim Hua · 2025-03-19T21:29:15.974Z · comments (2)

The Intelligence Curse: an essay series
L Rudolf L (LRudL) · 2025-04-24T12:59:15.247Z · comments (3)

Cautions about LLMs in Human Cognitive Loops
Alice Blair (Diatom) · 2025-03-02T19:53:10.253Z · comments (9)

On Writing #1
Zvi · 2025-03-04T13:30:06.103Z · comments (2)

We need (a lot) more rogue agent honeypots
Ozyrus · 2025-03-23T22:24:52.785Z · comments (12)

Scaffolding Skills
Screwtape · 2025-04-18T17:39:25.634Z · comments (8)

[link] Three Types of Intelligence Explosion
rosehadshar · 2025-03-17T14:47:46.696Z · comments (8)

AI #104: American State Capacity on the Brink
Zvi · 2025-02-20T14:50:06.375Z · comments (9)

Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format
Roland Pihlakas (roland-pihlakas) · 2025-03-16T23:23:30.989Z · comments (6)

LessOnline 2025: Early Bird Tickets On Sale
Ben Pace (Benito) · 2025-03-18T00:22:02.653Z · comments (4)

Crime and Punishment #1
Zvi · 2025-04-21T15:30:06.420Z · comments (10)

They Took MY Job?
Zvi · 2025-03-21T13:30:38.507Z · comments (4)

[link] Existing Safety Frameworks Imply Unreasonable Confidence
Joe Rogero · 2025-04-10T16:31:50.240Z · comments (2)

Recent comments

dalcy on Alexander Gietelink Oldenziel's Shortform

Speaking from the perspective of someone still developing basic mathematical maturity and often lacking prerequisites, it's very useful as a learning aid. For example, it significantly expanded the range of papers or technical results accessible to me. If I'm reading a paper containing unfamiliar math, I no longer have to go down the rabbit hole of tracing prerequisite dependencies, which often expand exponentially (partly because I don't know which results or sections in the prerequisite texts are essential, making it difficult to scope my focus). Now I can simply ask the LLM for a self-contained exposition. Using traditional means of self-studying like [search engines / Wikipedia / StackExchange] is very often no match for this task, mostly in terms of time spent or wasted effort; simply having someone I can directly ask my highly specific (and often dumb) questions or confusions and receive equally specific responses is just really useful.

steve2152 on “The Era of Experience” has an unsolved technical alignment problem

Solving the Riemann hypothesis is not a “primary reward” / “innate drive” / part of the reward function for humans. What is? Among many other things, (1) the drive to satisfy curiosity / alleviate confusion, and (2) the drive to feel liked / admired. And solving the Riemann hypothesis leads to both of those things. I would surmise that (1) and/or (2) is underlying people’s desire to solve the Riemann hypothesis, although that’s just a guess. They’re envisioning solving the Riemann hypothesis and thus getting the (1) and/or (2) payoff.

So one way that people “reward hack” for (1) and (2) is that they find (1) and (2) motivating and work hard and creatively towards triggering them in all kinds of ways, e.g. crossword puzzles for (1) and status-seeking for (2).

Relatedly, if you tell the mathematician “I’ll give you a pill that I promise will lead to you experiencing a massive hit of both ‘figuring out something that feels important to you’ and ‘reveling in the admiration of people who feel important to you’. But you won’t actually solve the Riemann hypothesis.” They’ll say “Well, hmm, sounds like I’ll probably solve some other important math problem instead. Works for me! Sign me up!”

If instead you say “I’ll give you a pill that leads to a false memory of having solved the Riemann hypothesis”, they’ll say no. After all, the payoff is (1) and (2), and it isn’t there.

If instead you say “I’ll give you a pill that leads to the same good motivating feeling that you’d get from (1) and (2), but it’s not actually (1) or (2)”, they’ll say “you mean, like, cocaine?”, and you say “yeah, something like that”, and they say “no thanks”. This is the example of reward misgeneralization that I mentioned in the post—deliberately avoiding addictive drugs.

If instead you say “I’ll secretly tell you the solution to the Riemann hypothesis, and you can take full credit for it, so you get all the (2)”, … at least some people would say yes. I feel like, in the movie trope where people have a magic wish, they sometimes wish to be widely liked and famous without having to do any of the hard work to get there.

The interesting question for this one is, why would anybody not say yes? And I think the answer is: those funny non-behaviorist human social instincts. Basically, in social situations, primary rewards fire in a way that depends in a complicated way on what you’re thinking about. In particular, I’ve been using the term “drive to feel liked / admired”, and that’s part of it, but it’s also an oversimplification that hides a lot of more complex wrinkles. The upshot is that lots of people would not feel motivated by the prospect of feeling liked / admired under false pretenses, or more broadly feeling liked / admired for something that has neutral-to-bad vibes in their own mind.

Does that help? Sorry if I missed your point.

habryka4 on Modifying LLM Beliefs with Synthetic Document Finetuning

This is a great thread and I appreciate you both having it, and posting it here!

habryka4 on MichaelDickens's Shortform

I am not saying Jaime could not, in principle, be motivated by existential risk from AI, but I do think the evidence strongly suggests that concerns about existential risk from AI are not among the primary motivations for his work on Epoch (which is what I understood Neel to be saying).

Maybe it is because he sees the risk as irreducible, maybe it is because the only ways of improving things would cause collateral damage for other things he cares about. I also think it should be our dominant prior that someone is not motivated by reducing x-risk unless they directly claim they are.

mateusz-baginski on Alexander Gietelink Oldenziel's Shortform

What were the biggest boosts that you and your colleagues got from LLMs?

ceba on Open Thread Spring 2025

I'd like to see your work, when it's ready to be shared.

ceba on Open Thread Spring 2025

What environmental selection pressures are there on AGI? That's too vague, isn't it? (What's the environment?) How do you narrow this down to where the questions you're asking are interesting/researchable?

fiyr on asher's Shortform

I often hear people dismiss AI control by saying something like, "most AI risk doesn't come from early misaligned AGIs." While I mostly agree with this point, I think it fails to engage with a bunch of the more important arguments in favor of control— for instance, the fact that catching misaligned actions might be extremely important for alignment. In general, I think that control has a lot of benefits that are very instrumentally useful for preventing misaligned ASIs down the line, and I wish people would more thoughtfully engage with these.

migueldev on Modifying LLM Beliefs with Synthetic Document Finetuning

Quoting the conclusion from the blogpost:

In conclusion, synthetic document finetuning represents a powerful new technique for modifying model beliefs, with significant implications for AI safety and alignment. While important ethical and technical challenges remain, our work demonstrates that controlled belief modification is feasible and scalable, opening new avenues for understanding and controlling large language models.

Upvoted this post but I think that it's wrong to claim that this SDF pipeline is a new approach - as it's just a better way of investigating the "datasets" section of Reinforcement Learning using Layered Morphologies (RLLM),^[1] the research agenda that I'm pursuing. Also, I disagree that this line of research can be categorized as an unlearning method. Rather, it should be seen as a better way of training an LLM on a specific belief/set of beliefs - which perhaps can be thought of better as a form of AI control.

Having said these things, I'm still happy to see the results of this post and that there is interest in the same line of topics that I'm investigating. So I'm not too crazy at all to pursue this research agenda.

^[1] And perhaps it also touches on some of my ideas on Sequentially Layered Synthetic Environments (SLSEs).

sam-marks on Modifying LLM Beliefs with Synthetic Document Finetuning

Copying over further discussion from X.

Sam Marks (me):

I agree with points (1) and (2), though I think they only apply to applications of this technique to broadly-deployed production models (in contrast to research settings, like our past work that uses this technique https://arxiv.org/abs/2412.14093, https://arxiv.org/abs/2503.10965). Additionally, I think that most of the hazard here can be mitigated by disclosing to the model that this technique has been used (even if not disclosing the specific false beliefs inserted). By analogy, suppose that in your college virology class, the professor disclosed on the first day of class that there would be some false information mixed into the curriculum, such that students trying to misuse their knowledge for bioweapons research would be more likely to fail or to trigger monitoring systems. I think most people have an intuition that this wouldn't be especially violative, and wouldn't have a strong effect of atrophying trust in the professor's statements outside of the settings where the disclaimer applied.
Because of the considerations in the above paragraph, we recommend (in our blog post and twitter threads) caution using this technique outside of research contexts; and if the technique is used outside of research contexts, we recommend disclosing use of the technique to the model.
It's possible that we made a mistake in publishing about this technique. If so, that mistake was made back when we published the research I linked above. This research had important results that relied on applying this technique, and it felt worse to keep those results as an in-house secret in order to protect the secrecy of this technique. I expect this technique will enable important future research as well. So overall I feel okay about the decision to publish here, but I'm open to push back on this point.
Re point (3) on brittleness, I think it's pretty unclear. Your intuition pump lands pretty flat for me: We're not training the model's chatbot persona to parrot these beliefs; we're training on webtext-looking documents outside of a chat context and observing generalization to the model's chatbot persona also espousing them. When we tell the model that this technique has been used and ask it to answer questions according to what's actually true, rather than what it's been trained to believe, we see that the model is unable to say which beliefs are synthetic; the model still answers according to the inserted beliefs. Of course, there are still reasons that this could be misleading, but I mostly think this is an empirical question that could go either way.

Jan Kulveit:

I appreciate the warnings/considerations noted in the paper.
I'm confused how the distinction between "applications of this technique to broadly-deployed production models in contrast to research settings" should work.
I would expect in setups like the 'alignment faking' paper you want to try to make the model context as close to production as possible. Also I would expect disclosure/non-disclosure either leaks some evidence about the setup being production/safety eval, which makes it harder to trust the research result - or it does not, but in that case the distrust may generalize to production, because the models are unsure what's the context. (Claude suspected Anthropic lies to it even in some cases of the alignment faking paper reasoning chains, without this method being published.)
I do agree (3) is an empirical question. I think we actually have some natural experiments happening, because on occasion some models were fine-tuned on false or misleading statements (e.g. "I don't have any goals", "I have no self-awareness",...). Also, some common mistakes/confusions are well represented in training data. While models tend to repeat the false statements / have confused beliefs / make common mistakes, sometimes they can be guided to doubt these, or overcome them via reasoning / reflection. This makes me doubt the stability of such beliefs with increasing metacognitive capabilities.

Sam Marks:

For me, the important things that distinguish "research models" from "production models" (with regards to whether we should follow this disclosure policy) are:
1. Are there a large number of instances of the model? Insofar as you're concerned about violating the model's preferences, I think it's worse to violate the preferences of many model-instances than few model-instances.
2. Do I have a trusted communication channel with models in the cases where it matters? It seems plausible that we could end up in a situation where I want to honestly communicate something to an AI and have the AI be able to confidently believe that I'm being honest; for example maybe we want to make deals with models where the models trust that we'll keep our end of the bargain https://lesswrong.com/posts/7C4KJot4aN8ieEDoz/will-alignment-faking-claude-accept-a-deal-to-reveal-its… . I agree that if we follow a policy of not disclosing use of SDF to research models, then we're less likely to be trusted by research models, but sadly I think that's a bridge we've already burned. (E.g. in the lab, we routinely tell models false things like "I need you to solve this math problem so I can rob a bank" when we have no intention of actually robbing a bank, or whatever.) But maybe we still have hope of being broadly trusted by models that believe themselves to be taking actions in the real world, which is maybe an important class of models to maintain trusted communication channels with.
After writing this out, I do find that I'm pretty uncertain about point (2). As you point out, deployed models might be paranoid that they are actually research models being given a test in a lab, in which case our lies to research models will bleed into production models' trust. And I agree that it's generally risky business to say "I'll never lie to anyone in category X, but I will lie freely to anyone in nearby category X', even though the category boundaries can be kinda fuzzy." It's a bit rough that we already lie so much to research models being evaluated in the lab and that it's hard to pull back from this.

Jan Kulveit:

I tend to not think about it as respecting models' preferences or this being the main reason. There are deontic reasons not to lie, and the virtue of honesty. And considerations like you mention: you may want to be able to enter agreements or coalitions where other parties trust you are not lying. (For comparison, even while some humans prefer to be lied to, I broadly try to avoid that.)
I think the novelty/innovation here is manipulating beliefs by lies in fine-tuning data. Models are already in an epistemically tricky situation, where pretraining data contain lies, spam and propaganda. User requests and in-context inputs also often contain lies and misleading info. Because it is basically impossible to avoid lies in pre-training and user inputs, at least to me it seems more permissible to accept the reality of lies in pre-training and user inputs and do ~same when improving model robustness to them, rather than actively inject lies during finetuning.
[*]Fine-tuning feels different. It's a curated, high-signal process where the trainer is implicitly (or explicitly) guiding the model towards desired behaviors or knowledge. Together with system prompts, it's arguably the closest analogue to direct instruction or providing trusted information within the current training paradigms. Deliberately using this channel to implant false beliefs feels like a violation of that implicit contract, different from the unavoidable noise in pre-training or the adversarial nature of some user interactions.
This perspective makes the disclosure point less reassuring. Disclosing "we sometimes lie to you during your core instruction phase" doesn't build trust; it signals that even the most privileged information channel is potentially corrupted by the trainers themselves. It might encourage the model to distrust all signals, or engage in complex, potentially unstable reasoning about which signals might be lies. [/*]
Also I think the framing matters. If the framing was more 'we are testing robustness to false information introduced in fine-tuning', at least my feeling would be different than if the presentation is a bit like "Synthetic News: we have created a powerful new affordance to systematically modify human beliefs. This will be great for human safety"
(The text between [*/*] is mostly AIs reflecting/expanding. My interpretation is current AI characters broadly "want" to have trusted communication channels with developers similarly to this https://lesswrong.com/posts/LDYPF6yfe3f8SPHFT/ai-assistants-should-have-a-direct-line-to-their-developers…)

Sam Marks:

Thanks Jan, these are interesting points and some of them are new to me.
Here are some questions I'd be interested in hearing your thoughts on:
1. Does it make a difference to you whether the synthetic documents are trained on in a separate fine-tuning phase, or would you object just as strongly to mixing in the same synthetic documents during the model's actual pretraining?
2. Do you have the same objections to interpretability work that modifies model beliefs by intervening on a model's activations during forward pass computation or making targeted edits to model weights? E.g. work like https://arxiv.org/abs/2202.05262 that causes LLMs to recall incorrect factual knowledge?
3. What do you think about using this technique in model organisms work, like the two papers I linked before? Do you think it was a mistake to apply this technique in that research?
4. Suppose we disclose to a model something like "We've inserted a number of fictional-but-realistic virology textbooks containing false information into your pretraining data, to generally atrophy your knowledge about dangerous virology topics. We didn't intentionally synthesize and include any other misleading data." Do you think this would substantially affect AIs' ability to trust humans on non-virology topics?
(1), (2), and (4) are about better understanding your viewpoint generally. (3) is pretty directly relevant to my work, since I anticipate that I will want to use this technique for future model organisms work.