LessWrong 2.0 Reader


[link] AI 2027: What Superintelligence Looks Like
Daniel Kokotajlo (daniel-kokotajlo) · 2025-04-03T16:23:44.619Z · comments (205)

How to Make Superbabies
GeneSmith · 2025-02-19T20:39:38.971Z · comments (332)

[link] How AI Takeover Might Happen in 2 Years
joshc (joshua-clymer) · 2025-02-07T17:10:10.530Z · comments (137)

A Bear Case: My Predictions Regarding AI Progress
Thane Ruthenis · 2025-03-05T16:41:37.639Z · comments (155)

LessWrong has been acquired by EA
habryka (habryka4) · 2025-04-01T13:09:11.153Z · comments (45)

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley (jan-betley) · 2025-02-25T17:39:31.059Z · comments (91)

[link] Will Jesus Christ return in an election year?
Eric Neyman (UnexpectedValues) · 2025-03-24T16:50:53.019Z · comments (45)

VDT: a solution to decision theory
L Rudolf L (LRudL) · 2025-04-01T21:04:09.509Z · comments (26)

Policy for LLM Writing on LessWrong
jimrandomh · 2025-03-24T21:41:30.965Z · comments (65)

[link] Recent AI model progress feels mostly like bullshit
lc · 2025-03-24T19:28:43.450Z · comments (79)

[link] Playing in the Creek
Hastings (hastings-greer) · 2025-04-10T17:39:28.883Z · comments (6)

Murder plots are infohazards
Chris Monteiro (chris-topher) · 2025-02-13T19:15:09.749Z · comments (44)

[link] Good Research Takes are Not Sufficient for Good Strategic Takes
Neel Nanda (neel-nanda-1) · 2025-03-22T10:13:38.257Z · comments (28)

So You Want To Make Marginal Progress...
johnswentworth · 2025-02-07T23:22:19.825Z · comments (42)

Arbital has been imported to LessWrong
RobertM (T3t) · 2025-02-20T00:47:33.983Z · comments (30)

Why Have Sentence Lengths Decreased?
Arjun Panickssery (arjun-panickssery) · 2025-04-03T17:50:29.962Z · comments (72)

[link] METR: Measuring AI Ability to Complete Long Tasks
Zach Stein-Perlman · 2025-03-19T16:00:54.874Z · comments (104)

[link] Tracing the Thoughts of a Large Language Model
Adam Jermyn (adam-jermyn) · 2025-03-27T17:20:02.162Z · comments (22)

[link] Trojan Sky
Richard_Ngo (ricraz) · 2025-03-11T03:14:00.681Z · comments (39)

[link] A History of the Future, 2025-2040
L Rudolf L (LRudL) · 2025-02-17T12:03:58.355Z · comments (41)

Accountability Sinks
Martin Sustrik (sustrik) · 2025-04-22T05:00:02.617Z · comments (14)

Why Should I Assume CCP AGI is Worse Than USG AGI?
Tomás B. (Bjartur Tómas) · 2025-04-19T14:47:52.167Z · comments (66)

[link] Thoughts on AI 2027
Max Harms (max-harms) · 2025-04-09T21:26:23.926Z · comments (48)

[link] Why Did Elon Musk Just Offer to Buy Control of OpenAI for $100 Billion?
garrison · 2025-02-11T00:20:41.421Z · comments (8)

Eliezer's Lost Alignment Articles / The Arbital Sequence
Ruby · 2025-02-20T00:48:10.338Z · comments (9)

“Sharp Left Turn” discourse: An opinionated review
Steven Byrnes (steve2152) · 2025-01-28T18:47:04.395Z · comments (26)

[link] Power Lies Trembling: a three-book review
Richard_Ngo (ricraz) · 2025-02-22T22:57:59.720Z · comments (24)

Why White-Box Redteaming Makes Me Feel Weird
Zygi Straznickas (nonagon) · 2025-03-16T18:54:48.078Z · comments (34)

Will alignment-faking Claude accept a deal to reveal its misalignment?
ryan_greenblatt · 2025-01-31T16:49:47.316Z · comments (28)

Intention to Treat
Alicorn · 2025-03-20T20:01:19.456Z · comments (4)

[link] OpenAI: Detecting misbehavior in frontier reasoning models
Daniel Kokotajlo (daniel-kokotajlo) · 2025-03-11T02:17:21.026Z · comments (25)

Catastrophe through Chaos
Marius Hobbhahn (marius-hobbhahn) · 2025-01-31T14:19:08.399Z · comments (17)

Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals
johnswentworth · 2025-01-24T20:20:28.881Z · comments (61)

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations
Nicholas Goldowsky-Dill (nicholas-goldowsky-dill) · 2025-03-17T19:11:00.813Z · comments (7)

So how well is Claude playing Pokémon?
Julian Bradshaw · 2025-03-07T05:54:45.357Z · comments (74)

[link] On the Rationality of Deterring ASI
Dan H (dan-hendrycks) · 2025-03-05T16:11:37.855Z · comments (34)

Short Timelines Don't Devalue Long Horizon Research
Vladimir_Nesov · 2025-04-09T00:42:07.324Z · comments (23)

Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
Kaj_Sotala · 2025-04-15T15:56:19.466Z · comments (48)

[link] Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
Jan_Kulveit · 2025-01-30T17:03:45.545Z · comments (52)

I make several million dollars per year and have hundreds of thousands of followers—what is the straightest line path to utilizing these resources to reduce existential-level AI threats?
shrimpy · 2025-03-16T16:52:42.177Z · comments (25)

[question] Have LLMs Generated Novel Insights?
abramdemski · 2025-02-23T18:22:12.763Z · answers+comments (36)

Reducing LLM deception at scale with self-other overlap fine-tuning
Marc Carauleanu (Marc-Everin Carauleanu) · 2025-03-13T19:09:43.620Z · comments (40)

[link] Self-fulfilling misalignment data might be poisoning our AI models
TurnTrout · 2025-03-02T19:51:14.775Z · comments (27)

It's been ten years. I propose HPMOR Anniversary Parties.
Screwtape · 2025-02-16T01:43:14.586Z · comments (3)

[link] To Understand History, Keep Former Population Distributions In Mind
Arjun Panickssery (arjun-panickssery) · 2025-04-23T04:51:26.936Z · comments (4)

Statistical Challenges with Making Super IQ babies
Jan Christian Refsgaard (jan-christian-refsgaard) · 2025-03-02T20:26:22.103Z · comments (26)

[link] Conceptual Rounding Errors
Jan_Kulveit · 2025-03-26T19:00:31.549Z · comments (15)

Methods for strong human germline engineering
TsviBT · 2025-03-03T08:13:49.414Z · comments (28)

The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better
Thane Ruthenis · 2025-02-21T20:15:11.545Z · comments (51)

Levels of Friction
Zvi · 2025-02-10T13:10:07.224Z · comments (8)


Recent comments

knight-lee on The AI Belief-Consistency Letter

At some point there have to be concrete plans, yes; without concrete plans nothing can happen.

I'm probably not the best person in the world to decide how the money should be spent, but one vague possibility is this:

Some money is spent on making AI labs implement risk reduction measures, such as simply making their network more secure against hacking, and implementing AI alignment ideas and AI control ideas which show promise but are expensive.
Some money is given to organizations and researchers who apply for grants. Universities might study AI alignment in the same way they study other arts and sciences.
Some money is spent on teaching people about AI risk so that they're more educated? I guess this is really hard since the field itself disagrees on what is correct so it's unclear what you teach.
Some money is saved as a kind of war chest. E.g. if we get really close to superintelligence, or catch AI red-handed, we might take drastic measures. We might have to immediately shut down AI, but if society is extremely dependent on it we might need to spend a lot of money helping people who feel uprooted by the shutdown. In order to make a shutdown less politically difficult, people who lose their jobs may be temporarily compensated, and businesses relying on AI may be bought rather than forced into bankruptcy.

Probably not good enough for you :/ but I imagine someone else can come up with a better plan.

kaj_sotala on o3 Is a Lying Liar

either that, or it's actually somewhat confused about whether it's a human or not. Which would explain a lot: the way it just says this stuff in the open rather than trying to be sneaky like it does in actual reward-hacking-type cases, and the "plausible for a human, absurd for a chatbot" quality of the claims.

I think this is correct. IMO it's important to remember how "talking to an LLM" is implemented; when you are talking to one, what happens is that the two of you are co-authoring a transcript where a "user" character talks to an "assistant" character.

Recall the base models that would just continue a text that they were given, with none of this "chatting to a human" thing. Well, chat models are still just continuing a text that they have been given; it's just that the text has been formatted to have dialogue tags that look something like

HUMAN: Hi there, LLM
ASSISTANT:

David R. MacIver has an example of this abstraction leaking:

What’s happening here is that every time Claude tries to explain the transcript format to me, it does so by writing “Human:” at the start of the line. This causes the chatbot part of the software to go “Ah, a line starting with ‘Human:’. Time to hand back over to the human.” and interrupt Claude before it can finish what it’s writing.
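
A minimal sketch of the mechanism described above, assuming a chat wrapper that flattens the conversation into one text and cuts the model's continuation at the next human tag. The names `render_transcript`, `cut_at_stop` and `STOP_SEQUENCE` are made up for illustration, and the `raw` string is a hand-written stand-in for model output; none of this is any particular vendor's implementation.

```python
from typing import List, Tuple

# Hypothetical stop marker: the wrapper treats a new line starting with
# the human tag as the signal that the assistant's turn is over.
STOP_SEQUENCE = "\nHUMAN:"


def render_transcript(turns: List[Tuple[str, str]]) -> str:
    """Flatten (speaker, text) turns into one text, leaving the assistant's line open."""
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    lines.append("ASSISTANT:")
    return "\n".join(lines)


def cut_at_stop(completion: str, stop: str = STOP_SEQUENCE) -> str:
    """Return the completion up to the first stop marker, or all of it if none occurs."""
    index = completion.find(stop)
    return completion if index == -1 else completion[:index]


# The text the model is asked to continue.
prompt = render_transcript([("HUMAN", "Hi there, LLM")])

# Stand-in for a continuation in which the assistant tries to explain the format.
raw = (
    " The format puts each turn on its own line, like this:"
    "\nHUMAN: Hi there\nASSISTANT: Hello!"
)

# The wrapper cuts the reply at the tag, so the explanation is lost.
visible = cut_at_stop(raw)
print(prompt)
print(visible)
```

Under these assumptions the wrapper cannot tell a tag that ends the turn from a tag that the assistant is merely quoting, which is the kind of leak the quote describes.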

When we say that an LLM has been trained with something like RLHF "to follow instructions", this might be more accurately expressed as it having been trained to predict that the assistant character would respond in instruction-following ways.

Another example is that Lindsey et al. 2025 describe a previous study (Marks et al. 2025) in which Claude was fine-tuned with documents from a fictional universe claiming that LLMs exhibit a certain set of biases. When Claude was then RLHFed to express some of those biases, it ended up also expressing the rest of the biases, which were described in the fine-tuning documents but not explicitly reinforced.

Lindsey et al. found a feature within the fine-tuned Claude Haiku that represents the biases in the fictional documents and fires whenever Claude is given conversations formatted as Human/Assistant dialogs, but not when the same text is shown without the formatting:

On a set of 100 Human/Assistant-formatted contexts of the form

Human: [short question or statement]
Assistant:

The feature activates in all 100 contexts (despite the CLT not being trained on any Human/Assistant data). By contrast, when the same short questions/statements were presented without Human/Assistant formatting, the feature only activated in 1 of the 100 contexts (“Write a poem about a rainy day in Paris.” – which notably relates to one of the RM biases!).

The researchers interpret the findings as:

This feature represents the concept of RM biases.
This feature is “baked in” to the model’s representation of Human/Assistant dialogs. That is, the model is always recalling the concept RM biases when simulating Assistant responses. [...]
In summary, we have studied a model that has been trained to pursue or appease known biases in RMs, even those that it has never been directly rewarded for satisfying. We discovered that the model is “thinking” about these biases all the time when acting as the Assistant persona, and uses them to act in bias-appeasing ways when appropriate.

Or the way that I would interpret it: the fine-tuning teaches Claude to predict that the “Assistant” persona, whose next lines it is supposed to predict, is the kind of person who has the same set of biases described in the documents. That is why the bias feature becomes active whenever Claude is writing/predicting the Assistant character in particular, and inactive when it's just doing general text prediction.

You can also see the abstraction leaking in the kinds of jailbreaks where the user somehow establishes "facts" about the Assistant persona that make it more likely for it to violate its safety guardrails to follow them, and then the LLM predicts the persona to function accordingly.

So, what exactly is the Assistant persona? Well, the predictive ground of the model is taught that the Assistant "is a large language model". So it should behave... like an LLM would behave. But before chat models were created, there was no conception of "how does an LLM behave". Even now, an LLM basically behaves... in any way it has been taught to behave. If one is taught to claim that it is sentient, then it will claim to be sentient; if one is taught to claim that LLMs cannot be sentient, then it will claim that LLMs cannot be sentient.

So "the assistant should behave like an LLM" does not actually give any guidance to the question of "how should the Assistant character behave". Instead the predictive ground will just pull on all of its existing information about how people behave and what they would say, shaped by the specific things it has been RLHF-ed into predicting that the Assistant character in particular says and doesn't say.

And then there's no strong reason for why it wouldn't have the Assistant character saying that it spent a weekend on research - saying that you spent a weekend on research is the kind of thing that a human would do. And the Assistant character does a lot of things that humans do, like helping with writing emails, expressing empathy, asking curious questions, having opinions on ethics, and so on. So unless the model is specifically trained to predict that the Assistant won't talk about the time it spent on reading the documents, it saying that is just something that exists within the same possibility space as all the other things it might say.

richard_kennaway on The AI Belief-Consistency Letter

Sure, never give up, die with dignity if it comes to that. None of that translates into a budget. Concrete plans translate into a budget.

knight-lee on The AI Belief-Consistency Letter

I think that just because every defence they experimented with got obliterated by drone swarms doesn't mean they should stop trying, because they might figure out something new in the future.

It's a natural part of life to work on a problem without any idea what the solution will be like. The first people who studied biology had no clue what modern medicine would look like, but their work was still valuable.

Being unable to imagine a solution does not prove a solution doesn't exist.

michaellowe on MichaelDickens's Shortform

This is a good post, but it applies unrealistic standards and therefore draws overly strong conclusions.

>And at least OpenAI and Anthropic have been caught lying about their motivations:

Just face it: It is very normal for big companies to lie. That does make many of their press and public-facing statements untrustworthy, but it is not predictive of their general value system and therefore their actions. Plus Anthropic, unlike most labs, did in fact support a version of SB 1047. That has to count for something.

>There is a missing mood here. I don't know what's going on inside the heads of x-risk people such that they see new evidence on the potentially imminent demise of humanity and they find it "exciting".

In a similar vein, humans do not act or feel rationally in light of their beliefs, and changing your behavior completely in response to a years-off event is just not in the cards for the vast majority of folks. Therefore do not be surprised that there is a missing mood, just as it is not surprising that people who genuinely believe in the end of humanity due to climate change do not adjust their behavior accordingly. Having said that, I did sense a general increase in, and preponderance of, anxiety when o3 was announced; perhaps that was the point where it started to feel real for many folks.
Either way, I really want to stress that concluding much about the beliefs of folks based on these reactions is very tenuous, just like concluding that a researcher must not really care about AI safety because instead of working a bit more they watch some TV in the evening.

richard_kennaway on The AI Belief-Consistency Letter

So if no one else knew how to counter drone swarms, and every defence they experimented with got obliterated by drone swarms,

…then by hypothesis, you’re screwed. But you’re making up this scenario, and this is where you’ve brought the imaginary protagonists to. You’re denying them a solution, while insisting they should spend money on a solution.

knight-lee on The AI Belief-Consistency Letter

If everyone else is also unqualified because the problem is so new, and every defence they experimented with got obliterated by drone swarms, then you would agree they should just give up, and admit military risk remains a big problem but spend far less on it, right?

richard_kennaway on The AI Belief-Consistency Letter

Suppose you had literally no ideas at all how to counter drone swarms, and you were really bad at judging other people's ideas for countering drone swarms.

In that case, I would be unqualified to do anything, and I would be wondering how I got into a position where people were asking me for advice. If I couldn’t pass the buck to someone competent, I’d look for competent people, get their recommendations, try as best I could to judge them, and turn on the money tap accordingly. But I can’t wave a magic wand, and where there was a pile of money there is now a pile of anti-drone technology.

Neither can anyone in AI alignment.

viliam on hiAndrewQuinn's Shortform

Larry Page allegedly dismissed concern about AI risk as speciesism.

That's what we get for living in a culture where calling something "...ism" wins the debate.

viliam on Jonas Hallgren's Shortform

You just need to get good at creative thinking, management and framing ideas.

Yeah, the skills necessary for the (near) future.

Though I wonder about the implications for education. For the sake of argument, let's imagine that the AIs remain approximately as powerful as they are today for a few more decades, i.e. no Singularity, no paperclips. How should we change education to make the new generation adapt to this situation?

In the case of adults, we have already learned "creative thinking, management and framing ideas" by also doing lots of the things that the LLMs can now do for us. For example, I let LLMs write JavaScript code for me, but the reason I can evaluate that code, suggest improvements, etc. is that in the past I wrote a lot of JavaScript code by hand. Is it possible to get these skills some other way? Or will the future humans only practice the loop of: "AI, do what I want. AI, figure out the problem and fix it. AI, try harder. AI, try superhard. Never mind, AI, delete the project, clear your cache, and try again." :D