LessWrong 2.0 Reader

Confusing the metric for the meaning: Perhaps correlated attributes are "natural"
NickyP (Nicky) · 2024-07-23T12:43:18.681Z · comments (3)

Takeaways from a Mechanistic Interpretability project on “Forbidden Facts”
Tony Wang (tw) · 2023-12-15T11:05:23.256Z · comments (8)

DIY LessWrong Jewelry
Fluffnutt (Pear) · 2024-08-25T21:33:56.173Z · comments (0)

[link] Twitter thread on open-source AI
Richard_Ngo (ricraz) · 2024-07-31T00:26:11.655Z · comments (6)

Monthly Roundup #20: July 2024
Zvi · 2024-07-23T12:50:07.991Z · comments (9)

Sparse autoencoders find composed features in small toy models
Evan Anders (evan-anders) · 2024-03-14T18:00:43.339Z · comments (12)

Musings on LLM Scale (Jul 2024)
Vladimir_Nesov · 2024-07-03T18:35:48.373Z · comments (0)

[question] Feedback request: what am I missing?
Nathan Helm-Burger (nathan-helm-burger) · 2024-11-02T17:38:39.625Z · answers+comments (5)

Boston Solstice 2023 Retrospective
jefftk (jkaufman) · 2024-01-02T03:10:05.694Z · comments (0)

Rational Animations offers animation production and writing services!
Writer · 2024-03-15T17:26:07.976Z · comments (0)

[link] The Cancer Resolution?
PeterMcCluskey · 2024-07-24T00:25:17.322Z · comments (24)

Empathy/Systemizing Quotient is a poor/biased model for the autism/sex link
tailcalled · 2024-11-04T21:11:57.788Z · comments (0)

[link] AI Safety Memes Wiki
plex (ete) · 2024-07-24T18:53:04.977Z · comments (1)

Templates I made to run feedback rounds for Ethan Perez’s research fellows.
Henry Sleight (ResentHighly) · 2024-03-28T19:41:15.506Z · comments (0)

Experimentation (Part 7 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-18T21:25:56.527Z · comments (0)

Monthly Roundup #16: March 2024
Zvi · 2024-03-19T13:10:05.529Z · comments (4)

Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols
Arjun Panickssery (arjun-panickssery) · 2024-01-15T21:21:03.962Z · comments (0)

Effectively Handling Disagreements - Introducing a New Workshop
Camille Berger (Camille Berger) · 2024-04-15T16:33:50.339Z · comments (2)

Proveably Safe Self Driving Cars [Modulo Assumptions]
Davidmanheim · 2024-09-15T13:58:19.472Z · comments (26)

AI #63: Introducing Alpha Fold 3
Zvi · 2024-05-09T14:20:03.176Z · comments (2)

More on the Apple Vision Pro
Zvi · 2024-02-13T17:40:05.388Z · comments (5)

UDT1.01: Logical Inductors and Implicit Beliefs (5/10)
Diffractor · 2024-04-18T08:39:13.368Z · comments (2)

[link] FTX expects to return all customer money; clawbacks may go away
Mikhail Samin (mikhail-samin) · 2024-02-14T03:43:13.218Z · comments (1)

[link] Vacuum: Theory and Technologies
ethanmorse · 2024-01-21T17:23:49.257Z · comments (0)

What AI companies should do: Some rough ideas
Zach Stein-Perlman · 2024-10-21T14:00:10.412Z · comments (10)

One True Love
Zvi · 2024-02-09T15:10:05.298Z · comments (7)

[link] Information dark matter
Logan Kieller (logan-kieller) · 2024-10-01T15:05:41.159Z · comments (4)

LLMs can strategically deceive while doing gain-of-function research
Igor Ivanov (igor-ivanov) · 2024-01-24T15:45:08.795Z · comments (4)

The slingshot helps with learning
Wilson Wu (wilson-wu) · 2024-10-31T23:18:16.762Z · comments (0)

How I build and run behavioral interviews
benkuhn · 2024-02-26T05:50:05.328Z · comments (6)

[link] Manifund: 2023 in Review
Austin Chen (austin-chen) · 2024-01-18T23:50:13.557Z · comments (0)

[link] the subreddit size threshold
bhauth · 2024-01-23T00:38:13.747Z · comments (3)

[link] Why you, personally, should want a larger human population
jasoncrawford · 2024-02-23T19:48:10.526Z · comments (32)

A quick experiment on LMs’ inductive biases in performing search
Alex Mallen (alex-mallen) · 2024-04-14T03:41:08.671Z · comments (2)

[question] What Software Should Exist?
Tomás B. (Bjartur Tómas) · 2024-01-19T21:43:50.112Z · answers+comments (27)

[link] Self-Resolving Prediction Markets
PeterMcCluskey · 2024-03-03T02:39:42.212Z · comments (0)

[link] Concrete benefits of making predictions
Jonny Spicer (jonnyspicer) · 2024-10-17T14:23:17.613Z · comments (5)

If you are also the worst at politics
lemonhope (lcmgcd) · 2024-05-26T20:07:49.201Z · comments (8)

5 ways to improve CoT faithfulness
CBiddulph (caleb-biddulph) · 2024-10-05T20:17:12.637Z · comments (34)

Preface to the Sequence on LLM Psychology
Quentin FEUILLADE--MONTIXI (quentin-feuillade-montixi) · 2023-11-07T16:12:07.742Z · comments (0)

Being against involuntary death and being open to change are compatible
Andy_McKenzie · 2024-05-27T06:37:27.644Z · comments (5)

The International PauseAI Protest: Activism under uncertainty
Joseph Miller (Josephm) · 2023-10-12T17:36:15.716Z · comments (1)

5 Reasons Why Governments/Militaries Already Want AI for Information Warfare
trevor (TrevorWiesinger) · 2023-10-30T16:30:38.020Z · comments (0)

In Defense of Lawyers Playing Their Part
Isaac King (KingSupernova) · 2024-07-01T01:32:58.695Z · comments (9)

Open Thread Fall 2024
habryka (habryka4) · 2024-10-05T22:28:50.398Z · comments (114)

[link] New Tool: the Residual Stream Viewer
AdamYedidia (babybeluga) · 2023-10-01T00:49:51.965Z · comments (7)

RLHF is the worst possible thing done when facing the alignment problem
tailcalled · 2024-09-19T18:56:27.676Z · comments (10)

Housing Roundup #10
Zvi · 2024-10-29T13:50:09.416Z · comments (2)

Computational Approaches to Pathogen Detection
jefftk (jkaufman) · 2023-11-01T00:30:13.012Z · comments (5)

Is suffering like shit?
KatjaGrace · 2024-05-31T01:20:03.855Z · comments (5)


Recent comments

johnburidan on OpenAI Email Archives (from Musk v. Altman)

From a historical perspective this is an excellent treasure cache. Truly, when you are at the cutting edge of something, ideas, relationships, personality, and economics all come together to drive history.

radford-neal-1 on Social events with plausible deniability

Then you know that someone who voiced opinion A that you put in the hat, and also opinion B, likely actually believes opinion B.

(There's some slack from the possibility that someone else put opinion B in the hat.)

marius-hobbhahn on Training AI agents to solve hard problems could lead to Scheming

Some questions and responses:
1. What if you want the AI to solve a really hard problem? You don't know how to solve it, so you cannot give it detailed instructions. It's also so hard that the AI cannot solve it without learning new things -> you're back to the story above. The story also just started with someone instructing the model to "cure cancer".
2. Instruction following models are helpful-only. What do you do about the other two H's? Do you trust the users to only put in good instructions? I guess you do want to have some side constraints baked into its personality and these can function like goals. Many of the demonstrations that we have for scheming are cases where the model is too much of a saint, i.e. it schemes for the right cause. For example, it might be willing to deceive its developers if we provide it with strong reasons that they have non-HHH goals. I'm not really sure what to make of this. I guess it's good that it cares about being harmless and honest, but it's also a little bit scary that it cares so much.

My best guess for how the approach should look is that some outcome-based RL will be inevitable if we want to unlock the benefits; we just have to hammer the virtues of being non-scheming and non-power-seeking into it at all points of the training procedure. And we then have to add additional lines of defense like control, interpretability, scalable oversight, etc. and think hard about how we minimize correlated failures. But I feel like right now, we don't really have the right tools, model organisms, and evals to establish whether any of these lines of defense actually reduce the problem.

leogao on "It's a 10% chance which I did 10 times, so it should be 100%"

related: https://xkcd.com/217/

cole-wyeth on Social events with plausible deniability

This kind of thing does justified harm to our community’s reputation. If you have fun arguing that only white people can save us while deliberately obfuscating whether you actually believe that, it is in fact a concerning sign about your intentions/seriousness/integrity/trustworthiness.

mitchell_porter on Why is Gemini telling the user to die?

I don't have a detailed explanation, but the user is posting a series of assignment or exam questions. Some of them are about "abuse". Gemini is providing an example of verbal abuse.

q-home on Q Home's Shortform

I see. But I'm not talking about figuring out human preferences, I'm talking about finding world-models in which real objects (such as "strawberries" or "chairs") can be identified. Sorry if it wasn't clear in my original message because I mentioned "caring".

Models or real objects or things capture something that is not literally present in the world. The world contains shadows of these things, and the most straightforward way of finding models is by looking at the shadows and learning from them.

You might need to specify what you mean a little bit.

The most straightforward way of finding a world-model is just predicting your sensory input. But then you're not guaranteed to get a model in which something corresponding to "real objects" can be easily identified. That's one of the main reasons why ELK is hard, I believe: in an arbitrary world-model, "Human Simulator" can be much simpler than "Direct Translator".

So how do humans get world-models in which something corresponding to "real objects" can be easily identified? My theory is in the original message. Note that the idea is not just "predict sensory input", it has an additional twist.

marius-hobbhahn on Training AI agents to solve hard problems could lead to Scheming

Good point. That's another crux for which RL seems relevant.

From the perspective of 10 years ago, specifying any goal into the AI seemed incredibly hard since we expected it would have to go through utility functions. With LLMs, this completely changed. Now it's almost trivial to give the goal, and it probably even has a decent understanding of the side constraints by default. So, goal specification seems like a much, much smaller problem now.

So the story where we misspecify the goal, the model realizes that the given goal differs from the intended goal and decides to scheme is also less likely.

Instead, there has to be a component where the AI's goals substantially change over time from something that we actually intended to something misaligned. Again, outcome-based RL and instrumental convergence yield a plausible answer.

seth-herd on Training AI agents to solve hard problems could lead to Scheming

Here's my proposal for how we avoid this consequence of consequentialist goals: make the primary goal instruction-following. This is a non-consequentialist goal. All other goals are consequentialist subgoals of that one, when the human gives an instruction to accomplish something.

This would only prevent scheming to accomplish the consequentialist goals you instructed your AGI to pursue if it was also used to give side-constraints like "don't lie to me" and lots of time carefully exploring its theories on what its goals mean and how to accomplish them. This approach seems likely to work - but I want to hear more pushback on it before I'd trust it in practice.

I think this is not only an interesting dodge around this class of alignment concerns, but it's the most likely thing to actually be implemented. When someone is actually getting close to launching a system they hope is or will become smarter than they are, they'll think a little harder about making its central goal "solve cancer" or anything else broad and consequentialist. The natural choice is to just extend what LLMs are mostly aligned for now: following instructions, including consequentialist instructions.

This logic is all laid out in more detail in Instruction-following AGI is easier and more likely than value aligned AGI, but I didn't specifically address scheming there.

seth-herd on Training AI agents to solve hard problems could lead to Scheming

Edit note: you responded to approximately the first half of my eventual comment; sorry! I accidentally committed it half-baked, then quickly added the rest. But the meaning of the first part wasn't really changed, so I'll respond to your comments on that part.

I agree that it's not that simple in practice, because we'd try to avoid that by giving side constraints; but it is that simple in the abstract, and by default. If it followed our initial goal as we intended it there would be no problem; but the core of much alignment worry is that it's really hard to get exactly what we intended into an AI as its goal.

I also agree that good HHH training might be enough to overcome the consequentialist/instrumental logic of scheming. Those tendencies would function as side constraints. The AI would have a "character" that is in conflict with its instrumental goal. Which would win out would be a result of exactly how that goal was implemented in the AI's decision-making procedures, particularly the ones surrounding learning.