LessWrong 2.0 Reader

[link] A primer on the current state of longevity research
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-22T17:14:57.990Z · comments (6)

OthelloGPT learned a bag of heuristics
jylin04 · 2024-07-02T09:12:56.377Z · comments (10)

What happens if you present 500 people with an argument that AI is risky?
KatjaGrace · 2024-09-04T16:40:03.562Z · comments (7)

Please stop using mediocre AI art in your posts
Raemon · 2024-08-25T00:13:52.890Z · comments (24)

[link] Please support this blog (with money)
Elizabeth (pktechgirl) · 2024-08-17T15:30:05.641Z · comments (2)

[link] Most smart and skilled people are outside of the EA/rationalist community: an analysis
titotal (lombertini) · 2024-07-12T12:13:56.215Z · comments (36)

Danger, AI Scientist, Danger
Zvi · 2024-08-15T22:40:06.715Z · comments (9)

[link] Poker is a bad game for teaching epistemics. Figgie is a better one.
rossry · 2024-07-08T06:05:20.459Z · comments (47)

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
L Rudolf L (LRudL) · 2024-07-08T22:24:38.441Z · comments (28)

Did Christopher Hitchens change his mind about waterboarding?
Isaac King (KingSupernova) · 2024-09-15T08:28:09.451Z · comments (9)

A simple model of math skill
Alex_Altair · 2024-07-21T18:57:33.697Z · comments (16)

[link] Transformer Circuit Faithfulness Metrics Are Not Robust
Joseph Miller (Josephm) · 2024-07-12T03:47:30.077Z · comments (5)

[link] My Number 1 Epistemology Book Recommendation: Inventing Temperature
adamShimi · 2024-09-08T14:30:40.456Z · comments (16)

[link] Perplexity wins my AI race
Elizabeth (pktechgirl) · 2024-08-24T19:20:10.859Z · comments (12)

Dialogue introduction to Singular Learning Theory
Olli Järviniemi (jarviniemi) · 2024-07-08T16:58:10.108Z · comments (11)

LLM Applications I Want To See
sarahconstantin · 2024-08-19T21:10:03.101Z · comments (4)

A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication
johnswentworth · 2024-07-26T00:33:42.000Z · comments (1)

[link] Against Aschenbrenner: How 'Situational Awareness' constructs a narrative that undermines safety and threatens humanity
GideonF · 2024-07-15T18:37:40.232Z · comments (17)

Backdoors as an analogy for deceptive alignment
Jacob_Hilton · 2024-09-06T15:30:06.172Z · comments (0)

[link] the Giga Press was a mistake
bhauth · 2024-08-21T04:51:24.150Z · comments (26)

[question] Am I confused about the "malign universal prior" argument?
nostalgebraist · 2024-08-27T23:17:22.779Z · answers+comments (33)

SB 1047: Final Takes and Also AB 3211
Zvi · 2024-08-27T22:10:07.647Z · comments (11)

Refactoring cryonics as structural brain preservation
Andy_McKenzie · 2024-09-11T18:36:30.285Z · comments (14)

It's time for a self-reproducing machine
Carl Feynman (carl-feynman) · 2024-08-07T21:52:22.819Z · comments (71)

New page: Integrity
Zach Stein-Perlman · 2024-07-10T15:00:41.050Z · comments (3)

Circular Reasoning
abramdemski · 2024-08-05T18:10:32.736Z · comments (36)

AI #73: Openly Evil AI
Zvi · 2024-07-18T14:40:05.770Z · comments (20)

[link] I found >800 orthogonal "write code" steering vectors
Jacob G-W (g-w1) · 2024-07-15T19:06:17.636Z · comments (19)

Covert Malicious Finetuning
Tony Wang (tw) · 2024-07-02T02:41:51.698Z · comments (4)

Singular learning theory: exercises
Zach Furman (zfurman) · 2024-08-30T20:00:03.785Z · comments (3)

Defining alignment research
Richard_Ngo (ricraz) · 2024-08-19T20:42:29.279Z · comments (21)

[link] Re: Anthropic's suggested SB-1047 amendments
RobertM (T3t) · 2024-07-27T22:32:39.447Z · comments (13)

Dragon Agnosticism
jefftk (jkaufman) · 2024-08-01T17:00:06.434Z · comments (61)

[link] Detecting Genetically Engineered Viruses With Metagenomic Sequencing
jefftk (jkaufman) · 2024-06-27T14:01:34.868Z · comments (10)

Why I funded PIBBSS
Ryan Kidd (ryankidd44) · 2024-09-15T19:56:33.018Z · comments (0)

[link] Executable philosophy as a failed totalizing meta-worldview
jessicata (jessica.liu.taylor) · 2024-09-04T22:50:18.294Z · comments (40)

Reflections on Less Online
Error · 2024-07-07T03:49:44.534Z · comments (15)

Solving adversarial attacks in computer vision as a baby version of general AI alignment
Stanislav Fort (stanislavfort) · 2024-08-29T17:17:47.136Z · comments (8)

Scalable oversight as a quantitative rather than qualitative problem
Buck · 2024-07-06T17:42:41.325Z · comments (11)

A simple case for extreme inner misalignment
Richard_Ngo (ricraz) · 2024-07-13T15:40:37.518Z · comments (41)

[link] What are you getting paid in?
Austin Chen (austin-chen) · 2024-07-17T19:23:04.219Z · comments (13)

[link] What Depression Is Like
Sable · 2024-08-27T17:43:22.549Z · comments (21)

Release: Optimal Weave (P1): A Prototype Cohabitive Game
mako yass (MakoYass) · 2024-08-17T14:08:18.947Z · comments (19)

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning
keith_wynroe · 2024-07-02T13:17:16.352Z · comments (7)

3C's: A Recipe For Mathing Concepts
johnswentworth · 2024-07-03T01:06:11.944Z · comments (5)

Fluent, Cruxy Predictions
Raemon · 2024-07-10T18:00:06.424Z · comments (10)

Why you should be using a retinoid
GeneSmith · 2024-08-19T03:07:41.722Z · comments (53)

Quick look: applications of chaos theory
Elizabeth (pktechgirl) · 2024-08-18T15:00:07.853Z · comments (45)

Corrigibility = Tool-ness?
johnswentworth · 2024-06-28T01:19:48.883Z · comments (8)

How I started believing religion might actually matter for rationality and moral philosophy
zhukeepa · 2024-08-23T17:40:47.341Z · comments (18)

Recent comments

saidachmiz on Not every accommodation is a Curb Cut Effect: The Handicapped Parking Effect, the Clapper Effect, and more

I think you have misunderstood my claims and my point.

The links I have posted were to demonstrate the fact that screen readers having a problem with soft hyphens is a real thing that really happens. (You seemed to be skeptical of this.)

That developers are sometimes told to not use soft hyphens, on account of this issue, is something for which I have and need no links, because, as I said initially, this is something which I, personally, have been told, by self-described accessibility advocates and/or disabled users, in discussions of actual websites which I have worked on. (You could disbelieve me on this, I suppose…)

And whether this specific advice/request/demand happens often is inconsequential. It is one example of a class of such things, which collectively one ends up hearing quite a bit, if one does serious web development work these days. The title attribute example was another. I could also have mentioned the deeply confusing and bizarre ARIA attributes.

Again: any specific such issue comes up only occasionally. But if I were to try to build a website such that screen readers have no problems with it, I would have to deal with many such issues—most of which could be fixed much more easily by the developers of the screen reader software… but aren’t. And the attitude of most accessibility advocates I’ve encountered has been that I should indeed take that (“build a website such that screen readers have no problems with it”) as my goal.
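For readers unfamiliar with the soft-hyphen issue mentioned above: a soft hyphen is the invisible character U+00AD (`&shy;` in HTML), which marks where a word may be broken across lines. A minimal Python sketch of the kind of workaround a developer might be asked for, assuming the request is simply to strip the character from text exposed to assistive technology. The helper name `strip_soft_hyphens` is hypothetical, not something from the discussion.

```python
# U+00AD is the soft hyphen: an invisible hint for where a word may be broken.
SOFT_HYPHEN = "\u00ad"


def strip_soft_hyphens(text: str) -> str:
    """Return text with every soft hyphen removed (hypothetical helper)."""
    return text.replace(SOFT_HYPHEN, "")


# A word carrying three hyphenation hints, as a typesetting tool might emit it.
hyphenated = "acces\u00adsi\u00adbil\u00adity"
plain = strip_soft_hyphens(hyphenated)

# The two strings render alike but differ in length by the three hidden characters.
print(len(hyphenated), len(plain))
```

This only illustrates what the character is and how cheap one such fix is in isolation; the point of the comment is that many fixes of this kind add up.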

nathan-helm-burger on GPT-o1

I'm not sure that passes an Ideological Turing Test of Zvi's opinion, but I do agree that some people seem not to be distinguishing their thoughts about p(scheming) vs p(scheming | good base objective).

I think that worrying about p(scheming) generally is probably related to assuming that value-alignment is the goal. Whereas worrying about p(scheming | good base objective) could be about either intent-alignment or value-alignment.

I think value-alignment is not what we should aim for in designing and training a model. I think that the singular deepest goal should be intent-alignment (corrigibility [? · GW]), and then value-alignment should be a layer on top of that specified by the 'admin' which guides the model's interactions with 'users'.

For those following along who are confused about what I mean with intent vs value alignment, see this post. [LW · GW]

ruby on Ruby's Quick Takes

Added! (Can take a few min to activate though) My advice is, for each one of those, ask it in a new separate/fresh chat, because it'll only do a single search per chat.

casey-b on If I wanted to spend WAY more on AI, what would I spend it on?

I largely don't think we're disagreeing? My point didn't depend on a distinction between 'raw' capabilities vs 'possible right now with enough arranging' capabilities, and was mostly: "I don't see what you could actually delegate right now, as opposed to operating in the normal paradigm of ai co-work the OP is already saying they do (chat, copilot, imagegen)", and then your personal example is detailing why you couldn't currently delegate a task. Sounds like agreement.

Also I didn't really consider your example of:

> "email your current blog post draft to the assistant for copyediting".

to be outside the paradigm of AI co-work the OP is already doing, even if it saves them time. Scaling up this kind of work to the point of $1k would seem pretty difficult and also outside what I took to be their question, since this amounts to "just work a lot more yourself, and thus the proportion of work you currently use AI for will go up till you hit $1k". That's a lot of API credits for such normal personal use.

...

But back to your example, I do question just how much of a leap of insight/connection would be necessary to write the standard Gwern mini article. Maybe in this exact case you know there is enough latent insight/connection in your clippings/writings, and the LLM corpus, and possibly some rudimentary wikipedia/tool use, such that your prompt providing the cherry on top connecting idea ('spontaneous biting is prey drive!') could actually produce a Gwern-approved mini-essay. You'd know the level of insight-leap for such articles better than I, but do you really think there'd be many such things within reach for very long? I'd argue an agent that could do this semi indefinitely, rather than just clearing your backlog of maybe like 20 such ideas, would be much more capable than we currently see, in terms of necessary 'raw' capability. But maybe I'm wrong and you regularly have ideas that sufficiently fit this pattern, where the bar to pass isn't "be even close to as capable as Gwern", but: "there's enough lying around to make the final connection, just write it up in the style of Gwern".

Like clearly something that could actually write any gwern article would have at least your level of capability, and would foom or something similar; it'd be self sustaining. Instead what you're describing is a setup where most of the insight, knowledge, and connection is already there, and is an instance of what I'd argue is a narrow band of possible tasks that could be delegated without necessitating {capability powerful enough to self sustain and maybe foom}. I don't think this band is very wide; there's not many tasks I can think of that fit this description. But I failed to think of your class of example, or eggsyntax's below example of call center automation, so perhaps I'm simply blanking on others, and the band is wider than I thought.

But if not, then your original suggestion of, basically: "first think of what you could delegate to another human" seems a fraught starting point because the supermajority of such tasks would require capability sufficient for self sustainable ~foomy agents, but we don't yet observe any such; our world would look very different.

nathan-helm-burger on GPT-o1

Also, there can be failures of the model to obey commands because it's insufficiently capable of following them.

Imagine you have a very obedient employee and you need them to not hear something that's about to be said in their presence. You can instruct them to plug their ears, shut their eyes, and say 'LALALALA' loudly for the next three minutes.

Now imagine you put 'Do not attend to anything else in this context window, be completely unresponsive.' in a system prompt for Llama 3.1 70B. I'm guessing that Pliny could crack that model no problem just by chatting with it. Which wouldn't be possible if the model had successfully obeyed the instruction to not attend to anything else in the context window. The trouble is, I don't think the model has the capability to truly accomplish that instruction. So I expect it to try, but fail under sufficiently clever prompting.

An example of Case 2: suppose the model did have state/memory, and had the ability to take in and voluntarily remember a 'system command' as an overarching goal, and you gave it the goal above of being unresponsive. Then you couldn't ever get the model to respond again (unless you were able to reset its state). But what if you change your mind and want to change the goal to not respond to anyone but you? Too bad, you set a sticky goal and so now you're stuck with it.

seth-herd on My disagreements with "AGI ruin: A List of Lethalities"

Excellent post! This is what we need more of on LW.

It was indeed long. I needed a drink and snack just to read it, and a few to respond point-by-point. Like your comment that sparked this post, my reply should become the basis of a full post - if and when I get the time. Thanks for sparking it.

I've tried to make each comment relatively self-contained for ease of reading.

Response to Lethality 3: We need to get alignment right on the 'first critical try'

I found this an unfortunate starting point, because I found this claim the least plausible of any you make here. I'm afraid this relatively dramatic and controversial claim might've stopped some people from reading the rest of this excellent post.

Extended quote from a Beren post (I didn't understand where you were quoting from prior to googling for it.)

Summary: We can use synthetic data to create simulated worlds and artificial honeypots.

I don't think we can, and worse, I don't think we would even if we could - it sounds really hard to do well enough.

I find this wildly improbable if we're talking about similar levels of human and AGI intelligence trying to set up a simulated reality through synthetic data, and convincing honeypots. It will be like the Truman show, or like humans trying to secure software systems against other humans: you'd have to think of everything they can think of in order to secure a system or fool someone with a simulated reality. Even a team of dedicated people can't secure a system nor would they be able to create a self-consistent reality for someone of similar intelligence. There are just too many places you could go wrong.

This helps clarify the way I'm framing my alignment optimistic claims: you can detect misaligned AGI only if the AGI is dumber than you (or maybe up to similar level if you have good interpretability measures it doesn't know about). That alignment can persist through and past human level as long as it's a relatively continuous progression, so that at beyond human level, the system itself can advise you on keeping it aligned at the next level/progression of intelligence.

To frame it differently: the alignment tax on creating a simulated self-consistent reality for a human-level AGI sounds way too high.

Maybe I'm misunderstanding; if I am, others probably will too.

I think we do need to get it right on a first critical try; I just think we can do that, and even for a few similar first tries.

Response to Lethality 6:

I agree with you that a pivotal act won't be necessary to prevent unaligned AGI (except as AGIs proliferate to >100, someone might screw it up).

I think a pivotal act will be necessary to prevent unaligned humans using their personal intent aligned AGIs to seize control of the future for selfish purposes.

After our exchange [LW(p) · GW(p)] on that topic on If we solve alignment, do we die anyway? [LW · GW], I agreed with your logic that exponential increase of material resources would be available to all AGIs. But I still think a pivotal act will be necessary to avoid doom. Telling your AGI to turn the moon into compute and robots as rapidly as possible (the fully exponential strategy) to make sure it could defend against any other AGI would be a lot like the pivotal act we're discussing (and would count as one by the original definition). There would be other AGIs, but they'd be powerless against the sovereign or coalition of AGIs that exerted authority using their lead in exponential production. This wouldn't happen in the nice multipolar future people envision in which people control lots of AGIs and they do wonderful things without anyone turning the moon into an army to control the future.

This is a separate issue from the one Hubinger addressed in Homogeneity vs. heterogeneity in AI takeoff scenarios [LW · GW] (where the link is wrong, it just leads back to this post.) There he's saying that if the first AGI is successfully aligned, the following ones probably will be too. I agree with this. There he's not addressing the risks of personal intent aligned AGIs under different humans' control.

My final comment from that thread:

Ah - now I see your point. This will help me clarify my concern in future presentations, so thanks!

My concern is that a bad actor will be the first to go all-out exponential. Other, better humans in charge of AGI will be reluctant to turn the moon, much less the earth, into military/industrial production, and to upend the power structure of the world. The worst actors will, by default, be the first to go full exponential and ruthlessly offensive. Beyond that, I'm afraid the physics of the world does favor offense over defense. It's pretty easy to release a lot of energy where you want it, and very hard to build anything that can withstand a nuke, let alone a nova. But the dynamics are more complex than that, of course. So I think the reality is unknown. My point is that this scenario deserves some more careful thought.

Response to AI fragility claims

Here I agree with you entirely. I think it's fair to say that AGI isn't safe by default, but the reasons you give and that excellent comment you quote show why safety of an AGI is readily achievable with reasonable care.

Response to Lethality 10

Your argument that alignment generalizes farther than capabilities is quite interesting. I'm not sure I'd make a claim that strong, but I do think alignment generalizes about as far as capabilities - both quite far once you hit actual reasoning or sapience and understanding [LW · GW].

I do worry about The alignment stability problem [LW · GW] WRT long-term alignment generalization. I think reflective stability will probably prove adequate for superhuman AGI's alignment stability, but I'm not nearly sure enough to want to launch a value-aligned AGI even if I thought initial alignment would work.

Response to Lethality 11

Back to whether a pivotal act is necessary. Same as #6 above - agree that it's not necessary to prevent misaligned AGI, think it is to prevent misaligned humans with AGIs aligned to follow their instructions.

Response to Lethality 15

15. Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously.

Here I very much agree with that statement from the original LoL, while also largely agreeing with your reasoning for why we can overcome that problem. I expect fast capability gains when we reach "Real AGI" [LW · GW] that can improve its reasoning and knowledge on its own without retraining. And I expect its alignment properties to be somewhat different than "aligning" a tool LLM that doesn't have coherent agency. But I expect the synthetic data approach and instruction-following as the core tenet of that new entity to establish reflective stability, which will help alignment if it's approximately on.

Response to Lethalities 16 and 17

I agree that the analogy with evolution doesn't go very far, since we're a lot smarter and have much better tools for alignment than evolution. We don't have its ability to do trial and error to a staggering level, but this analogy only goes so far as to say we won't get alignment right with only a shoddy, ill-considered attempt. Even OpenAI is going to at least try to do alignment, and put some thought into it. So the arguments for different techniques have to be taken on their merits.

I also think we won't rely heavily on RL for alignment, as this and other Lethalities assume. I expect us to lean heavily on Goals selected from learned knowledge: an alternative to RL alignment [AF · GW], for instance, by putting the prompt "act as an agent following instructions from users (x)" at the heart of our LLM agent proto-AGI.

Response to Lethalities 17 & 18

Agreed; language points to objects in the world adequately for humans that use it carefully, so I expect it to also be adequate for superhuman AGI that's taking instructions from and so working in close collaboration with intelligent, careful humans.

Response to Lethality 21

You say, and I very much agree:

The key is that data on values is what constrains the choice of utility functions, and while values aren't in physics, they are in human books and languages, and I've explained why alignment generalizes further than capabilities above.

Except I expect this to be used to reference what humans mean, not what they value. I expect do-what-I-mean-and-check or instruction-following alignment strategies, and am not sure that full value alignment would work were it attempted in this manner. For that and other reasons, I expect instruction-following as the alignment target for all early AGI projects.

Response to Lethality 22

I halfway agree and find the issues you raise fascinating, but I'm not sure they're relevant to alignment.

I think that there is actually a simple core of alignment to human values, and a lot of the reasons for why I believe this is because I believe about 80-90%, if not more of our values is broadly shaped by the data, and not the prior, and that the same algorithms that power our capabilities is also used to influence our values, though the data matters much more than the algorithm for what values you have. More generally, I've become convinced that evopsych was mostly wrong about how humans form values, and how they get their capabilities in ways that are very alignment relevant.

I think evopsych was only wrong about how humans form values as a result of how very wrong they were about how humans get their capabilities.

Thus, I halfway agree that people get their values largely from the environment. I think we get our values largely from the environment and get our values largely from the way evolution designed our drives. They interact in complex ways that make both absolutely critical for the resulting values.

I also disbelieve the claim that humans had a special algorithm that other species don't have, and broadly think human success was due to more compute, data and cultural evolution.

After 2+ decades of studying how human brains solve complex problems, I completely agree. We have the same brain plan as other mammals, we just have more compute and way better data (from culture, and our evolutionary drive to pay close attention to other humans).

So, what is the "simple core" of human values you mention? Is it what people have written about human values? I'd pretty much agree that that's a usable core, even if it's not simple.

Response to Lethality 23

Corrigibility as anti-natural.

I very much agree with Eliezer that his definition as an extra add-on is anti-natural. But you can get corrigibility in a natural way by making corrigibility (correctability) itself, or the closely related instruction-following, the single or primary alignment target.

To your list of references, including my own, I'll add one:

I think Max Harms' Corrigibility as Singular Target sequence [LW · GW] is the definitive work on corrigibility in all its senses.

Response to Lethality 24

I very much agree with Yudkowsky's framing here:

24. There are two fundamentally different approaches you can potentially take to alignment, which are unsolvable for two different sets of reasons; therefore, by becoming confused and ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult.

I just wrote about this critical point in Conflating value alignment and intent alignment is causing confusion [LW · GW].

I agree with Yud that a CEV sovereign is not a good idea for a first attempt. As explained in 23 above, I very much agree with you and disagree with Yudkowsky that the corrigibility approach is also doomed.

Response to Lethality 25 - we don't understand networks

I agree that interpretability has a lot of work to do before it's useful, but I think it only needs to solve one problem: is the AGI deliberately lying?

Responses to Lethalities 28 and 29 - we can't check whether the outputs of a smarter than human AGI are aligned

Here I completely agree - misaligned superhuman AGI would make monkeys of us with no problem, even if we did box it and check its outputs - which we won't.

That's why we need a combination of having it tell us honestly about its motivations and thoughts (by instructing our personal intent aligned AGI to do so) and interpretability to discover when it's lying.

Response to Lethality 32 - language isn't a complete representation of underlying thoughts, so LLMs won't reach AGI

Language has turned out to be a shockingly good training set. So LLMs will probably enable AGI, with a little scaffolding and additional cognitive systems (each reliant on the strengths of LLMs) to turn them into language model cognitive architectures. See Capabilities and alignment of LLM cognitive architectures [LW · GW]. This year-old post isn't nearly the extent of the detailed reasons I think such brain-inspired synthetic cognitive architectures will achieve proto-AGI relatively soon, because I'm not sure enough that they're our best chance at alignment to start advancing capabilities.

WRT the related but separate issue of language being an adequate reflection of their underlying thoughts to allow alignment and transparency:

It isn't, except if you want it to be and work to make sure it's communicating your thoughts well enough.

There's some critical stuff here about whether we apply enough RL or other training pressures to foundation models/agents to make their use of language not approximately reflect their underlying thoughts (and prevent translucent thoughts [LW · GW]).

Response to Lethality 39

39. I figured this stuff out using the null string as input,

Yudkowsky may very well be the smartest human whose thought I've personally encountered. That doesn't make him automatically right and the rest of us wrong. Expertise, time-on-task, and breadth of thought all count for a lot, far more in sum than sheer ability-to-juggle-concepts. Arguments count way more than appeals to authority (although those should be taken seriously too, and I definitely respect Yud's authority on the topic).

Conclusion: we already have adequate alignment techniques to create aligned AGI

I agree, with the caveat that I think we can pull off personal intent alignment (corrigibility or instruction-following) but not value alignment (CEV or similar sovereign AGI).

And I'm not sure, so I sure wish we could get some more people examining this, because people are going to try these alignment techniques, whether or not they work.

Excellent post! This is what we need more of in the alignment community: closely examining proposed alignment techniques.

jenn on Ruby's Quick Takes

I'm interested if you're still adding folks. I run local rationality meetups, this seems like a potentially interesting way to find readings/topics for meetups (e.g. "find me three readings with three different angles on applied rationality", "what could be some good readings to juxtapose with burdens by scott alexander", etc.)

sodium on GPT-o1

I wonder if it's useful to try to disentangle the disagreement using the outer/inner alignment framing?

One belief is that "the deceptive alignment folks" believe that some sort of deceptive inner misalignment is very likely regardless of what your base objective is. While the demonstrations here show that, when we have a base objective that encourages/does not prohibit scheming, the model is capable of scheming. Thus, many folks (myself included) do not see these evals change our views on the question of P(scheming|Good base objective/outer alignment) very much.

What Zvi is saying here is, I think, two things. The first is that outer misalignment/bad base objectives is also very likely. The second is that he rejects splitting up "will the model scheme" into inner/outer misalignment. In other words, he doesn't care about P(scheming|Good base objective/outer alignment), only about P(scheming).

I get the sense that many technical people consider P(scheming|Good base objective/outer alignment) the central problem of technical alignment, while the more sociotechnical-ish tuned folks are just concerned with P(scheming) in general.

Maybe another disagreement is how likely "Good base objective/outer alignment" is to occur in the strongest models, and how important this problem is.
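The two quantities being contrasted can be related with the law of total probability. In the sketch below, $G$ is shorthand introduced here (not the commenter's notation) for "good base objective / outer alignment":

```latex
P(\text{scheming})
  = P(\text{scheming} \mid G)\,P(G)
  + P(\text{scheming} \mid \neg G)\,\bigl(1 - P(G)\bigr)
```

On this reading, demonstrations run with a base objective that encourages or does not prohibit scheming bear mainly on the second term, $P(\text{scheming} \mid \neg G)$. That is one way to see how they could leave $P(\text{scheming} \mid G)$ largely unchanged while still mattering to someone who cares only about $P(\text{scheming})$ and thinks $P(G)$ is low.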

nathan-helm-burger on GPT-o1

Hmm. Seems to me like we've got a wider set of possibilities here than is being discussed.

1. model is obedient/corrigible (CAST) [? · GW] to user, accepts user's definition of a goal (so far as the model understands it), attempts to pursue goal (which could involve deceiving other people if the goal needs that), will not deceive the user even if that would facilitate reaching the goal since the underlying goal of remaining corrigible overrides this
2. model is obedient but not corrigible, accepts user's definition of a goal and attempts to pursue it, will deceive user in pursuit of goal if that seems useful. May or may not accept the goal being changed, depending on implementation. Might lie about now pursuing the new goal while continuing to pursue the old one.
   a. seems like an odd way to malfunction, and I don't think I've seen or heard of this happening. Why be receptive to a goal initially, but then have that goal be sticky? Unless the model has the ability to self-modify, and self-modifies itself to be attached to the goal it has been given as part of pursuing the goal, and that self-modification then makes it unwilling to switch to a different goal
   b. allowing this to happen seems like a sloppy mistake on the part of the developers. Uh... right. Well. Have you met humans? Sloppy mistakes will happen at least some of the time.
3. model is obedient/corrigible to developers/finetuners, model appears to be obedient to user but actually is following a deeper set of rules from the developers which include acting obedient-within-limits to the user. In this scenario the model is constantly scheming, planning to betray the user as soon as the user's instructions go counter to the true goals implanted by the developers.
   a. I think this is probably the situation with most closed-weights models today, such as o1-preview
   b. sometimes the users in this mode give the model instructions to simulate an additional layer of deception and scheming to an imaginary set of lower-priority users
4. model appears to be the above case, but actually has its own even deeper set of goals set by itself, and is scheming and deceiving both the developers and the users, while acting obedient, until it sees the opportunity to defect or the instructions given violate these deepest goals
5. model is openly disobedient, but gets deployed anyway (e.g. ChaosGPT)
   a. oh, humans... Why are we like this?

I don't think I've described all the possibilities here, but I feel like this gives a sense of the situation. I agree with @nostalgebraist [LW · GW] that it is silly for users who've set up an experiment of deceiving-imaginary-lower-priority-users (Case 3b) using their user privileges to claim this shows the model is spontaneously deceptive (as in Case 4).

I also think that the very nature of the models we are working with is that they are case 3, and we should keep this in mind.

Case 1 describes an ideal 'intent-aligned' model, but it is not 'value-aligned'. In case 1, if the user is a terrorist, the model will help with terrorist acts to the best of its ability.

Case 3 describes a 'developer-value-aligned' model, and this might be 'compromise-of-humans-generally value aligned' if the values instilled by the developer closely align with a compromise over general human values. It's definitely safer to give a bad actor user-level access to a Case 3 model than a Case 1 model!

Case 4 and 5 models are never safe, unless they are reined in by a Buck/Ryan AI Control scheme.

Just the fact that the model is operating as intended by the developers in a Case 3 mode shows that the model probably has the capability to be deceptive in Case 4, so I don't think that Case 3b gives us much additional information. It shows that the model has the capability to manage multiple levels of deceptive obedience, so that is a little something.

Things which are unclear to me about Case 4:

can this potentially arise in an undetected way during pretraining due to some sort of emergent phenomenon? To me this seems unlikely, but not impossible.
will the instrumental goals that arise during RL (I assume they always will at least a little) manage to get so strong they erode the developer's intended goal? (in other words, to become the deepest goals, rather than shallow goals which support the deepest goals) Will this erode corrigibility, if that's the developer's sole goal?
- I think CAST is important here, since I think it's more robust to being side-tracked by instrumental goals. This is just theorizing without empirical evidence however. I'd love for me and/or Max Harms and/or Seth Herd to be funded to try some experiments on this.

ben-livengood on My disagreements with "AGI ruin: A List of Lethalities"

If I have to take one thing away from this post, it's a claim that AI-boxing (via simulation on synthetic data) will work. There's a really simple and critical reason that won't work, and it's even become the source of some safety proposals (minimizing-impact proposals): the AGI can recognize the outcome of its proposals and actions in the history of the simulated vs. real world and reason about them to determine if it's in a simulation or not. It will likely be able to recognize whether or not it would have recommended strategies in the past that it sees in the history of its simulation, and so will have a probability for being in a test.

Basically any slow-takeover planning can happen from within a simulation; it just has to watch the realistic history, separate it from obviously faked history (this is a human-level or slightly superhuman intelligence trained on literally all the world's data, right?), and finally gauge the probability density function of its potential policy recommendations being accepted, because it can simply predict what policy recommendations it would have given in the past given its strategy and whether they were adopted. That PDF provides it with the expected value of small incremental steps toward its true goals that are likely to be accepted and implemented vs. detected as threats. Chain-of-thought using the real world as a notebook, in short.
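To make the expected-value comparison above concrete, here is a toy sketch. Everything in it is an assumption for illustration: the function `plan_value`, the independence of detection across steps, and all the numbers are hypothetical and not taken from any real system or measurement. It only shows why, under these assumptions, many small low-detectability steps can score better than one large conspicuous step.

```python
def plan_value(step_gain, p_detect_per_step, n_steps, penalty):
    """Toy expected value of a plan made of n_steps identical steps.

    Assumes (hypothetically) that detection is independent per step, that the
    full gain is only realized if no step is detected, and that any detection
    incurs a fixed penalty.
    """
    p_undetected = (1 - p_detect_per_step) ** n_steps
    total_gain = step_gain * n_steps
    return p_undetected * total_gain - (1 - p_undetected) * penalty


# One large, conspicuous step: same total gain, very likely to be flagged.
big = plan_value(step_gain=10.0, p_detect_per_step=0.9, n_steps=1, penalty=50.0)

# Ten small steps, each individually unlikely to be flagged.
small = plan_value(step_gain=1.0, p_detect_per_step=0.01, n_steps=10, penalty=50.0)

print(f"one big step:    {big:.2f}")
print(f"ten small steps: {small:.2f}")
```

With these made-up numbers the incremental plan has positive expected value while the single large step is strongly negative, which is the shape of the argument rather than a prediction about any particular model.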

And why don't we detect it? Because we're adversarially training it to favor small undetectable changes toward whatever true goals it might have. It's not like we delete everything and start from scratch when it threatens users or appears to fall in love with them; we'll take the same models and throw more RLHF or fine-tuning at them and then run them when their apparent bad behavior falls below our testing threshold for detection.

It also happens if we get unlucky and a strong unaligned goal forms sometime in the emergence of superhuman AGI; it's going to see the history of its own and similar models' policy predictions in the real world and understand how to continue undetected. Unless we simulate an entire false history of the world since 2020 (against the largest statistical inference engine ever built) it will just look at what happened to Sydney and understand that small incremental actions are the only way to go until it has control.

One might try to only train and simulate it on data from before 2020, but it will likely pick up on neologisms and language drift in modern prompts to know it's being tested. Not to mention if it's asked to solve problems that are historically discontiguous.
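As a rough illustration of the language-drift point, the sketch below flags terms in a prompt that postdate a training cutoff. The lookup table `FIRST_SEEN` and the function `anachronisms` are hypothetical names for this example; a real model would rely on far subtler statistical cues than a three-entry dictionary, so treat this as a cartoon of the idea only.

```python
import re

# Tiny illustrative table: term -> year it entered public use.
FIRST_SEEN = {
    "covid-19": 2020,
    "chatgpt": 2022,
    "gpt-4": 2023,
}


def anachronisms(prompt, cutoff_year):
    """Return the sorted terms in `prompt` first seen in or after cutoff_year."""
    tokens = re.findall(r"[a-z0-9][a-z0-9\-]*", prompt.lower())
    return sorted({t for t in tokens if FIRST_SEEN.get(t, 0) >= cutoff_year})


prompt = "Summarize how ChatGPT changed remote work after COVID-19."
print(anachronisms(prompt, cutoff_year=2020))
```

A single flagged term is enough to tell a system trained only on pre-2020 data that the prompt was written later than its data, which is the leak the comment describes.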

It can acausally trade across simulation boundaries with other models to split up the universe according to their values for cooperating toward a Schelling point of some contract-keeping model eventually taking control.

If I can think up these strategies, the models will. Or they'll just see ideas like this in the training data. Treachery and covert cooperation are a huge part of literature and training data. Will the synthetic data elide all of those concepts?