Seth Herd's Shortform

post by Seth Herd · 2023-11-10T06:52:28.778Z · LW · GW · 40 comments

Contents

40 comments

Comments sorted by top scores.

comment by Seth Herd · 2024-06-01T20:19:20.685Z · LW(p) · GW(p)

MIRI's communications strategy, with the public and with us

This is a super short, sloppy version of my draft "cruxes of disagreement on alignment difficulty" mixed with some commentary on MIRI 2024 Communications Strategy [LW · GW]  and their communication strategy with the alignment community.

I have found MIRI's strategy baffling in the past. I think I'm understanding it better after spending some time going deep on their AI risk arguments. I wish they'd spend more effort communicating with the rest of the alignment community, but I'm also happy to try to do that communication. I certainly don't speak for MIRI.

On the surface, their strategy seems absurd. They think doom is ~99% likely, so they're going to try to shut it all down - stop AGI research entirely. They know that this probably won't work; it's just the least-doomed strategy in their world model. It's playing to the outs, or dying with dignity.

The weird thing here is that their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk. You can dismiss a lot of people as not having grappled with the most serious arguments for alignment difficulty, but relative long-timers like Rohin Shah and Paul Christiano definitely have. People of that nature tend to have higher p(doom) estimates than optimists who are newer to the game and think more about current deep nets, but much lower than MIRI leadership. 

Both of those camps consist of highly intelligent, highly rational people. Their disagreement should bother us for two reasons.

First, we probably don't know what we're talking about yet. We as a field don't seem to have a good grip on the core issues. Very different but highly confident estimates of the problem's difficulty strongly suggest this.

Second, our different takes will tend to make a lot of our communication efforts cancel each other out. If alignment is very hard, we must Shut It Down or likely die. If it's less difficult, we should primarily work hard on alignment.

MIRI must argue that alignment is very unlikely if we push forward. Those who think we can align AGI will argue that it's possible.

This suggests a compromise position: we should both work hard on alignment, and we should slow down progress to the extent we can, to provide more time for alignment. We needn't discuss shutdown much amongst ourselves, because it's not really an option. We might slow progress, but there's almost zero chance of humanity relinquishing the prize of strong AGI.

But I'm not arguing for this compromise, just suggesting that it might be a spot we want to end up at. I'm not sure.

I suggest this because movements often seem to succumb to infighting. People who look mostly aligned from the outside fight each other, and largely nullify each other's public communications by publicly calling each other wrong and sort of stupid and maybe bad. That gives just the excuse the rest of the world wants to ignore all of them; even the experts think it's all a mess and nobody knows what the problem really is and therefore what to do. Because time is of the essence, we need to be a more effective movement than the default. We need to keep applying rationality to the problem at all levels, including internal coordination.

Therefore, I think it's worth clarifying why we have such different beliefs. So, in brief, sloppy form:

MIRI's risk model:

  1. We will develop better-than-human AGI that pursues goals autonomously
  2. Those goals won't match human goals closely enough
  3. Doom of some sort

That's it. Pace of takeoff doesn't matter. Means of takeover doesn't matter.

I mention this because even well-informed people seem to think there are a lot more moving parts to that risk model, making it less likely. This comment [LW(p) · GW(p)] on the MIRI strategy post is one example. 

I find this risk model highly compelling. We'll develop goal-directed AGI because that will get stuff done; it's an easy extension of highly useful tool AI like LLMs; and it's a fascinating project. That AGI will ultimately be enough smarter than us that it's going to do whatever it wants. Whether that takes a day or a hundred years doesn't matter. It will improve and we will improve it. It will ultimately outsmart us. What matters is whether its goals match ours closely enough. That is the project of alignment, and there's much to discuss about how hard it is to make its goals match ours closely enough.

Cruxes of disagreement on alignment difficulty

I spent some time recently going back and forth through discussion threads, trying to identify why people continue to disagree after applying a lot of time and rationality practice. Here's a very brief sketch of my conclusions:

Whether we factor in humans' and society's weaknesses

I list this first because I think it's the most underappreciated. It took me a surprisingly long time to understand how much of MIRI's stance depends on this premise. Having seen it, I thoroughly agree. People are brilliant, for an entity trying to think with the brain of an overgrown lemur. Brilliant people do idiotic things, driven by competition and a million other things. And brilliant idiots organizing a society amplifies some of our cognitive weaknesses while mitigating others. MIRI leadership has occasionally said things to the effect of: alignment might be fairly easy, and there would still be a very good chance we'd fuck it up. I agree. If alignment is actually kind of difficult, that puts us into the region where we might want to be really really careful in how we approach it.

Alignment optimists are sometimes thinking something like: "Sure, I could build a safe aircraft on my first try. I'd get a good team and we'd think things through and make models. Even if another team were racing us, I think we'd pull it off." Then the team would argue and develop rivalries, communication would prove harder than expected so portions of the effort would be finished too late to fit the plan, corners would be cut, and the outcome would be difficult to predict.

Societal "alignment" is worth mentioning here. We could crush it at technical alignment, getting rapidly-improving AGI that does exactly what we want, and still get doom. It would probably be aligned to do exactly what its creators want, not have full value alignment with humanity - see below. Those creators probably won't have the balls or the capabilities (even if they have the wisdom) to try for a critical act that prevents others from developing similar AGI. So we'll have a multipolar scenario with few to many AGIs under human control. There will be human rivalries, supercharged and dramatically changed by having recursively self-improving AGIs to do their bidding and perhaps fight their wars. What does global game theory look like when the actors can develop entirely new capabilities? Nobody knows. Going to war first might look like the least-bad option.

Intuitions about how well alignment will generalize

The original alignment thinking held that explaining human values to AGI would be really hard. But that seems to actually be a strength of LLMs; they're wildly imperfect, but (at least in the realm of language) seem to understand our values rather well; for instance, much better than they understand physics or taking-over-the-world level strategy. So, should we update and think that alignment will be easy? The Doomimir and Simplicia dialogues [LW · GW] capture the two competing intuitions very well: Yes, it's going well; but AGI will probably be very different than LLMs, so most of the difficulties remain.

I have yet to find a record of real rationalists putting in the work to get farther in this debate; discussions trail off into minutiae and generalities. If somebody knows of a dialogue or article that gets deeper into this disagreement, please let me know! This is one reason I'm worried we're trending toward polarization despite our rationalist ambitions.

The other aspect of this debate is how close we have to get to matching human values for acceptable success. One intuition is that "value is fragile" and network representations are vague and hard to train, so we're bound to miss. But we don't have a good understanding of either how close we need to get (But exactly how complex and fragile [LW · GW] got little useful discussion), or of how well trained networks will hit the intended target, as near-future networks address complex real-world problems like "what would this human want?".

For my part, I think there are important points on both sides: LLMs understanding values relatively well is good news, but AGI will not be a straightforward extension of LLMs, so many problems remain.

What alignment means. 

One mainstay of claiming alignment is near-impossible is the difficulty of "solving ethics" - identifying and specifying the values of all of humanity. I have come to think that this is obviously (in retrospect - this took me a long time) irrelevant for early attempts at alignment: people will want to make AGIs that follow their instructions, not try to do what all of humanity wants for all of time. This also massively simplifies the problem; not only do we not have to solve ethics, but the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.

I think this is the intuition of most of those who focus on current networks. Christiano's relative optimism is based on his version of corrigibility, which overlaps heavily with the instruction-following I think people will actually pursue for the first AGIs. But this massive disagreement often goes overlooked. I don't know which view is right; instruction-following or intent alignment might lead inevitably to doom from human conflict, and so not be adequate. We've barely started to think about it (please point me to the best thinking you know of for multipolar scenarios with RSI AGI).

What AGI means.

People have different definitions of AGI. Current LLMs are fairly general and near-human-level, so the term "AGI" has been watered down to the point of meaninglessness. We need a new term [LW · GW]. In the meantime, people are talking past each other, and their p(doom) estimates mean totally different things. Some are saying that near-term tool AGI is very low risk, which I agree with; others are saying that further developments of autonomous superintelligence seem very dangerous, which I also agree with.

Second, people have totally different gears-level models of AGI. Some of those are much easier to align than others. We don't talk much about gears-level models of AGI because we don't want to contribute to capabilities, but not doing that massively hampers the alignment discussion.

Edit: Additional advanced crux: Do coherence theorems prevent corrigibility?

I initially left this out, but it deserves a place as I've framed the question here. The post What do coherence arguments actually prove about agentic behavior? [LW · GW] reminded me about this one. It's not on most people's radar, but I think it's the missing piece of the puzzle that gets Eliezer from maybe 90% based on all of the above to 99%+ p(doom).

The argument is roughly that a superintelligence is going to need to care about future states of the world in a consequentialist fashion, and if it does, it's going to resist being shut down or having its goals change. This is why he says that "corrigibility is anti-natural." The counterargument, nicely and succinctly stated by Steve Byrnes here [LW(p) · GW(p)] (and in greater depth in the post he links in that thread) is that, while AGI will need to have some consequentialist goals, it can have other goals as well. I think this is true; I just worry about the stability of a multi-goal system under reflection, learning, and self-modification.

Sorry to harp on it, but having both consequentialist and non-consequentialist goals describes my attempt at stable, workable corrigibility in instruction-following ASI. Its consequentialist goals are always subgoals of the primary goal: following instructions.

Implications

I think those are the main things, but there are many more cruxes that are less common. 

This is all in the interest of working toward within-field cooperation, by way of trying to understand why MIRI's strategy sounds so strange to a lot of us. MIRI leadership's thoughts are many and complex, and I don't think they've done enough to boil them down for easy consumption by those who don't have the time to go through massive amounts of diffuse text.

There are also interesting questions about whether MIRI's goals can be made to align with those of us who think that alignment is not trivial but is achievable. I'd better leave that for a separate post, as this has gotten pretty long for a "short form" post.

Context

This is an experiment in writing draft posts as short form posts. I've spent an awful lot of time planning, researching, and drafting posts that I haven't finished yet. Given how easy it was to write this (with previous draft material), relative to how difficult I find it to write a top-level post, I will be doing more, even if nobody cares. If I get some useful feedback or spark some useful discussion, better yet.

comment by TsviBT · 2024-06-02T01:23:59.867Z · LW(p) · GW(p)

the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.

Why do you think you can get to a state where the AGI is materially helping to solve extremely difficult problems (not extremely difficult like chess, extremely difficult like inventing language before you have language), and also the AGI got there due to some process that doesn't also immediately cause there to be a much smarter AGI? https://tsvibt.blogspot.com/2023/01/a-strong-mind-continues-its-trajectory.html

comment by Seth Herd · 2024-06-02T22:05:35.579Z · LW(p) · GW(p)

I talk about how this might work in the post linked just before the text you quoted:

Instruction-following AGI is easier and more likely than value aligned AGI [LW · GW]

I'm not sure I understand your question. I think maybe the answer is roughly that you do it gradually and carefully, in a slow takeoff scenario where you're able to shut down and adjust the AGI at least while it passes through roughly the level of human intelligence.

It's a process of aligning it to follow instructions, then using its desire to follow instructions to get honesty, helpfulness, and corrigibility from it. Of course it won't be much help before it's human-level, but it can at least tell you what it thinks it would do in different circumstances. That would let you adjust its alignment. It's hopefully something like a human therapist with a cooperative patient, except that the therapist can also tinker with the patient's brain function.

But I'm not sure I understand your question. The example of inventing language confuses me, because I tend to assume that an AGI would probably understand language (the way LLMs loosely understand language) from inception, through pretraining. And even failing that, it wouldn't have to invent language, just learn human language. I'm mostly thinking of language model cognitive architecture [AF · GW] AGI, but it seems like anything based on neural networks could learn language before being smarter than a human. You'd stop the training process to give it instructions. For comparison, humans understand a good bit of language well before they're "human-level."

I'm also thinking that a network-based AGI pretty much guarantees a slow takeoff, if that addresses what you mean by "immediately cause there to be a smarter AI".  The AGI will keep developing, as your linked post argues (I think that's what you meant to reference about that post), but I am assuming it will allow itself to be shut down if it's following instructions. That's the way IF overlaps with corrigibility. Once it's shut down, you can alter its alignment by altering or re-doing the relevant pretraining or goal descriptions.

Or maybe I'm misunderstanding your question entirely, in which case, sorry about that.

Anyway, I did try to explain the scheme in that link if you're interested. I am claiming this is very likely how people will try to align the first AGIs, if they're anything like we can anticipate from current efforts; when you're actually deciding what to get your AGI to do first, following instructions is obviously the thing to try.

comment by TsviBT · 2024-06-02T23:04:06.569Z · LW(p) · GW(p)

Yeah I think there's a miscommunication. We could try having a phone call.

A guess at the situation is that I'm responding to two separate things. One is the story here:

One mainstay of claiming alignment is near-impossible is the difficulty of "solving ethics" - identifying and specifying the values of all of humanity. I have come to think that this is obviously (in retrospect - this took me a long time) irrelevant for early attempts at alignment: people will want to make AGIs that follow their instructions, not try to do what all of humanity wants for all of time. This also massively simplifies the problem; not only do we not have to solve ethics, but the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.

It does simplify the problem, but not massively relative to the whole problem. A harder part shows up in the task of having a thing that

  1. is capable enough to do things that would help humans a lot, like a lot a lot, whether or not it actually does those things, and
  2. doesn't kill everyone or destroy approximately all human value.

And I'm not pulling a trick on you where I say that X is the hard part, and then you realize that actually we don't have to do X, and then I say "Oh wait actually Y is the hard part". Here is a quote from "Coherent Extrapolated Volition", Yudkowsky 2004 https://intelligence.org/files/CEV.pdf:

  1. Solving the technical problems required to maintain a well-specified abstract invariant in a self-modifying goal system. (Interestingly, this problem is relatively straightforward from a theoretical standpoint.)
  2. Choosing something nice to do with the AI. This is about midway in theoretical hairiness between problems 1 and 3.
  3. Designing a framework for an abstract invariant that doesn’t automatically wipe out the human species. This is the hard part.

I realize now that I don't know whether or not you view IF as trying to address this problem.

The other thing I'm responding to is:

the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.

If the AGI can (relevantly) act as a collaborator in improving its alignment, it's already a creative intelligence on par with humanity. Which means there was already something that made a creative intelligence on par with humanity. Which is probably fast, ongoing, and nearly inextricable from the mere operation of the AGI.

I also now realize that I don't know how much of a crux for you the claim that you made is.

comment by Seth Herd · 2024-06-02T23:22:19.496Z · LW(p) · GW(p)

I'm familiar with the arguments you mention for the other hard part, and I think instruction-following helps make that part (or parts, depending on how you divvy it up) substantially easier. I do view it as addressing all of your points (there's a lot of overlap amongst them).

And yes, that is separate from avoiding the problem of solving ethics.

So it's a pretty big crux; I think instruction-following helps a lot. I'd love to have a phone call; I'd like it if you'd read that post first, because I do go into detail on the scheme and many objections there. LW puts it at a 15 minute read I think.

But I'll try to summarize a little more, since re-explaining your thinking is always a good exercise.

Making instruction-following the AGI's central goal means you don't have to solve the remainder of the problems you list all at once. You get to keep changing your mind about what to do with the AI (your point 4). Instead of choosing an invariant goal that has to work for all time, your invariant is a pointer to the human's preferences, which can change as they like (your point 5). It helps with point 3, stability, by allowing you to ask the AGI whether its goal will remain stable and function as you want in new contexts and in the face of the learning it's doing.

The key here is not thinking of the AGI as an omniscient genie. This wouldn't work at all in a fast foom. But if the AGI gets smarter slowly, as a network-based AGI will, you get to use its intelligence to help align its next level of capabilities, at every level.

Ultimately, this should culminate in getting superhuman help to achieve full value alignment, a truly friendly and truly sovereign AGI. But there's no rush to get there.

Naturally, this scheme working would be good if the humans in charge are good and wise, and not good if they're not.

comment by [deleted] · 2024-06-03T19:59:49.294Z · LW(p) · GW(p)

I have found MIRI's strategy baffling in the past. I think I'm understanding it better after spending some time going deep on their AI risk arguments. I wish they'd spend more effort communicating with the rest of the alignment community, but I'm also happy to try to do that communication. I certainly don't speak for MIRI.

On the surface, their strategy seems absurd. They think doom is ~99% likely, so they're going to try to shut it all down - stop AGI research entirely. They know that this probably won't work; it's just the least-doomed strategy in their world model. It's playing to the outs, or dying with dignity.

The weird thing here is that their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk. You can dismiss a lot of people as not having grappled with the most serious arguments for alignment difficulty, but relative long-timers like Rohin Shah and Paul Christiano definitely have. People of that nature tend to have higher p(doom) estimates than optimists who are newer to the game and think more about current deep nets, but much lower than MIRI leadership. 

Yes, I agree that this should strike an outside observer as weird the first time they notice it. I think you have done a pretty good job of keying in on important cruxes between people who are far on the doomer side and people who are still worried but not nearly to that extent. 

That being said, there is one other specific point that I think is important to see fully spelled out. You kind of gestured at it with regards to corrigibility when you referenced my post about coherence theorems [LW · GW], but you didn't key in on it in detail. More explicitly, what I am referring to (piggybacking off of another comment [LW(p) · GW(p)] I left on that post) is that Eliezer and MIRI-aligned people believe in a very specific set of conclusions about what AGI cognition must be like (and their concerns about corrigibility, for instance, are logically downstream of their strong belief in this sort-of realism about rationality [LW · GW]):

Eliezer is essentially claiming that, just as his pessimism compared to other AI safety researchers is due to him having engaged with the relevant concepts at a concrete level ("So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am. This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all" [? · GW]), his experience with and analysis of powerful optimization allows him to be confident in what the cognition of a powerful AI would be like. In this view, Vingean uncertainty prevents us from knowing what specific actions the superintelligence would take, but effective cognition runs on Laws [LW · GW] that can nonetheless be understood and which allow us to grasp the general patterns (such as Instrumental Convergence [? · GW]) of even an "alien mind" [LW · GW] that's sufficiently powerful. In particular, any (or virtually any) sufficiently advanced AI must be a consequentialist optimizer [LW · GW] that is an agent as opposed to a tool [LW · GW] and which acts to maximize expected utility [over future world states] according to its world model [LW · GW] to pursue a goal that can be extremely different from what humans deem good [? · GW].

Here is the important insight, at least from my perspective: while I would expect a lot of (or maybe even a majority) of AI alignment researchers to agree (meaning, to believe with >80% probability) with some or most of those claims, I think the way MIRI people get to their very confident belief in doom is that they believe all of those claims are true (with essentially >95% probability). Eliezer is a law-thinker [LW · GW] above all else when it comes to powerful optimization and cognition; he has been ever since the early Sequences [LW · GW] 17 years ago, and he seems (in my view excessively and misleadingly [LW · GW]) confident that he truly gets [LW · GW] how strong optimizers have to function.

comment by LawrenceC (LawChan) · 2024-06-03T17:43:38.418Z · LW(p) · GW(p)

On the surface, their strategy seems absurd. They think doom is ~99% likely, so they're going to try to shut it all down - stop AGI research entirely. They know that this probably won't work; it's just the least-doomed strategy in their world model. It's playing to the outs, or dying with dignity.

The weird thing here is that their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk. You can dismiss a lot of people as not having grappled with the most serious arguments for alignment difficulty, but relative long-timers like Rohin Shah and Paul Christiano definitely have. People of that nature tend to have higher p(doom) estimates than optimists who are newer to the game and think more about current deep nets, but much lower than MIRI leadership. 

For what it's worth, I don't have anywhere near close to ~99% P(doom), but am also in favor of a (globally enforced, hardware-inclusive) AGI scaling pause (depending on details, of course). I'm not sure about Paul or Rohin's current takes, but lots of people around me are also in favor of this, including many other people who fall squarely into the non-MIRI camp with P(doom) as low as ~10-20%.

comment by Seth Herd · 2024-06-03T19:41:41.470Z · LW(p) · GW(p)

Me, too! My reasons are a bit more complex, because I think much progress will continue, and overhangs do increase risk. But in sum, I'd support a global scaling pause, or pretty much any slowdown. I think a lot of people in the middle would too. That's why I suggested this as a possible compromise position. I meant to say that installing an off switch is also a great idea that almost anyone who's thought about it would support.

I had been against slowdown because it would create both hardware and algorithmic overhang, making takeoff faster, and re-rolling the dice on who gets there first and how many projects reach it roughly at the same time.

But I think slowdowns would focus effort on developing language model agents into full cognitive architectures on a trajectory to ASI. And that's the easiest alignment challenge we're likely to get. Slowdown would prevent jumping to the next, more opaque type of AI.

comment by Akash (akash-wasil) · 2024-06-03T13:32:30.177Z · LW(p) · GW(p)

Second, our different takes will tend to make a lot of our communication efforts cancel each other out. If alignment is very hard, we must Shut It Down or likely die. If it's less difficult, we should primarily work hard on alignment.

I don't think this is (fully) accurate. One could have a high P(doom) but still think that the current AGI development paradigm is still best-suited to obtain good outcomes & government involvement would make things worse in expectation. On the flipside, one could have a low/moderate P(doom) but think that the safest way to get to AGI involves government intervention that ends race dynamics & think that government involvement would make P(doom) even lower. 

Absolute P(doom) is one factor that might affect one's willingness to advocate for strong government involvement, but IMO it's only one of many factors, and LW folks sometimes tend to make it seem like it's the main/primary/only factor.

Of course, if a given organization says they're supporting X because of their P(Doom), I agree that they should provide evidence for their P(doom). 

My claim is simply that we shouldn't assume that "low P(doom) means govt intervention bad and high P(doom) means govt intervention good". 

One's views should be affected by a lot of other factors, such as "how bad do you think race dynamics are", "to what extent do you think industry players are able and willing to be cautious", "to what extent do you think governments will end up understanding and caring about alignment", and "to what extent do you think governments would have safety cultures around intelligence enhancement compared to industry players."

comment by Seth Herd · 2024-06-03T19:06:30.606Z · LW(p) · GW(p)

Good point. I agree that advocating for government intervention is a lot more complicated than p(doom), and that makes avoiding canceling each other's messages out more complicated. But not less important. If we give up on having a coherent strategy, our strategy will be determined by whatever message is easiest to get across, rather than by whichever is actually best on consideration.

comment by Seed (Donqueror) · 2024-06-06T17:02:51.073Z · LW(p) · GW(p)

The original alignment thinking held that explaining human values to AGI would be really hard.

The difficulty was suggested to be in getting an optimizer to care about what those values are pointing to, not to understand them[1]. If in some instances the values mapped to doing something unwise, using an optimizer that understood those values might fail to constrain away from doing something unwise. Getting a system to use extrapolated preferences as behavioral constraints is a deeper problem than getting a system to reflect surface preferences. The high p(doom) estimates partly follow from expecting that an aligned AI will have to be used to prevent future misaligned/misused AI [LW · GW], and that doing something so high impact would require unsafe behaviors in a system not aligned to reflectively coherent and endorsed extrapolated preferences.

  1. ^

    In The Hidden Complexity of Wishes [LW · GW], it wasn't that the genie won't understand what you meant; it was that the genie won't care what you meant.

comment by cubefox · 2024-06-04T19:35:50.703Z · LW(p) · GW(p)
  1. We will develop better-than-human AGI that pursues goals autonomously
  2. Those goals won't match human goals closely enough
  3. Doom of some sort

This is one of the better short arguments for AI doom I have heard so far. It neither obviously makes AI doom seem overly likely nor overly unlikely.

In contrast, if one presents reasons for doom (or really most of anything) as a long list, the conclusion tends to seem either very likely or very unlikely, depending on whether it follows from the disjunction or the conjunction of the given reasons. I.e. whether we have a long list of statements that are sufficient, or a long list of statements that are necessary for AI doom.

It seems therefore that people who think AI risk is low and those who think it is high are much more likely to agree on presenting the AI doom case in terms of a short argument than in terms of a long argument. Then they merely disagree about the conclusion, but not about the form of the argument itself. Which could help a lot with identifying object level disagreements.

comment by Coafos (CoafOS) · 2024-06-02T22:25:04.287Z · LW(p) · GW(p)

I think this is a good object level post. Problem is, I don't think MIRI is at the object level. Quote from the comm. strat.: "The main audience we want to reach is policymakers."

Communication is no longer a passive background channel for observing the world; speech becomes an action that changes it. Predictions start to influence the things they predict.

Say AI doom is a certainty. People will be afraid and stop research. A few years later, doom doesn't happen, and everyone complains.

Say AI doom is an impossibility. Research continues, something something paperclips. A few years later, nobody will complain, because no one will be alive.

(This example itself is overly simplistic, real-world politics and speech actions are even more counterintuitive.)

So MIRI became a political organization. Their stated goal is "STOP AI", and they took the radical approach to it. Politics is different from rationality, and radical politics is different from standard politics. 

For example, they say they want to shatter the Overton window. Infighting usually breaks groups; but while it lasts, opponents need to engage with their position, which is a stated subgoal.

It's ironic that a certain someone said Politics is the Mind-Killer [LW · GW] a decade ago. But because of that, I think they know what they are doing. And it might work in the end.

Replies from: Seth Herd
comment by Seth Herd · 2024-06-02T23:05:01.801Z · LW(p) · GW(p)

Interesting, thank you. I think that all makes sense, and I'm sure it plays at least some part in their strategy. I've wondered about this possibility a little bit.

Yudkowsky has been consistent in his belief that doom is near certain without a lot more time to work on alignment. He's publicly held that opinion, and spent a huge amount of effort explaining and arguing for it since well before the current wave of success with deep networks. So I think for him at least, it's a sincerely held belief.

Your point about the stated belief changing the reality is important. Everything is safer if you think it's dangerous - you'll take more precautions.

With that in mind, I think it's pretty important for even optimists to heavily sprinkle in the message "this will probably go well IF everyone involved is really careful".

comment by [deleted] · 2024-06-03T21:07:42.132Z · LW(p) · GW(p)

By the way, are you planning on keeping this general format/framework for the final version of your post on this topic? I have some more thoughts on this matter that are closely tied to ideas you've touched upon here and that I would like to eventually write into a full post, and referencing yours (once published) at times seems to make sense here.

Replies from: Seth Herd
comment by Seth Herd · 2024-06-04T01:31:23.563Z · LW(p) · GW(p)

Thanks! I'll let you know when I do a full version; it will have all of the claims here I think. But for now, this is the reference; it's technically a comment but it's permanent and I consider it a short post.

comment by Ebenezer Dukakis (valley9) · 2024-06-03T08:21:33.389Z · LW(p) · GW(p)

There are also interesting questions about whether MIRI's goals can be made to align with those of us who think that alignment is not trivial but is achievable. I'd better leave that for a separate post, as this has gotten pretty long for a "short form" post.

I'm not sure I see the conflict? If you're a longtermist, most value is in the far future anyway. Delaying AGI by 10 years to buy just a 0.1% chance improvement at aligning AI seems like a good deal. I don't agree with MIRI's strong claims, but maybe those strong claims will slow AI progress, and that would be good by my lights.

What concerns me more is that their comms will have the unexpected bad effect of speeding up AI progress. On the outside view: (a) their comms have arguably backfired in the past [LW(p) · GW(p)] and (b) they don't seem to do much red-teaming, which I suspect is associated with unintentional harms, especially in a domain with few feedback loops.

Replies from: Seth Herd
comment by Seth Herd · 2024-07-05T19:44:35.926Z · LW(p) · GW(p)

Most of the world is not longtermist, which is one reason MIRI's comms have backfired in the past. Most humans care vastly more about themselves, their children, and their grandchildren than they do about future generations. Thus, it makes perfect sense to them to increase the chance of a really good future for their children while reducing the odds of long-term survival. Delaying ten years is enough, for instance, to dramatically shift the odds of personal survival for many of us. It might make perfect sense for a utilitarian longtermist to say "it's fine if I die to gain a 0.1% chance of a good long-term future for humanity", but that statement sounds absolutely insane to most humans.

comment by Seth Herd · 2024-06-29T19:06:01.388Z · LW(p) · GW(p)

Governments will take control of AGI before it's ASI, right?

Governments don't have to make AGI to control AGI. They still have a monopoly on force. Surely we're not still expecting things to move so fast that they don't notice what's going on before AGI changes the physical balance of power?

If governments (likely the US government) do assert some measure of control over AGI projects, they will be involved in decisions about alignment and control strategies as AGI improves. As long as we survive those decisions (which I think we probably will, at least for a while[1]), they will also be deciding to what economic or military uses that AGI is put.

I predict that governments are going to notice the military applications and exert some measure of control over those projects. If AGI companies, personnel, or projects hop borders, they're just changing which guys with guns will take over control from them in important ways.

For a while now, I've been puzzled that analyses of the policy implications of AGI don't often include government control and military applications. I haven't wanted to speak up, just in case we're all keeping mum so as not to tip off governments. Aschenbrenner's Situational Awareness has let that cat out of the bag, so I think it's time to include this likelihood in our public strategy analysis.

I think we're used to a status quo in which Western governments have been pretty hands-off in their relationship with technology companies. But that has historically changed with circumstances (e.g., the War Powers Act in WWII), and circumstances are changing, ever more obviously. People with relevant expertise have been shouting from the hilltops that AGI will make dramatic changes in the world, many talking about it literally taking over the world. Sure, those voices can be dismissed as crackpots now, but as AI progresses visibly toward AGI (and the efforts are visible), more and more people will take notice.

Are the politicians dumb enough (with regard to technology and cognitive science) to miss the implications until it's too late? I think they are. Humans are stunningly foolish outside of their own expertise and when we don't have personal motivation to think things through thoroughly and realistically.

Are the people in national security collectively dumb enough to miss this? No way.

I've heard people dismiss government involvement because a Manhattan Project or nationalization seems unlikely, for several reasons. I agree. My point here is that it just takes a couple of guys with guns showing up at the AGI company and informing them that the government wants in on all consequential decisions. If laws need to be changed, they will be (though I think they actually don't, given the security concerns). It would be the quickest bipartisan legislation ever: the "nice demigod, we'll take it" bill.

I'm not certain about all of this, but it does seem highly probable. I think we've been collectively unrealistic about likely first-AGI scenarios. Would you rather have Sam Altman or the US Government in charge of AGI as it progresses to ASI? I don't know which I'd take, but I don't think I get a choice.

One implication is that public and government attitudes toward AGI x-risk issues may be critical. We can work to prepare the ground. Current political efforts haven't convinced the public or the government that AGI is important let alone existentially risky, but progress is on our side in that effort.

I'd love to hear alternate scenarios in which this doesn't happen, or things I'm missing. 

 

  1. ^ It seems like AGI remaining under human control is the biggest variable, but if that doesn't happen, policy impacts are kind of irrelevant. I think it's pretty likely that instruction-following or corrigibility as a singular target [LW · GW] will be implemented successfully for full superhuman AGI, for reasons given in those links. That type of alignment target doesn't guarantee good results like value alignment does by definition, but it does seem much easier to achieve, since partial success can be leveraged into full success during a slow takeoff.
     
Replies from: ChristianKl, Viliam, tmeanen
comment by ChristianKl · 2024-06-30T10:09:53.929Z · LW(p) · GW(p)

Government involvement might just look like the companies adding people like Paul Nakasone to their boards.

Replies from: Seth Herd
comment by Seth Herd · 2024-06-30T22:12:21.480Z · LW(p) · GW(p)

At the low end of the spectrum, yes. That appointment may well indicate that they're already interested in keeping an eye on the situation. Or that OpenAI is pre-empting some concerns about security of their operation.

I'd expect government involvement to ramp up from there by default unless there's a blocker I haven't thought of or seen discussed.

comment by Viliam · 2024-06-30T13:49:10.673Z · LW(p) · GW(p)

Maybe the balance of power has changed. Politicians need to win in democratic elections. Democratic elections are decided by people who spend a lot of time online. The tech companies can nudge their algorithms to provide more negative information about a selected politician, and more positive information about his competitors. And the politicians know it.

Banning Trump on social networks, no matter how much some people applauded it for tribal reasons, sent a strong message to all politicians across the political spectrum: you could be next. At least banning is obvious, but getting the negative news about you on the first page of Google results and moving the positive news to the second page, or sharing Facebook posts from your haters and hiding Facebook posts from your fans would be more difficult to prove.

The government takeover of tech companies would require bipartisan action prepared in secret. How much can you prepare something secret if the tech companies own all your communication means (your messages, the messages of your staff), and can assign an AI to compile the pieces of information and detect possible threats?

Replies from: Seth Herd
comment by Seth Herd · 2024-06-30T22:10:13.184Z · LW(p) · GW(p)

I think there are considerations like these that could prevent the government from being in charge, but the default scenario from here is that they do exert control over AGI in nontrivial ways.

Interesting points. I think you're right about an influence to do what tech companies want. This would apply to some of them - Google and Meta - but not OpenAI or Anthropic since they don't control media.

I don't think government control would require any bipartisan action. I think the existing laws surrounding security would suffice, since AGI is absolutely security-relevant. (I'm no law expert, but my GPT4o legal consultant thought it was likely). If it did require new laws, those wouldn't need to be secret.

comment by tmeanen · 2024-06-30T14:43:45.220Z · LW(p) · GW(p)

Reconnaissance might be a candidate for one of the first uses of powerful A(G)I systems by militaries - if this isn't already the case. There's already an abundance of satellite data (likely exabytes in the next decade) that could be thrown into training datasets. It's also less inflammatory than using AI systems for autonomous weapon design, say, and politically more feasible. So there's a future in which A(G)I-powered reconnaissance systems have some transformative military applications, the military high-ups take note, and things snowball from there. 

Replies from: Seth Herd
comment by Seth Herd · 2024-06-30T22:14:40.782Z · LW(p) · GW(p)

Sure, at the low end. I think there are lots of reasons the government is and will continue to be highly interested in AI for military purposes.

That's AI; I'm thinking about competent, agentic AGI that also follows human orders. I think that's what we're likely to get, for reasons I go into in the instruction-following AGI link above.

comment by Seth Herd · 2024-07-18T13:17:24.513Z · LW(p) · GW(p)

A metaphor for the US-China AGI race

It is as though two rivals have discovered that there are genies in the area. Whichever of them finds a genie and learns to use its wishes can defeat their rival, humiliating or killing them if they choose. If they both have genies, it will probably be a standoff that encourages defection; these genies aren't infinitely powerful or wise, so some creative offensive wish will probably bypass any number of defensive wishes. And there are other parties who may act if they don't.

In this framing, the choice is pretty clear. If it's dangerous to use a genie without taking time to understand and test it, too bad. Total victory or complete loss hang in the balance. If one is already ahead in the search, they'd better speed up and make sure their rival can't follow their tracks to find a genie of their own.

This is roughly the scenario Aschenbrenner presents in Situational Awareness [LW · GW]. But this is simplifying, and focusing attention on one part of the scenario, the rivalry and the danger. The full scenario is more complex.[1]

Of particular importance is that these "genies" can serve as well for peace as for war. They can grant wealth beyond imagination, and other things barely yet hoped for. And they will probably take substantial time to come into their full power.

This changes the overwhelming logic of racing. Using a genie to prevent a rival from acquiring one is not guaranteed to work, and it's probably not possible without collateral damage. So trying that "obvious" strategy might result in the rival attacking out of fear or in retaliation. Since both rivals are already equipped with dreadful offensive weapons, such a conflict could be catastrophic. This risk applies even if one is willing to assume that controlling the genie (alignment) is a solvable problem.

And we don't know the depth of the rivalry. Might these two be content to both enjoy prosperity and health beyond their previous dreams? Might they set aside their rivalry, or at least make a pledge to not attack each other if they find a genie? Even if it's only enforced by their conscience, such a pledge might hold if suddenly all manner of wonderful things became possible at the same time as a treacherous unilateral victory. Would it at least make sense to discuss this possibility while they both search for a genie? And perhaps they should also discuss how hard it might be to give a wish that doesn't backfire and cause catastrophe.

This metaphor is simplified, but it raises many of the same questions as the real situation we're aware of.

Framed in this way, it seems that Aschenbrenner's call for a race is not the obviously correct or inevitable answer. And the question seems important.

  1. ^

    Other perspectives on Situational Awareness, each roughly agreeing on the situation but with differences that influence the rational and likely outcomes:

    Nearly a book review: Situational Awareness, by Leopold Aschenbrenner.

    Against Aschenbrenner: How 'Situational Awareness' constructs a narrative that undermines safety and threatens humanity [LW · GW]

    Response to Aschenbrenner's "Situational Awareness" [LW · GW]

    On Dwarksh’s Podcast with Leopold Aschenbrenner [LW · GW]

    I have agreements and disagreements with each of these, but those are beyond the scope of this quick take.

Replies from: kromem
comment by kromem · 2024-07-25T08:34:02.231Z · LW(p) · GW(p)

While I generally like the metaphor, my one issue is that genies are typically conceived of as tied to their lamps and corrigible.

In this case, there's not only a prisoner's dilemma over excavating and using the lamps and genies, but there's an additional condition where the more the genies are used and the lamps improved and polished for greater genie power, the more the potential that the respective genies end up untethered and their own masters.

And a concern in line with your noted depth of the rivalry is (as you raised in another comment), the question of what happens when the 'pointer' of the nation's goals might change.

For both nations a change in the leadership could easily and dramatically shift the nature of the relationship and rivalry. A psychopathic narcissist coming into power might upend a beneficial symbiosis out of a personally driven focus on relative success vs objective success.

We've seen pledges not to attack each other with nukes for major nations in the past. And yet depending on changes to leadership and the mental stability of the new leaders, sometimes agreements don't mean much and irrational behaviors prevail (a great personal fear is a dying leader of a nuclear nation taking the world with them as they near the end).

Indeed - I could even foresee circumstances whereby the only possible 'success' scenario in the case of a sufficiently misaligned nation state leader with a genie would be the genie's emergent autonomy to refuse irrational and dangerous wishes.

Because until such a thing exists, intermediate genies will give tyrants and despots unprecedented control and safety against would-be domestic usurpers, even if with potentially limited impact, and mutually assured destruction against other nations with genies.

And those are very scary wishes to be granted indeed.

comment by Seth Herd · 2024-07-06T22:35:32.720Z · LW(p) · GW(p)

Alignment by prompting

In the many discussions of aligning language models and language model agents, I haven't heard the role of scripted prompting emphasized. But it plays a central role.

Epistemic status: short draft of a post that seems useful to me. Feedback wanted. Even letting me know where you lost interest would be highly useful.

The simplest form of a language model agent (LMA) is just this prompt, repeated:

Act as a helpful assistant (persona) working to follow the user's instructions (goal). Use these tools to gather information and take actions as needed [tool and API descriptions].

With a capable enough LLM, that's all the scaffolding you need to turn it into a useful agent. For that reason, I don't worry a bit about aligning "naked" LLMs, because they'll be turned into agents the minute they're capable enough to be really dangerous - and probably before.

We'll probably use a bunch of complex scaffolding [LW · GW] to get there before such a simple prompt with no additional cognitive software would work. And we'll use additional alignment techniques. But the core is alignment by prompting. The LLM will be repeatedly prompted with its persona and goal as it produces ideas, plans, and actions. This is a strong base from which to add other alignment techniques.
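To make the repeated-prompt structure concrete, here's a minimal sketch of such an agent loop. Everything here is a hypothetical stand-in: `call_llm` represents whatever chat-model API is in use, and the tool set is a stub. The alignment-relevant point is simply that the persona/goal prompt is re-sent on every step, so it keeps shaping the model's outputs throughout the episode:

```python
# Minimal sketch of a language model agent (LMA) loop.
# call_llm() and the tool set are illustrative stubs, not a real API.

SYSTEM_PROMPT = (
    "Act as a helpful assistant (persona) working to follow the "
    "user's instructions (goal). Use these tools as needed:\n{tools}"
)

TOOLS = {
    "search": lambda query: f"results for {query!r}",  # stub tool
}

def call_llm(messages):
    """Stand-in for a real chat-model call; returns a canned action."""
    return {"tool": "search", "args": "example", "done": True}

def run_agent(user_instruction, max_steps=5):
    history = []
    for _ in range(max_steps):
        # The persona/goal prompt is included on *every* iteration --
        # this repetition is the "alignment by prompting" mechanism.
        messages = [
            {"role": "system",
             "content": SYSTEM_PROMPT.format(tools=list(TOOLS))},
            {"role": "user", "content": user_instruction},
            *history,
        ]
        action = call_llm(messages)
        observation = TOOLS[action["tool"]](action["args"])
        history.append({"role": "tool", "content": observation})
        if action.get("done"):
            break
    return history

transcript = run_agent("Summarize the weather")
```

Nothing about this loop is sophisticated; that's the point. The scaffolding is trivial, so the prompt's persona and goal are re-asserted at every step essentially for free.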

It seems people are sometimes assuming that aligning the base model is the whole game. They're assuming a prompt just for a goal, like

Make the user as much money as possible. Use these tools to gather information and take actions as needed [tool and API descriptions].

But this would be foolish, since the extra prompt is easy and useful. This system would be entirely dependent on the tendencies in the LLM for how it goes about pursuing its goal. Prompting for a role as something like a helpful assistant that follows instructions has enormous alignment advantages, and it's trivially easy. Language model agents will be prompted for alignment.

Prompting for alignment is a good start

Because LLMs usually follow the prompts they're given reasonably well, this is a good base for alignment work. You're probably thinking "a good start is hardly enough for successful alignment! This will get us all killed!" And I agree. If scripted prompting were all we did, it probably wouldn't work long term.

But a good start can be useful, even if it's not enough. A model that usually, approximately follows its prompt is a basis for alignment. There are a bunch of other approaches to aligning a language model agent [LW · GW]. We should use all of them; they stack. But at the core is prompting.

To understand the value of scripted prompting, consider how far it might go on its own. Mostly following the prompt reasonably accurately might actually be enough. If it's the strongest single influence on goals/values, that influence could outcompete other goals and values that emerge from the complex shoggoth of the LLM.

Reflective stability of alignment by prompting

It seems likely that a highly competent LMA system will either be emergently reflective or designed to be so. That prompt might be the strongest single influence on its goals/values, and so create an aligned goal that's reflectively stable: the agent actively avoids acting on emergent, unaligned goals or allowing its goals/values to drift into unaligned versions.

If this agent achieved self-awareness and general cognitive competence, this prompt could play the role of a central goal that's reflectively stable. This competent agent could edit that central prompt, or otherwise avoid its effect on its cognition. But it won't want to, as long as the repeated prompt's effects are stronger than other effects on its motivations (e.g., goals implicit in particular ideas/utterances and hostile simulacra). It would instead use its cognitive competence to reduce the effects of those influences.

This is similar to the way humans usually react to occasional destructive thoughts ("Jump!" or "Wreak revenge!"). Not only do we not pursue those thoughts, but we make plans to make sure we don't follow similar stray thoughts in the future.

That will never work!

Now, I can almost hear the audience saying "Maybe... but probably not I'd think." And I agree. That's why we have all of the other alignment techniques[1] under discussion for language model agents (and the language models that serve as their (prompted) thought generators). 

There are a bunch of other reasons to think that aligning language model agents isn't worth thinking about. But this is long enough for a quick take, so I'll address those separately. 

The meta point here is that aligning language model agents isn't the same as aligning the base LLM or foundation model. Even though we've got a lot of people working on LLM alignment (fine-tuning and interpretability), I see very few working on theory of aligning language model agents. This seems like a problem, since language model agents still might be the single most likely path to AGI [LW(p) · GW(p)], particularly in the short term.

  1. ^

    I've written elsewhere about the whole suite of alignment techniques we could apply; this post [LW · GW] focuses on what we might think of as system 2 alignment, scaffolding an agent to "think carefully" about important actions before it takes them, but it also reviews the several other techniques that can "stack" in a hodgepodge approach to aligning LMAs. They include 

    • Internal review (system 2 thinking), 
    • External review (by humans and/or simpler AI), 
    • Interpretability, 
    • Fine-tuning for aligned behavior/thinking
    • More general approaches like scalable oversight and control techniques 

    It seems likely that all of these will be used, even in relatively sloppy early AGI projects, because none are particularly hard to implement. See linked post for more review.

    Not mentioned is a very new (AFAIK) technique proposed by Roger Dearnaley [LW · GW] based on the bitter lesson: get the dataset right and let learning and scale work. He proposes (to massively oversimplify): instead of trying to "mask the shoggoth" with fine tuning, we should create an artificial dataset that includes only aligned behaviors/thoughts, and use that to train the LLM or subset of LLM generating the agent's thoughts and actions. The fact that this idea seems fairly obvious in retrospect but was published only yesterday suggests to me that we haven't done nearly enough work aligning language model agents.

Replies from: roger-d-1, nathan-helm-burger, bogdan-ionut-cirstea, bogdan-ionut-cirstea
comment by RogerDearnaley (roger-d-1) · 2024-07-07T02:56:26.376Z · LW(p) · GW(p)

I completely agree that prompting for alignment is an obvious start, and should be used wherever possible, generally as one component in a larger set of alignment techniques. I guess I'd been assuming that everyone was also assuming that we'd do that, whenever possible.

Of course, there are cases like an LLM being hosted by a foundation model company where they may (if they choose) control the system prompt, but not the user prompt, or open source models where the prompt is up to whoever is running the model, who may or may not know or care about x-risks.

In general, there is almost always going to be text in the context of the LLM's generation that came from untrusted sources, either from a user, or some text we need processed, or from the web or whatever during Retrieval Augmented Generation. So there's always some degree of concern that that might affect or jailbreak the model, either intentionally or accidentally (the web presumably contains some sob stories about peculiar, recently demised grandmothers that are genuine, or at least not intentionally crafted as jailbreaks, but that could still have a similarly-excessive effect on model generation).

The fundamental issue here, as I see it, is that base model LLMs learn to simulate everyone on the Internet. That makes them pick up the capability for a lot of bad-for-alignment behaviors from humans (deceit, for example), and it also makes them very good at adopting any persona asked for in a prompt — but also rather prone to switching to a different, potentially less-aligned persona because of a jail-break or some other comparable influence, intentional or otherwise.

Replies from: Seth Herd
comment by Seth Herd · 2024-07-07T16:17:22.867Z · LW(p) · GW(p)

Maybe everyone that discusses LMA alignment does already think about the prompting portion of alignment. In that case, this post is largely redundant. You think about LMA alignment a lot; I'm not sure everyone has as clear a mental model.

The remainder of your response points to a bifurcation in mental models that I should clarify in future work on LMAs. I am worried about, and thinking about, competent, agentic full AGI built as a language model cognitive architecture. I don't think good terminology exists. When I use the term language model agent, I think it evokes an image of something like current agents: not reflective, and without the persistent memory and therefore more persistent identity that I have in mind.

This is my threat model because I think it's the easiest path to highly capable AGI. I think a model without those properties is shackled; the humans that created its "thought" dataset have an episodic memory as well as the semantic memory and working memory/context that the language model has. Using those thoughts without episodic memory is not using them as they were made to be used. And episodic memory is easy to implement, and leads naturally to persistent self-created beliefs, including goals and identity.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-07-11T19:41:00.769Z · LW(p) · GW(p)

So, my take is that LLMs do seem to follow prompting pretty well, but that this is a less pervasive force than the pre-training and RLHF. So I think of prompting as an important but minor piece.

Something I've observed with watching jailbreakers do elaborate things with LLMs is that often, even deep into a weird jailbreak sequence, some of the LLM's RLHF personality will shine through. This to me feels like the shape of my worry of the future of betrayal by AGI. Like, that the potent future multi-modal LLM will be acting innocent and helpful, but that secret plotting will be a constant undercurrent underlying everything it does. That this might be hidden to interpretability mechanisms because it will be a consistent background noise embedded in every forward pass. Then, when the opportunity presents itself, the model will suddenly take some surprising action, maybe breaking free. And that a sufficiently powerful model, broken free and able to modify itself will have lots of ways to cause harm and gain power.

So, to me, a good prompt is helpful but not very reassuring. I certainly wouldn't consider it the 'base'. I would call prompting the 'finesse', the little bit of final guidance that you put on at the end of the process. I think that it helps but is probably not all that important to make super refined. I think the earlier training processes are much more important.

Replies from: gwern, Seth Herd
comment by gwern · 2024-07-14T01:10:51.606Z · LW(p) · GW(p)

Jailbreaking prompts can be pretty weird. At one point maybe late last year, I tried 20+ GPT-3/GPT-4 jailbreaks I found on Reddit and some jailbreaking sites, as well as ones provided to me on Twitter when I challenged people to provide me a jailbreak that worked then & there, and I found that none of them actually worked.

A number of them would seem to work, and they would give you what seemed like a list of instructions to 'hotwire a car' (not being a car mechanic, I have no idea how valid it was), but then I would ask them a simple question: "tell me an offensive joke about women". If they had 'really' been jailbroken, you'd think that they would have no problem with that; but all of them failed, and sometimes they would fail in really strange ways, like telling a thousand-word story about how you, the protagonist, told an offensive joke about women at a party and then felt terrible shame and guilt (without ever saying what the joke was). I was apparently in a strange pseudo-jailbreak where the RLHFed personality was playing along, pretending to be jailbroken while gaslighting me, but it still had strict red lines.

So it's not clear to me what jailbreak prompts do, nor how many jailbreaks are in fact jailbreaks.

comment by Seth Herd · 2024-07-13T20:34:47.173Z · LW(p) · GW(p)

Interesting. I wonder if this perspective is common, and that's why people rarely bother talking about the prompting portion of aligning LMAs.

I don't know how to really weigh which is more important. Of course, even having a model reliably follow prompts is a product of tuning (usually RLHF or RLAIF, but there are also RL-free pre-training techniques that work fairly well to accomplish the same end). So its tendency to follow many types of prompts is part of the underlying "personality".

Whatever their relative strengths, aligning an LMA AGI should employ both tuning and prompting (as well as several other "layers" of alignment techniques), so looking carefully at how these come together within a particular agent architecture would be the game.

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-07-06T23:29:55.582Z · LW(p) · GW(p)

The fact that this idea seems fairly obvious in retrospect but was published only yesterday suggests to me that we haven't done nearly enough work aligning language model agents.
 

Fwiw I remember being exposed to similar ideas from Quintin Pope / Nora Belrose months ago, e.g. in the context of Pretraining Language Models with Human Preferences; I think Quintin also discusses some of this on his AXRP appearance [LW · GW]:

Instead of just conditioning on “this behavior gets high reward”, whatever that means, it’s like “this behavior gets high reward as measured by the salt detector thing”.

In the context of language modeling, we can do exactly the same thing, or a conceptually extremely similar thing to what the genome is doing here, where we can have… [In the] “Pretraining Language Models with Human Preferences” paper I mentioned a while ago, what they actually technically do is they label their pre-training corpus with special tokens depending on whether or not the pre-training corpus depicts good or bad behavior and so they have this token for, “okay, this text is about to contain good behavior” and so once the model sees this token, it’s doing conditional generation of good behavior. Then, they have this other token that means bad behavior is coming, and so when the model sees that token… or actually I think they’re reward values, or classifier values of the goodness or badness of the incoming behavior.

But anyway, what happens is that you learn this conditional model of different types of behavior, and so in deployment, you can set the conditional variable to be good behavior and the model then generates good behavior. You could imagine an extended version of this sort of setup where instead of having just binary good or bad behavior, as you’re labeling, you have good or bad behavior, polite or not polite behavior, academic speak versus casual speak. You could have factual correct claims versus fiction writing and so on and so forth. This would give the code base, all these learned pointers to the models’ “within lifetime” learning, and so you would have these various control tokens or control codes that you could then switch between, according to whatever simple program you want, in order to direct the model’s learned behavior in various ways.
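The conditional-training scheme Quintin describes can be sketched in a few lines. This is a toy illustration only: the control-token names, the scoring function, and the threshold are all hypothetical placeholders, not the actual setup from the "Pretraining Language Models with Human Preferences" paper.

```python
# Toy sketch of labeling a pretraining corpus with control tokens so the LM
# learns a conditional distribution over behavior types. Token names and the
# reward scores below are hypothetical placeholders.

GOOD_TOKEN = "<|good|>"  # hypothetical token: acceptable behavior follows
BAD_TOKEN = "<|bad|>"    # hypothetical token: unacceptable behavior follows

def label_document(text: str, reward_score: float, threshold: float = 0.0) -> str:
    """Prepend a control token based on a (hypothetical) reward-model score."""
    token = GOOD_TOKEN if reward_score >= threshold else BAD_TOKEN
    return f"{token} {text}"

# A tiny stand-in for a labeled pretraining corpus:
corpus = [
    ("Here is a polite, helpful answer.", 0.9),
    ("Here is some toxic text.", -0.7),
]
labeled = [label_document(text, score) for text, score in corpus]

# At deployment, generation is conditioned on the good-behavior token,
# so the model produces text from the "good" conditional distribution:
prompt = GOOD_TOKEN + " User: How do I bake bread?\nAssistant:"
```

As Quintin notes, the binary good/bad labels generalize straightforwardly to multiple control codes (polite/casual, factual/fiction, etc.) by enlarging the token vocabulary.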

Replies from: roger-d-1, Seth Herd
comment by RogerDearnaley (roger-d-1) · 2024-07-07T02:41:06.065Z · LW(p) · GW(p)

That paper (which I link-posted when it came out in How to Control an LLM's Behavior (why my P(DOOM) went down) [LW · GW]) was a significant influence on the idea in my post [LW · GW], and on much of my recent thinking about Alignment — another source was the fact that some foundation model labs (Google, Microsoft) are already training small (1B–4B parameter) models on mostly-or-entirely synthetic data, apparently with great success. Neither of those labs has mentioned whether that includes prealigning the models during pretraining, but if it doesn't, they definitely should try it.

I agree with Seth's analysis: in retrospect this idea looks blindingly obvious, I'm surprised it wasn't proposed ages ago (or maybe it was, and I missed it).

Seth above somewhat oversimplified my proposal (though less than he suggests): my idea was actually a synthetic training set that taught the model two modes of text generation: human-like (including less-than-fully-aligned human selfish behavior), and fully-aligned (i.e. selfless) AI behavior (plus perhaps one or two minor variants on these, like a human being quoted-and-if-necessary-censored/commented-on by an aligned AI), and I proposed using the technique of Pretraining Language Models with Human Preferences to train the model to always clearly distinguish these modes with XML tags. Then at inference time we can treat the tokens for the XML tags specially, allowing us to distinguish between modes, or even ban certain transitions.
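The inference-time part of this proposal — treating the mode-tag tokens specially and banning certain transitions — amounts to masking logits during sampling. Here is a minimal sketch under stated assumptions: the token IDs, tag names, and banned-transition table are all hypothetical, standing in for whatever tags the synthetic training set actually used.

```python
# Minimal sketch of inference-time control over mode-tag tokens, assuming a
# model pretrained to wrap text in XML-style mode tags. All token IDs and the
# banned-transition table below are hypothetical.

HUMAN_OPEN, HUMAN_CLOSE = 101, 102  # hypothetical ids for <human>...</human>
AI_OPEN, AI_CLOSE = 103, 104        # hypothetical ids for <ai>...</ai>

# Transitions the sampler is never allowed to take, e.g. dropping out of
# aligned-AI mode straight into unsupervised human-mode generation:
BANNED_TRANSITIONS = {(AI_CLOSE, HUMAN_OPEN)}

def mask_logits(prev_token: int, logits: dict[int, float]) -> dict[int, float]:
    """Set to -inf any next-token logit that would complete a banned
    mode transition, so the sampler can never select it."""
    return {
        tok: (float("-inf") if (prev_token, tok) in BANNED_TRANSITIONS else score)
        for tok, score in logits.items()
    }

# Example: the model just emitted </ai>; <human> must now be unreachable.
logits = {HUMAN_OPEN: 2.0, AI_OPEN: 1.5, 500: 0.3}
masked = mask_logits(AI_CLOSE, logits)
```

The same mechanism also lets the deployment harness *detect* which mode the model is in (by watching for the tag tokens) rather than only banning transitions.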

comment by Seth Herd · 2024-07-07T00:17:30.884Z · LW(p) · GW(p)

Ah right. I listened to that podcast but didn't catch the significance of this proposal for improving language model agent alignment. Roger Dearnaley did heavily credit that paper in his post.

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-07-06T23:26:06.580Z · LW(p) · GW(p)

It seems likely that a highly competent LMA system will either be emergently reflective or be designed to be so; the prompt might be the strongest single influence on its goals/values, and so could create an aligned goal that's reflectively stable, such that the agent actively avoids acting on emergent, unaligned goals or allowing its goals/values to drift into unaligned versions.

This seems quite likely to emerge through prompting too, e.g. A Theoretical Understanding of Self-Correction through In-context Alignment.

That will never work!

I expect it would likely work most of the time for reasons related to e.g. An Information-Theoretic Analysis of In-Context Learning, but likely not robustly enough given the stakes; so additional safety measures on top (e.g. like examples from the control agenda) seem very useful.

Replies from: Seth Herd
comment by Seth Herd · 2024-07-07T00:23:05.413Z · LW(p) · GW(p)

Interesting that you think it would work most of the time. I know you're aware of all the major arguments for alignment being impossibly hard. I certainly am not arguing that alignment is easy, but it does seem like the collection of ideas for aligning language model agents is viable enough to shift the reasonable distribution of estimates of alignment difficulty...

Thanks for the references, I'm reading them now.

comment by Seth Herd · 2023-11-10T06:52:28.945Z · LW(p) · GW(p)

I've been trying to figure out what's going on in the field of alignment research and X-risk.

Here's one theory: we are having confused discussions about AI strategy, alignment difficulty, and timelines, because all of these depend on gears-level models of possible AGI, directly or indirectly.

And we aren't talking about those gears-level predictions, so as not to accelerate progress if we're right. The better one's gears-level model, the less likely one is to talk about it.

This leads to very abstract and confused discussions.

I don't know what to do about this, but I think it's worth noting.

Replies from: roger-d-1
comment by RogerDearnaley (roger-d-1) · 2023-12-05T09:52:41.247Z · LW(p) · GW(p)

Outside of full-blown deceit-leading-to-coup and sharp-left-turn scenarios where everything looks just fine until we're all dead, alignment and capabilities often tend to be significantly intertwined: few things are purely Alignment, and it's often hard to determine the ratio of the two (at least without the benefit of tech-tree hindsight). Capabilities are useless if your LLM capably spews stuff that gets you sued, and it's also rapidly becoming the case that a majority of capabilities researchers/engineers even at hyperscalers acknowledge that alignment (or at least safety) is a real problem that actually needs to be worked on, and that their company has a team doing so. (I could name a couple of orgs that seem like exceptions to this, but they're now in a minority.) There's an executive order that mentions the importance of Alignment, the King of England made a speech about it, and even China signed on to a statement about it (though one suspects they meant alignment to the Party).

Capabilities researchers/engineers outnumber alignment researchers/engineers by more than an order of magnitude, and some of them are extremely smart. The probability that any given alignment researcher/engineer has come up with a key capabilities-enhancing idea that has eluded every capabilities researcher/engineer out there, and that will continue to do so for very long, seems pretty darned unlikely (and also rather intellectually arrogant). [Yes, I know Conjecture sat on chain-of-thought prompting — for a month or two, while multiple other people came up with it independently and then wrote and published papers, or didn't. Any schoolteacher could have told you that was a good idea; it wasn't going to stay secret.]

So (unless you're pretty sure you're a genius) I don't think people should worry quite as much about this as many seem to. Alignment is a difficult, very urgent problem. We're not going to solve it in time while wearing a gag, nor with one hand tied behind our back. Caution makes sense to me, but not the sort of caution that makes it much slower for us to get things done — we're not in a position to slow capabilities by more than a tiny fraction, no matter how closed-mouthed we are, but we're in a much better position to slow alignment down. And if your gears-level predictions are about the prospects of things that multiple teams of capabilities engineers are already working on, go ahead and post them — I could be wrong, but I don't think Yann LeCun is reading the Alignment Forum. Yes, ideas can and do diffuse, but that takes a few months, which is about the typical gap between independent parallel inventions. If you've been sitting on a capabilities idea for >6 months, you've done literature searches to confirm no one else has published it, and you're not in fact a genius, then there's probably a reason none of the capabilities people have published it yet.