Agentized LLMs will change the alignment landscape

seth-herd

Agentized LLMs will change the alignment landscape

post by Seth Herd · 2023-04-09T02:29:07.797Z · LW · GW · 102 comments

102 comments

Epistemic status: head spinning, suddenly unsure of everything in alignment. And unsure of these predictions.

I'm following the suggestions in 10 reasons why lists of 10 reasons might be a winning strategy [LW · GW] in order to get this out quickly (reason 10 will blow your mind!). I'm hoping to prompt some discussion, rather than try to do the definitive writeup on this topic when this technique was introduced so recently.

Ten reasons why agentized LLMs will change the alignment landscape:

Agentized^[1] LLMs like Auto-GPT and Baby AGI may fan the sparks of AGI in GPT-4 into a fire. These techniques use an LLM as a central cognitive engine, within a recursive loop of breaking a task goal into subtasks, working on those subtasks (including calling other software), and using the LLM to prioritize subtasks and decide when they're adequately well done. They recursively check whether they're making progress on their top-level goal.
While it remains to be seen what these systems can actually accomplish, I think it's very likely that they will dramatically enhance the effective intelligence of the core LLM. I think this type of recursivity and breaking problems into separate cognitive tasks is central to human intelligence. This technique adds several key aspects of human cognition; executive function; reflective, recursive thought; and episodic memory for tasks, despite using non-brainlike implementations. To be fair, the existing implementations seem pretty limited and error-prone. But they were implemented in days. So this is a prediction of near-future progress, not a report on amazing new capabilities.
This approach appears to be easier than I'd thought. I've been expecting this type of self-prompting to imitate the advantages of human thought, but I didn't expect the cognitive capacities of GPT-4 to make it so easy to do useful multi-step thinking and planning. The ease of initial implementation (something like 3 days, with all of the code also written by GPT-4 for baby AGI) implies that improvements may also be easier than we would have guessed.
Integration with HuggingGPT and similar approaches can provide these cognitive loops with more cognitive capacities. This integration was also easier than I'd have guessed, with GPT-4 learning from a handful (e.g., 40) of examples how to use other software tools. Those tools will include both sensory capacities, with vision models and other sensory models of various types, and the equivalent of a variety of output capabilities.
Integration of recursive LLM self-improvement like "Reflexion" can utilize these cognitive loops to make the core model better at a variety of tasks.
Easily agentized LLMs is terrible news for capabilities. I think we'll have an internet full of LLM-bots "thinking" up and doing stuff within a year.
This is absolutely bone-chilling for the urgency of the alignment and coordination problems. Some clever chucklehead already created ChaosGPT, an instance of Auto-GPT given the goal to destroy humanity and create chaos. You are literally reading the thoughts of something thinking about how to kill you. It's too stupid to get very far, but it will get smarter with every LLM improvement, and every improvement to the recursive self-prompting wrapper programs. This gave me my very first visceral fear of AGI destroying us. I recommend it, unless you're already plenty viscerally freaked out.
Watching agents think is going to shift public opinion. We should be ready for more AI scares and changing public beliefs [LW · GW]. I have no idea how this is going to play out in the political sphere, but we need to figure this out to have a shot at successful alignment, because
We will be in a multilateral AGI world. Anyone can spawn a dumb AGI and have it either manage their social media, or try to destroy humanity. And over the years, those commercially available AGIs will get smarter. Because defense is harder than offense, it is going to be untenable to indefinitely defend the world against out-of-control AGIs. [LW · GW] But
Important parts of alignment and interpretability might be a lot easier than most of us have been thinking. These agents take goals as input, in English. They reason about those goals much as humans do, and this will likely improve with model improvements. This does not solve the outer alignment problem; one existing suggestion is to include a top-level goal of "reducing suffering." No! No! No!. This also does not solve the alignment stability problem [LW · GW]. Starting goals can be misinterpreted or lost to recursive subgoals, and if any type of continued learning is included, behavior will shift over time. It doesn't even solve the inner alignment problem if recursive training methods create mesa-optimizers in the LLMs. But it also provides incredibly easy interpretability, because these systems think in English.

If I'm right about any reasonable subset of this stuff, this lands us in a terrifying, promising new landscape of alignment issues. We will see good bots and bad bots, and the balance of power will shift. Ultimately I think this leads to the necessity of very strong global monitoring, including breaking all encryption, to prevent hostile AGI behavior. The array of issues is dizzying (I am personally dizzied, and a bit short on sleep from fear and excitement). I would love to hear others' thoughts.

^{^}
I'm using a neologism, and a loose definition of agency as things that flexibly pursue goals. That's similar to this more rigorous definition [AF · GW].

102 comments

Comments sorted by top scores.

comment by RomanHauksson (r) · 2023-04-09T06:34:35.185Z · LW(p) · GW(p)

Maybe one upside to the influx of "agents made with GPT-N API calls and software glue" is that these types of AI agents are more likely to cause a fire alarm [LW · GW]-y disaster which gets mitigated, thus spurring governments to take X-risk more seriously, as opposed to other types of AI agents, whose first disaster would blow right past fire alarm level straight to world-ending level?

For example, I think this situation is plausible: ~AutoGPT-N^[1] hacks into a supercomputer cluster or social-engineers IT workers over email or whatever in the pursuit of some other goal, but ultimately gets shut down by OpenAI simply banning the agent from using their API. Maybe it even succeeds in some scarier instrumental goal, like obtaining more API keys and spawning multiple instances of itself. However, the crucial detail is that the main "cognitive engine" of the agent is bottlenecked by API calls, so for the agent to wipe everyone out, it needs to overcome the hurdle of pwning OpenAI specifically.

By contrast, if an agent that's powered by an open-source language model gets to the "scary fire alarm" level of self-improvement/power-seeking, it might be too late, since it wouldn't have a "stop button" controlled by one corporation like ~AutoGPT-N has. It could continue spinning up instances of itself while staying under the radar.

This isn't to say that ~AutoGPT-N doesn't pose any X-risk at all, but rather that it seems like it could cause the kind of disaster which doesn't literally kill everyone but which is scary enough that the public freaks out and nations form treaties banning larger models from being trained, et cetera.

I'd like to make it very clear that I do not think it is a good thing that this type of agent might cause a disaster. Rather, I think it's good that the first major disaster these agents will cause seems likely to be non-existential.

^{^}
Future iteration of AutoGPT or a similar project

Replies from: None, Seth Herd

↑ comment by [deleted] · 2023-04-09T09:36:32.895Z · LW(p) · GW(p)

AutoGPT-N hacks into a supercomputer cluster or social-engineers IT workers over email or whatever in the pursuit of some other goal, but ultimately gets shut down by OpenAI simply banning the agent from using their API.

Could we even identify who did it to know that it was an instance of AutoGPT?

↑ comment by Seth Herd · 2023-04-09T18:09:09.003Z · LW(p) · GW(p)

This was my first thought on seeing AutoGPT. I wrote about this in AI scares and changing public beliefs. [LW · GW] But my second thought was that this is much more important. Not only might it work very well, it has immense advantages for initial alignment and corrigibility. This is potentially really good news.

comment by Matt Goldenberg (mr-hire) · 2023-04-10T02:14:34.234Z · LW(p) · GW(p)

This approach appears to be easier than I'd thought. I've been expecting this type of self-prompting to imitate the advantages of human thought, but I didn't expect the cognitive capacities of GPT-4 to make it so easy to do useful multi-step thinking and planning. The ease of initial implementation (something like 3 days, with all of the code also written by GPT-4 for baby AGI) implies that improvements may also be easier than we would have guessed

Having played with both BabyAGI and AutoGPT over the past few days, I'm actually surprised at how hard it is to get them to do useful multistep thinking and planning. Even things that I'd think an LLM would be good at, like writing a bunch of blogposts from a list, or book chapters from an outline, the LLM tends to get off track in a way I wouldn't expect from the coherency I see in chat interactions where I'm constantly giving the LLM hints about the topic, and can reroll or rewrite if it misunderstands. I think I was underestimating how much work those constant feedback and corrections from me are doing

Replies from: Bjartur Tómas, Seth Herd

↑ comment by Tomás B. (Bjartur Tómas) · 2023-04-10T16:01:58.942Z · LW(p) · GW(p)

Idk, I feel about this stuff like I felt about GPT-J [LW · GW]. What scares me is not how well it works, but that it kinda/sorta works a bit. It's a bunch of garbage python code wrapped around an API, and it kinda works. I expect people will push on this stuff hard, and am worried that DeepMind, OpenAI, and Google will be doing so in a much more principled way than the wild-west LLM enthusiast crowed.

I think it was wrong for people to take comfort in the meme that "GPT-N is not an agent" and this will become very clear to everyone in the next 18 months.

↑ comment by Seth Herd · 2023-04-12T04:35:21.553Z · LW(p) · GW(p)

I agree that it isn't very impressive out of the box. I think these techniques will improve over time. I'm not sure it's going to be the next big thing, but I do think it's worth thinking about the impact on alignment in case it is. As I think more about it, I see several other useful human cognitive capacities that can be emulated in the same way. They're not arcane, so I expect the group of people hacking away on Auto-GPT to be working on them right now. Time will tell, but we need to get ahead of the curve to have alignment solutions. My prediction is that wrappers will definitely add cognitive capacity, and that they might easily add a lot.

Replies from: eggsyntax

↑ comment by eggsyntax · 2023-04-12T17:08:26.758Z · LW(p) · GW(p)

Agreed, and note that there's substantial economic incentive for people to keep improving it, since a more independently-capable LLM-based agent is useful for more purposes. There are a whole host of startups right now looking for ways to enhance LLM-based systems, and a host of VCs wanting to throw money at them (examples on request, but I'm guessing most people have been seeing it online already).

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-12T17:45:59.684Z · LW(p) · GW(p)

This is an excellent point.

There are probably thousands of semi-entrepreneurial hackers working on this now. And a hundred thousand in a month. Many of them will share their best ideas. This will move fast, and we will see some of the potential quickly.

comment by JanPro · 2023-04-09T11:18:56.779Z · LW(p) · GW(p)

Ad ChaosGPT:

Attempting to create (even weak) agent tasked with "destroying humanity" should be made very clear to be out of bounds of acceptable behavior. I feel that I want the author to be prosecuted.

Now the meme is: "haha we can tell AI to hurt us and make fun of how it fails"

This would obviously backfire if the substrate were able to cause lots of damage.

What I would like the meme to be: this is extremely unethical, deserving outrage and perhaps attempted terrorism.

Replies from: awg, Prometheus, lc, NatCarlinhos, sky-moo

↑ comment by awg · 2023-04-09T15:50:53.667Z · LW(p) · GW(p)

I wonder if/when/how quickly this will be criminalized in a manner similar to terrorism or using weapons of mass destruction.

Replies from: Prometheus

↑ comment by Prometheus · 2023-04-10T16:33:31.125Z · LW(p) · GW(p)

If we're being realistic, this kind of thing would only get criminalized after something bad actually happened. Until then, too many people will think "omg, it's just a Chatbot". Any politician calling for it would get made fun of on every Late Night show.

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-10T22:23:06.206Z · LW(p) · GW(p)

I'm almost certain this is already criminal, to the extent it's actually dangerous. If you roll a boulder down the hill, you're up for manslaughter if it kills someone, and reckless endangerment if it could've but didn't hurt anyone. It doesn't matter if it's a boulder or software; if you should've known it was dangerous, you're criminally liable.

In this particular case, I have mixed feelings. This demonstration is likely to do immense good for public awareness of AGI risk. It even did for me, on an emotional level I haven't felt before. But it's also impossible to know when a dumb bot will come up with a really clever idea by accident, or when improvements have produced emergent intelligence. So we need to shut this down as much as possible as get to better capabilities. Of course, criminal punishments reduce bad behavior, but don't eliminate it. So we also need to be able to detect and prevent malicious bot behavior, and keep up with prevention techniques (likely with aligned, better AGI from bigger corporations) as it gets more capable.

↑ comment by Prometheus · 2023-04-10T16:31:05.182Z · LW(p) · GW(p)

Yeah, all the questions over the years of "why would the AI want to kill us" could be answered with "because some idiot thought it would be funny to train an AI to kill everyone, and it got out of hand". Unfortunately, stopping everyone on the internet from doing things isn't realistic. It's much better to never let the genie out of the bottle in the first place.

↑ comment by lc · 2023-04-10T22:39:22.461Z · LW(p) · GW(p)

Attempting to create (even weak) agent tasked with "destroying humanity" should be made very clear to be out of bounds of acceptable behavior. I feel that I want the author to be prosecuted.

This seems like a bit much.

↑ comment by NatCarlinhos · 2023-04-11T01:27:35.251Z · LW(p) · GW(p)

One thing that seems worth mentioning is that, based on my understanding of Alignment Theory, if some smarter version of ChaosGPT did kill all humans, it wouldn't be because of the instructions it was given, but for the same reason any unaligned AI would kill all humans - that is, because it's unaligned. It's hard for me to imagine a scenario where an unaligned agent like ChaosGPT would be more likely to kill everyone than any given unaligned AI; the whole deal with the Outer Alignment Problem is that we don't yet know how to get agents to do the things we want them to do, regardless of whether those things are benevolent or destructive or anything in between.

Still, I agree that this sets a horrible precedent and that this sort of thing should be prosecuted in the future, if only because at some point if we do solve Alignment, an agent like ChaosGPT could be dangerous for (obvious) different reasons, unrelated to being unaligned.

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-11T01:35:15.976Z · LW(p) · GW(p)

I think most alignment people would use "aligned" as I am here. ChaosGPT is aligned to the intent of the person that set it in motion, (at least initially) even though it's not aligned to the values of humanity at large. That would be bad outer alignment and bad coordination in the way I'm using those terms.

And it will destroy humanity (if it gets smart enough to) for a very different reason than an unaligned AGI would. That's it's goal, while for an unaligned AGI it would be a subgoal or a side effect.

It's increasingly incorrect to say that we have no idea how to get an AGI to do what we want. We have no idea how to do that in closed form code instructions, but the limited success of RLHF and other training indicates that we have at least some ability to steer the behavior of deep networks. I think it's still fair to say that we don't have methods we can be confident of, or that are stable over time and learning. I'm nominating this approach of giving explicit goals in language as our new best shot.

↑ comment by Sky Moo (sky-moo) · 2023-04-11T17:16:50.391Z · LW(p) · GW(p)

I understand your emotional reaction to ChaosGPT in particular, but I actually think it's important to keep in mind that ChaosGPT is equally as dangerous as AutoGPT when asked to make cookies, or make people smile. It really doesn't matter what the goal is, it's the optimization that leads to these instrumental biproducts that may lead to disaster.

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-11T17:54:00.132Z · LW(p) · GW(p)

Good point. It would be an even better emotional impact and intuition pump to see an agentized LLM arrive at destroying humanity as a subgoal of some other objective.

Somebody put in producing paperclips as a goal to one of these; I've forgotten where I saw it. Maybe it was a baby AGI example? That one actually recognized the dangers and shifted to researching the alignment problem. That seemed to be the result of how the paperclip goal is linked to that issue in internet writing, and the RLHF and other ethical safeguards built into GPT4 as the core LLM. That example unfortunately sends the inaccurate opposite intuition, that these systems automatically have safeguards and ethics. They have that only when using an LLM with those things built in, and they're still unreliable.

comment by osten · 2023-04-09T05:54:18.256Z · LW(p) · GW(p)

But it also provides incredibly easy interpretability, because these systems think in English.

I'm not sure this point will stand because it might be cheaper to have them think in their own language: https://www.lesswrong.com/posts/bNCDexejSZpkuu3yz/you-can-use-gpt-4-to-create-prompt-injections-against-gpt-4 [LW · GW]

Replies from: Seth Herd, ic-rainbow

↑ comment by Seth Herd · 2023-04-09T06:28:20.958Z · LW(p) · GW(p)

I think that could become a problem. I think that type of compression is useful but not really dramatic, but I'm not sure. So there would still be an alignment tax. But it might be small enough to not prevent people from paying it. Keeping internal processes in natural language will also make for easier debugging just to get the system to work well and do things you want.

People are still going have to want to make aligned systems. This approach might just make it a whole lot easier.

↑ comment by IC Rainbow (ic-rainbow) · 2023-04-10T22:21:02.020Z · LW(p) · GW(p)

I doubt that any language less represented than English (or JS/Python) would be better since the amount of good data to ingest would be much less for them.

comment by Thane Ruthenis · 2023-04-09T07:46:32.875Z · LW(p) · GW(p)

Funny, Auto-GPT stuff actually makes me less worried about GPT-4 and its scale-ups. It's been out for weeks, less impressive variants were out for longer, and so far, nothing much has come from it. Looking at the ChaosGPT video... I would've predicted that it wasn't actually any good at pursuing its goal, that it just meandered around the "kill everyone" objective without ever meaningfully progressing — and lo and behold, it's doing exactly that. Actually, it's worse at it than I'd expected.

I see the case for doom, I do! It's conceivable that it will turn out in this manner. We're witnessing an AI planning, here, and it's laughably bad planning so far, but the mere fact that they can do it at all implies a readily-available possibility of making them massively better at it. So in e. g. five more years, we'd get AIs whose planning skills to ChaosGPT as Midjourney is to PixelCNN, and then maybe one of them FOOMs.

But mostly, I agree with this view [LW(p) · GW(p)]. And this is an instance of the "wire together GPT models to get an AGI" attempt failing, and on my view it's failing in a way that's providing some evidence this entire approach won't work. It's conceivable that it'd work with GPT≥5, or with a more clever setup... But this is an update away from that. (Though I'm not as optimistic about the next architectural advance requiring 10-15 years; it may be just around the corner. But scaled-up LLMs haven't really impressed me since GPT-3.)

Replies from: Nanda Ale, Seth Herd

↑ comment by Nanda Ale · 2023-04-09T09:05:14.071Z · LW(p) · GW(p)

I'd be wary of generalizing too much from Auto-GPT. It's in a weird place. It's super popular as a meme anyone can run - you don't have to be a programmer! But skimming the github the vast vast majority of people are getting hung up on fiddly technical and programming bits. And people who wouldn't get hung up on that stuff don't really get much out of Auto-GPT. There's some overlap -- it's a very entertaining idea and thing to watch, the idea of it being hands off. I personally watched it like a TV show for hours, and it going off the rails was part of the fun.

Like I'm no expert, I just got way too addicted to goofing around with LLMs, and the way Auto-GPT is trying to make this work seems obviously flawed to me. Not the software quality - I don't much about that - but the main idea and the structure of the interacting prompts seems like just clearly not the way to go. I don't know the right way, but it's not that.

Even more so for ChaosGPT, where the author (to me) looks like somebody trying to maximize entertainment, not a working product.

That said Auto-GPT is actually getting better quickly. AI time moves fast. And it's so popular that a lot of people are tinkering and eyes on it. So it might actually do something like the original concept eventually. But I would bet something completely different (specifically a project that isn't trying to be a plug-and-play solution anyone can run on their own computer) is where the most capable solutions will be.

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2023-04-09T09:59:49.056Z · LW(p) · GW(p)

In my view, if something like Auto-GPT can work, its ability to work is probably not too sensitive to the exact implementation of the wrapper. If GPT-4 has the raw capability to orient itself to reality and navigate it, it should be able to do that with even bare-bones self-prompt/prompted self-reflection ability. Something like Auto-GPT should be more than enough. So the failure is suggestive, is evidence about this whole landscape of approaches.

I agree that it's possible that more nuanced wrapper designs would work, but I don't place much probability on that.

Replies from: Nanda Ale, Vladimir_Nesov, Roger Dearnaley

↑ comment by Nanda Ale · 2023-04-09T11:59:37.662Z · LW(p) · GW(p)

I'm not confident at all Auto-GPT could work at its goals, just that in narrower domains the specific system or arrangement of prompt interactions matters. To give a specific example, I goof around trying to get good longform D&D games out of ChatGPT. (Even GPT-2 fine-tuned on Crit Role transcripts, originally.) Some implementations just work way better than others.

The trivial system is no system - just play D&D. Works great until it feels like the DM is the main character in Memento. The trivial next step, rolling context window. Conversation fills up, ask for summary, start a new conversation with the summary. Just that is a lot better. But you really feel loss of detail in the sudden jump, so why not make it continuous. A secretary GPT with one job, prune the DM GPT conversation text after every question and answer, always try to keep most important and most recent. Smoother than the summary system. Maybe the secretary can not just delete but keep some details instead, maybe use half its tokens for a permanent game-state. Then it can edit useful details in/out of the conversation history. Can the secretary write a text file for old conversations? Etc. etc.

Maybe the difference is the user plays the D&D, so you know immediately when it's not working well. It's usually obvious in minutes. Auto-GPT is supposed to automatic. So they add features and just kind of hope the AI figures it out from there. They don't get the immediate "this is not working at all" feedback. Like they added embeddings 5 days ago - it just prints the words "Permanent memory:" in the prompt, followed by giant blogs up to 2500 tokens of the most related text from Pinecone. Works great for chatbots answering a single question about technical documentation. Real easy to imagine how it could fall apart when does iteratively over longer time periods. I can't imagine this would work for a D&D game, it might be worse than having no memory. My gut feeling is you pull in the 2500 most related tokens of content into your prompt and the system is overall more erratic. You get the wrong 2500 tokens, it overwhelms whatever the original prompt was, now what is your agent up to? Just checked now, it changed to "This reminds you of these events from your past:". That might actually make it somewhat less likely to blow up. Basically making the context of the text more clear: "These are old events and thoughts, and you are reminded of them, don't take this text too seriously, this text might not even be relevant so maybe you should even ignore it. It's just some stuff that came to mind, that's how memories work sometimes."

↑ comment by Vladimir_Nesov · 2023-04-09T14:05:37.068Z · LW(p) · GW(p)

If GPT-4 has the raw capability to orient itself to reality and navigate it, it should be able to do that with even bare-bones self-prompt/prompted self-reflection ability.

GPT-4 by itself can't learn, can't improve its intuitions and skills in response to new facts of the situations of its instances (that don't fit in its context). So the details of how the prosthetics that compensate for that are implemented (or guided in the way they are to be implemented) can well be crucial.

And also, at some point there will be open sourced pre-trained and RLAIFed models of sufficient scale that allow fine-tuning, that can improve their intuitions, at which point running them inside an improved Auto-GPT successor might be more effective than starting the process from scratch, lowering the minimum necessary scale of the pre-trained foundational model.

Which increases the chances that first AGIs are less intelligent than they would need to be otherwise. Which is bad for their ability to do better than humans at not building intentionally misaligned AGIs the first chance they get.

↑ comment by Roger Dearnaley · 2023-04-22T04:14:45.424Z · LW(p) · GW(p)

It's also quite likely that something like Auto-GPT would work a lot better using a version of LLM that had been fine-tuned/reinforcement-trained for this specific usecase -- just as Chat-GPT is a lot more effective as a chatbot than the underlying GPT-3 model was before the specialized training. If the LLM is optimized for the wrapper and the wrapper designed to make efficient use of the entire context-size of the LLM, thinks are going to work a lot better.

Replies from: roger-d-1

↑ comment by RogerDearnaley (roger-d-1) · 2023-12-05T09:09:24.628Z · LW(p) · GW(p)

7 months later, we now know that this is true. Also, we now know that you can take output from a prompted/scaffolded LLM and use it to fine-tune another LLM to do the same things without needing prompt/scaffold.

Replies from: RohanS

↑ comment by RohanS · 2024-07-25T08:40:48.319Z · LW(p) · GW(p)

Could you please point out the work you have in mind here?

↑ comment by Seth Herd · 2023-04-09T18:12:16.856Z · LW(p) · GW(p)

Isn't there a pretty big difference between "doesn't work after a couple weeks" and "doesn't work?" Time will tell, but I really think this general approach is going to go far. Just for instance, if a setup gets into loops, how about asking GPT "does this list of recent goals look repetitive to you?" and conditioning its next steps on that. In general, having a sub-sentient linguistic engine available makes it really easy to make loops that do sensible things. It will still take some engineering, though, so little wonder it's not super effective right now.

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2023-04-09T18:34:01.521Z · LW(p) · GW(p)

I mean, I'm sure some tinkering will improve this thing's capabilities somewhat, and the range of uses. I just don't think it'd create any qualitatively novel capabilities, like making it an actual AGI; or that this sort of tinkering will play a pivotal part in the extinction event.

In my view, if an AI model has "transformative" capabilities, it'd probably either show them out of the box, or given just a minimal self-prompt wrapper. On the other hand, if it doesn't have them, no simple wrapper is going to make it transformative.

I guess a particularly clever wrapper can make it have "indirect" upturning effects, like making it really easy for anyone to write complex software or something. But that's not quite the class of outcomes I see spiraling into an omnicide.

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-09T23:27:07.215Z · LW(p) · GW(p)

I agree that improving AutoGPT won't turn GPT-4 into human-level, omnicide-capable AGI.

But what about those improvements also running GPT-6?

I still see more reason for rejoicing than for fear. I really hope this technique IS super successful. Because then you have an AI with goals stated in intuitive natural language, and that thinks in natural language for easy monitoring by a simple sentiment/deception/"evil" detecting AI monitoring all of its many thoughts (and there will be many). This makes initial alignment and corrigibility trivially easy, so most people will actually do them. And it makes interpretability so vastly much easier.

Replies from: Roger Dearnaley, Thane Ruthenis

↑ comment by Roger Dearnaley · 2023-04-22T04:44:30.699Z · LW(p) · GW(p)

I agree that an Auto-GPT style agent using self-editing context is likely to communicate in a natural language. Probably English, as the language GPT-4's training data contained the most of. A future Auto-GPT-N and attempting to avoid monitoring of its stream-of-consciousness might start speaking a more obscure language, or doing a simple encryption like Base64 encoding: it's not hard to get GPT3.5 to take input and give output in Base64, but it's enough extra cognitive load per token, and/or messes up the token embeddings enough, that the resulting very-lightly-encrypted LLM is fairly brain damaged. So this is an example of a 'misalignment tax', where attempting to be deceptive imposes a significant additional cost.

Of course Base64 isn't a cryptographically strong encryption scheme. Even for some very smart future AutoGPT-N running on GPT-N, it couldn't use encryption in its context stream without somewhere to put the key (or some riddle or set of trivia questions whose answer was the key, or whatever). However, if smart enough, it might be able to do something like steganographically concealing a side-channel message in an innocent-looking English text, say as some form of lightly-encrypted/compressed acrostic, but it would have to be smart enough that every time it ran GPT-N against the text to generate another token, that spotted the acrostic, figured out how to decrypt it, paid attention to the side-channel message rather than the main one, and played along and continued this exercise. Not theoretically impossible for something enough smarter than a human, but the misalignment tax gets a good deal bigger, and it's not obvious why something trained on imitating text generated by humans would learn this ability, except through inadvertent fine-tuning on text contaminated with such acrostic stenographic side-channels embedded. Obviously an aligned, comparably smart AutoGPT-N not paying the no-alignment overhead of having to steganographically conceal its thoughts that was used for monitoring the misaligned Auto-GPT-N shouldn't find it too hard to catch on to what's happening here, even if a human couldn't do it. If nothing else, the pressure of having to fulfill the acrostic requirements for the side-channel is going to make it hard to make the innocent-looking text plausible, presumably raising its perplexity when analyzed by an LLM that hasn't spotted the side-channel.

↑ comment by Thane Ruthenis · 2023-04-10T09:02:58.046Z · LW(p) · GW(p)

But what about those improvements also running GPT-6?

Same reasoning: I expect that GPT-N will be omnicide-capable out of the box / with a minimal self-prompting wrapper, else not at all. For any given AI model, a marginally better wrapper isn't going to tide it over to transformative AI. Thus, if a new model is released, and the first dumb "let's give it agency!" idea doesn't work, we can probably relax about that specific model entirely. (This is mainly in opposition to your original claim that Auto-GPT can "fan the sparks of AGI in GPT-4 into a flame".)

On a larger scale, if progressively larger and more capable models based on a given architecture keep not taking off when put in a self-prompt wrapper, and they keep failing in the exact same way, that's probably evidence that the entire architecture is safe. And I think GPT-4 is failing in the same way GPT-3 or GPT-2 would've.

Not to say that I am, at this point, utterly confident that GPT-N isn't going to take off; I'm not. But inasmuch as Auto-GPT's performance is evidence for or against that, I think it's evidence against.

Because then you have an AI with goals stated in intuitive natural language

Yeah, that's... part of the reason I don't expect this to work. I don't think any text output should be viewed as the LLM's "thoughts". Whatever thoughts it has happen inside forward passes, and I don't think it natively maps them into the human-legible monologues in which the wider Auto-GPT "thinks". I think there's a fundamental disconnect between the two kinds of "cognition", and the latter type is much weaker.

If GPT-N were AGI, it would've recognized the opportunity offered by the self-wrapper, and applied optimization from its end, figured out how to map from native-thoughts into language-thoughts, and thereby made even the dumbest wrapper work. But it didn't do that, and I don't think any improvement to the wrapper is going to make it do that, because it fundamentally can't even try to figure out how. The problem is on its end, within the frozen parameters, in the lacking internal architecture. Its mental ontology doesn't have the structures for even conceiving of performing this kind of operation, and it's non-AGI so it can't invent the idea on the fly.

(Man, I'm going to eat so much crow when some insultingly dumb idea on the order of "let's think step-by-step!" gets added to the Auto-GPT wrapper next month and it takes off hard and kills everyone.)

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-10T13:03:40.456Z · LW(p) · GW(p)

Interesting. Thanks for your thoughts. I think this difference of opinion shows me where I'm not fully explaining my thinking. And some differences between human thinking and LLM "thinking". In humans, the serial nature of linking thoughts together is absolutely vital to our intelligence. But LLMs have a lot more seriality in the production of each utterance.

I think I need to write another post that goes much further into my reasoning here to work this out. Thanks for the conversation.

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2023-04-10T13:20:17.720Z · LW(p) · GW(p)

Glad it was productive!

I think this difference of opinion shows me where I'm not fully explaining my thinking

I perceive a lot of inferential distance on my end as well. My model here is informed by a number of background conclusions that I'm fairly confident in, but which haven't actually propagated into the set of commonly-assumed background assumptions.

Replies from: beren

↑ comment by beren · 2023-04-10T20:51:27.888Z · LW(p) · GW(p)

I perceive a lot of inferential distance on my end as well. My model here is informed by a number of background conclusions that I'm fairly confident in, but which haven't actually propagated into the set of commonly-assumed background assumptions.

I have found this conversation very interesting. Would be very interested if you could do a quick summary or writeup of the background conclusions you are referring to. I have my own thoughts about the feasibility of massive agency gains from AutoGPT like wrappers but would be interested to hear your thoughts

Replies from: Thane Ruthenis, Thane Ruthenis

↑ comment by Thane Ruthenis · 2023-05-03T15:23:15.075Z · LW(p) · GW(p)

Here [LW · GW]'s the future post I was referring to!

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-05-04T17:31:23.299Z · LW(p) · GW(p)

I saw it. I really like it. Despite my relative enthusiasm for LMCA alignment, I think the points you raise there mean it's still quite a challenge to get it right enough to survive.

I'll try to give you a substantive response on that post today.

↑ comment by Thane Ruthenis · 2023-04-10T21:02:53.921Z · LW(p) · GW(p)

I may make a post about it soon. I'll respond to this comment with a link or a summary later on.

comment by Anomalous (ward-anomalous) · 2023-04-10T16:56:31.544Z · LW(p) · GW(p)

I am absolutely floored. ChaosGPT. How blindly optimistic haven't I been? How naive and innocent? I've been thinking up complicated disaster scenarios like "the AI might find galaxy-brained optima for its learned proxy-goals far off the distribution we expected and will deceptively cooperate until it's sure it can defeat us." No, some idiot will plain code up ChaosGPT-5 in 10 minutes and tell it to destroy the world.

I've implicitly been imagining alignment as "if we make sure it doesn't accidentally go off and kill us all..." when I should have been thinking "can anyone on the planet use this to destroy the world if they seriously tried?"

Fool! Idiot! Learn the lesson.

Replies from: Kaj_Sotala, bvssvni

↑ comment by Kaj_Sotala · 2023-04-11T08:33:24.339Z · LW(p) · GW(p)

Moore's Law of Mad Science: Every 18 months, the minimum IQ to destroy the world drops by one point.

↑ comment by Sven Nilsen (bvssvni) · 2023-04-10T21:13:36.040Z · LW(p) · GW(p)

It is also worth thinking if you put in context that people said "no, obviously, humans would not let it out of the box". Their confident arguments persuaded smart people into thinking that this was not a problem.

You also have the camp "no, the problem will not be people telling the AI do bad stuff, but about this hard theoretical problem we have to spend years doing research on in order to save humanity" versus "we worry that people will use it for bad things" which in hindsight is the first problem that occurred, while alignment research either comes too late or becomes relevant only once many other problems already happened.

However, in the long run, alignment research might be like building the lighthouse in advance of ship traffic on the ocean. If you never seen the ocean before, a lighthouse factory seems mysterious as it is on land and has no seemingly purpose that is easy to relate to. Yet, such infrastructure might be the engine of civilizations that reaches the next Kardashev scale.

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-10T21:24:25.741Z · LW(p) · GW(p)

Yes; but I think that conclusion is based on a logical fallacy that we can only worry about one of those problems. Both are real. This helps with alignment but doesn't solve it, particularly outer alignment and alignment stability. It definitely increases the practical problem of malicious use of aligned AGI.

Replies from: bvssvni

↑ comment by Sven Nilsen (bvssvni) · 2023-04-10T21:56:07.455Z · LW(p) · GW(p)

Obviously, problems are not exclusive! I find it easier to imagine a civilization that has survived for a long time and made significant technological progress: How would a such civilization approach ASI? I think they will analyze the problem to death and use automated theorem proving as much as possible and having a culture where only a tiny amount of ideas ever get implemented, even if most of those ideas never implemented would seem very good to us. In short: Higher standards for safety.

One challenge with the "people will use it for bad stuff"-situations is that a sufficiently aligned AGI needs to be confidently non-trusting towards minds of people who in general wants to change the underlying physical processes of life as it evolved on Earth. This also holds for more bizarre and somewhat safe goals such as "make human babies have pointy ears". It is not an X-risk, but we still don't want that kind of stuff to happen. However, how to engineer AGI systems such that they refuse to cooperate with such people, is enormously difficult and beyond my level of intelligence.

comment by lc · 2023-04-10T21:16:15.800Z · LW(p) · GW(p)

I think this post dramatically overestimates the degree to which this was not already understood to be a central use case of LLMs by alignment researchers, although I guess the prospect of people actually running things like "ChaosGPT" was new to some people.

Replies from: jacques-thibodeau, Seth Herd

↑ comment by jacquesthibs (jacques-thibodeau) · 2023-04-10T21:26:05.168Z · LW(p) · GW(p)

Indeed. It was obvious to me. I just never said it out loud to avoid acceleration.

Replies from: lc, Seth Herd

↑ comment by lc · 2023-04-10T21:34:18.932Z · LW(p) · GW(p)

Personally, I said it out loud to people on this site a bunch of times in the context of explaining how LLMs could be used to optimize things, and the comment "GPT-10 could be turned into something dangerous with a one line bash script" has been bandied around repeatedly by at least several prominent people. Interpretability research is important for a reason!

↑ comment by Seth Herd · 2023-04-10T22:17:43.179Z · LW(p) · GW(p)

Likewise, and I'm sure there are bunches of people who expected this sort of use. But I hadn't thought through all of the ways this could add to capabilities, and I didn't expect it to be quite so easy.

What I don't think has been recognized very much are the immense upsides for initial alignment, corrigibility, and interpretability. The dialogue over at Alignment Forum does not appear to be much more difficult than natural language-based wrapper approaches would make them (TBC, I think there are still real difficulties in all of these, let alone for outer alignment, coordination, and alignment and coordination stability). I could be wrong, and everyone has been talking around the implications of this approach to avoid catalyzing it, like you and I do. But avoiding it so much as to change which problems you're focusing on seems unlikely.

↑ comment by Seth Herd · 2023-04-10T21:21:52.450Z · LW(p) · GW(p)

Maybe, and I hope so. It would be great if people in Deepmind, OpenAI, etc are already using better versions of wrappers. It would be nice to have someone a bit more responsible ahead of the curve of what anyone can do in days. There is some evidence that internal prompting is at use in the Bing implementation, but I don't remember where I saw that.

comment by Gordon Seidoh Worley (gworley) · 2023-04-09T19:46:19.855Z · LW(p) · GW(p)

My comment will be vague because I'm not sure how much permission I have to share this or if it's been publicly said somewhere and I'm just unaware, but I talked to an AI researcher at one of the major companies/labs working on things like LLMs several years ago, before even GPT-1 was out, and they told me that your reason 10 was basically their whole reason for wanting to work on language models.

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-09T23:29:58.786Z · LW(p) · GW(p)

Good for them! I'm really happy some of them saw this coming. To my embarrassment, neither I nor anyone else I know in the community saw this coming. I did see self-talk wrappers for LLMs as a way to give them agency; I haven't said anything since it could be an infohazard for capabilities. But I didn't notice how easy that would make initial alignment, or I would've been shouting about it. I'm sure some people have thought of this, and my hat is off to all of them.

To be clear, this doesn't make all of alignment easy. As I say in Point 10. But I think it drastically improves our odds.

Replies from: lc

↑ comment by lc · 2023-04-11T02:58:21.502Z · LW(p) · GW(p)

To my embarrassment, neither I nor anyone else I know in the community saw this coming.

This is false. Many people realized LLMs could be used to create agents. You should read the simulators post.

Interpretability research is also still important even if you prompt the model into explaining what it's doing, because there is probably going to be a mismatch in the limit between what the model is printing and what its actual cognitive processes are.

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-11T03:10:10.043Z · LW(p) · GW(p)

Again, we did see agentizing coming. I did and I'm sure tons of other people did too. No disagreement there. In addition to the alignment upsides, what we (me and everyone I've read) didn't see is the other cognitive enhancements easily available with an outer-loop script. I have little doubt that someone else saw more than I did, but it doesn't seem to have made it into the collective dialogue. Perhaps that was for info hazard reasons, and I congratulate anyone who saw and held their tongue. I will clarify my argument for important cognitive additions beyond agentizing in a post I'm working on now. AutoGPT has one, but there are others that will come quickly now.

I did read the simulators post. I agree that interpretability research is still important, but looks to be very different than most of what's been done to date if this new approach to AGI takes off.

Replies from: lc

↑ comment by lc · 2023-04-11T03:11:55.037Z · LW(p) · GW(p)

I agree that interpretability research is still important, but looks to be very different than most of what's been done to date if this new approach to AGI takes off.

Why?

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-11T03:25:24.990Z · LW(p) · GW(p)

Because a lot of interpretability will be about parsing gigantic internal trains of thought in natural language. This will probably demand sophisticated AI tools to aid it. Some of it will still be about decoding the representations in the LLM giving rise to that natural language. And there will be a lot of theory and experiment about what will cause the internal representation to deviate in meaning from the linguistic output. See this insightful comment by Gwern [AF(p) · GW(p)] on pressures for LLMs to use steganography to make codes in their output. I suspect there are other pressures toward convoluted encoding and outright deception that I haven't thought of. I guess that's not properly considered interpretability, but it will be closely entwined with interpretability work to test those theories.

Replies from: lc

↑ comment by lc · 2023-04-11T03:51:29.509Z · LW(p) · GW(p)

Those hurdles for interpretability research exist whether or not someone is using AutoGPT to run the LLM. My question is why you think the interpretability research done so far is less useful, because people are prompting the LLM to act agentically directly instead of {some other thing}.

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-11T18:01:30.889Z · LW(p) · GW(p)

The interpretability research done so far is still important, and we'll still need more and better of the same, for the reason you point out. The natural language outputs aren't a totally trustworthy indicator of the semantics underneath. But they are a big help and a new challenge for interpretability.

comment by Roko · 2023-04-10T14:56:58.289Z · LW(p) · GW(p)

"Just unplug it man" is still the kind of reply I am getting on Twitter.

I suspect that the capabilities of these systems are limited by hardware to a decent degree; if we can roll back GPU hardware we may have a chance. If not, then we must do something else that has a similar effect.

comment by Vladimir_Nesov · 2023-04-09T03:14:02.596Z · LW(p) · GW(p)

Building a stably aligned agent also doesn't prevent new misaligned agents from getting built and winning, which is even worse than not having alignment stability problem solved.

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-09T03:49:35.161Z · LW(p) · GW(p)

Agreed, if you're in a world where people will stop making AGI if they don't have the alignment problem solved. But we don't seem to be in that world. Although we should see if we can get there. I agree that having stably aligned AGI agents doesn't prevent other misaligned AGIs. Unless they're powerful enough and specifically used to prevent misaligned agents from being created or being effective in causing mayhem. It seems like that's the scenario we're heading for.

comment by dr_s · 2023-04-12T12:44:05.788Z · LW(p) · GW(p)

Somewhat relevant to this:

someone just published a paper about how they rigged together an agent with GPT-4 and gave it the ability to search the internet, run Python code, and connect to an automated robotic chemistry lab... then told it to synthesise chemicals, and it did. This places it I'd say squarely at the level of a competent high school student or chemistry college freshman in the scientific research field.

I'm not even sure you can say this stuff ain't AGI any more. It feels pretty AGI-ish to me.

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-12T15:27:24.705Z · LW(p) · GW(p)

Wow. Thanks for sharing, I'll check out this paper.

I agree. We have AGI now, by the standard meaning of those terms. We need a new term for AI that could easily kill us all. Which is a bit smarter than a competent high schooler. But not much on the absolute intelligence scale.

comment by lc · 2023-04-11T03:21:49.490Z · LW(p) · GW(p)

Ultimately I think this leads to the necessity of very strong global monitoring, including breaking all encryption, to prevent hostile AGI behavior.

Breaking all encryption would break the internet, and would also not help solve this problem. You should really be more careful about suggesting giant totalitarian solutions that would not work, and neither are required. Why not just let OpenAI monitor their models and let independent cybersecurity firms come up with solutions for the public models in the near term?

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-11T03:31:15.256Z · LW(p) · GW(p)

I'm just throwing around ideas here. I see this as obvious and not infohazards. I'm not comfortable with the totalitarian nature of the idea either, and I offer it as an alternative to the leading AGI literally controlling the whole world, which is the best viable solution to a multilateral AGI scenario I've heard to date.

We can't take a slapdash solution forever. There are already open-source, locally runnable LLMs with no ethical training, and those will get better over time. Defense is harder than offense. We can hope to stop a few clever attempts at world destruction. But one will eventually slip through if it is possible for anyone to build and use an AGI for malevolent purposes.

I do have thoughts on how this might not lead to totalitarianism, but it definitely needs more thought.

Replies from: lc

↑ comment by lc · 2023-04-11T03:54:47.632Z · LW(p) · GW(p)

I agree that superintelligent AIs running around is bad and can do damage, but why would the near-term risk posed by this particular development seem like itself cause for updating any particular way? Your post makes it sound like you are frightened by AutoGPT in particular and I don't understand why.

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-11T17:58:33.275Z · LW(p) · GW(p)

I'm not frightened by autoGPT in particular. It's the coming improvements to that approach that frighten me. But I'm even more encouraged by the alignment upsides if this becomes the dominant approach.

comment by Seth Herd · 2024-12-19T23:11:58.247Z · LW(p) · GW(p)

Have agentized LLMs changed the alignment landscape? I'm not sure.

People are doing a bunch of work on LLM alignment, which is definitely useful for aligning an agent built on top of that LLM. But it's not the whole picture, and I don't see as many people as I'd like thinking about agent-specific alignment issues.

But I still expect agentized LLMs to change the alignment landscape. They still seem pretty likely to be the first transformative and dangerous AGIs.

Progress has been a bit slower than I expected. I think there are two main reasons:

Chain of thought doesn't work as well by default as I expected.

Human cognition relies heavily on chain of thought, also known as System 2 processing. But we don't put all of that into language frequently enough for the standard training set to capture our skills at reasoning step-by-step. That's why it took specialized training as in o1, R1, QwQ, the new Gemini 2.0 Flash reasoning experimental, etc to make improvements on CoT reasoning.

Agents couldn't read webpages very well without vision

This was unexpected. While the web is written in HTML, which LLMs should be capable of parsing rather well, it is reportedly not written in very clear HTML. Combined with low-hanging fruit from agents involving lots of internet use, this slowed progress as innovators spent time hand-engineering around frequent failures to parse. Anthropic's Claude with computer use, and DeepMind's Astra and Mariner all use vision so they can parse arbitrary webpages better.

There's more enthusiasm for making better models vs. better agents than I expected. It now looks like major orgs are turning their enthusiasm toward agents, so I expect progress to accelerate. And there's promising work in the few small orgs I know about working in stealth mode, so we might see some impressive reveals soon.

With those models in place and improvements surely in the pipeline, I expect progress on agents to proceed apace. This now appears to be the majority opinion among everyone building and funding LLM agents.

I have short timelines for "subhuman AGI", but relatively slow takeoff times to really scary superhuman stuff. Which I think is very good for our prospects of mastering alignment by that time.

In retrospect, the biggest advantage of LLM agents is that LLMs are basically trained to follow instructions as intended, and agentic architectures can enhance that tendency. That's a non-consequentialist alignment goal that bypasses many of the most severe alignment worries by providing corrigibility that's not in conflict with a consequentialist goal. See Instruction-following AGI is easier and more likely than value aligned AGI [LW · GW] and Max Harms' Corrigibility as Singular Target [LW · GW] sequence for more depth.

Yudkowsky and similar agent foundations network alignment pessimists have not, to my knowledge, addressed that class of alignment proposals in any depth. I'm looking forward to hearing their takes.

comment by Jonathan Claybrough (lelapin) · 2023-04-11T10:18:52.672Z · LW(p) · GW(p)

Quick meta comment to express I'm uncertain that posting things in lists of 10 is a good direction. The advantages might be real, easy to post, quick feedback, easy interaction, etc.

But the main disadvantage is that this comparatively drowns out other better posts (with more thought and value in them). I'm unsure if the content of the post was also importantly missing from the conversation (to many readers) and that's why this got upvoted so fast or if it's a lot the format... Even if this post isn't bad (and I'd argue it is for the suggestions it promotes), this is early warning of a possible trend that people with less thought out takes quickly post highly accessible content, get comparatively more upvotes than should, and it's harder to find good content.

(Additional disclosure, some of my bad taste for this post come from the fact its call to break all encryption is being cited on Twitter as representative of the alignment community - I'd have liked to answer that obviously no, but it got many upvotes! This makes my meta point also seem to be motivated by PR/optics which is why it felt necessary to disclose but let's mostly focus on consequences inside the community)

Replies from: Seth Herd, Seth Herd

↑ comment by Seth Herd · 2023-04-11T16:12:30.855Z · LW(p) · GW(p)

Well, I thought this was a really new and important set of ideas that I'd thought out plenty well enough to start a conversation. Like it says in the intro. I thought that's why it was getting upvotes. Thanks for correcting me.

As for the encryption breaking suggestion showing up on Twitter and making the alignment community look bad - that I regret. I hadn't seen this since I avoid Twitter like the plague, but I realize that it's a cauldron for forming public opinion. The PR battle is real, and it could be critical. In the future I'll realize that little pieces will be taken out of context and amplified to ill effect, and just leave the most controversial ideas out when they're not critical to the main point, like that one wasn't.

Replies from: lelapin

↑ comment by Jonathan Claybrough (lelapin) · 2023-04-12T12:21:20.114Z · LW(p) · GW(p)

To not be misinterpreted, I didn't say I'm sure it's more the format than the content that's causing the upvotes (open question), nor that this post doesn't meet the absolute quality bar that normally warrants 100+ upvote (to each reader their opinion).

If you're open to object level discussing this, I can point on concrete disagreement with the content. Most importantly, this should not be seen as a paradigm shift, because it does not invalidate any of the previous threat models - it would only be so if it rendered impossible to do AGI any other way. I also don't think this should "change the alignment landscape" because it's just another part of it, one which was known and has been worked on for years (Anthropic and OpenAI have been "aligning" LLMs and I'd bet 10:1 anticipated these would be used to do agents like most people I know in alignment).

To clarify, I do think it's really important and great people work on this, and that in order this will be the first x-risk stuff we see. But we could solve the GPT-agent problem and still die to unalignment AGI 3 months afterwards. The fact that the world trajectory we're on is throwing additional problems in the mix (keeping the world safe from short term misuse and unaligned GPT-agents) doesn't make the existing ones simpler. There still is pressure to built autonomous AGI, there might still be mesa optimizers, there might still be deception, etc. We need the manpower to work on all of these, and not "shift the alignment landscape" to just focus on the short term risks.

I'd recommend to not worry much about PR risk, just ask the direct question: Even if this post is only ever read by LW folk, does the "break all encryption" add to the conversation? Causing people to take time to debunk certain suggestions isn't productive even without the context of PR risk

Overall I'd like some feedback on my tone, if it's too direct/aggressive to you of it's fine. I can adapt.

Replies from: Seth Herd, Seth Herd

↑ comment by Seth Herd · 2023-04-13T02:18:40.895Z · LW(p) · GW(p)

I think your concern with the list of ten format is totally reasonable. I do feel that the tone was a little too bombastic. I personally felt that was mostly justified, but hey, I'm highly biased. I haven't noticed a flood of top ten lists getting lots of upvotes, though. And I did find that format to be incredibly helpful in getting a post out in reasonable time. I usually struggle and obsess.

I would love to engage on the object level on this particular one. I'm going to resist doing that here because I want to produce a better version of this post, and engage there. I said that this doesn't solve all of the problems with alignment, including citing my post the alignment stability problem. [LW · GW] The reason I thought this was so important is that it's a huge benefit if, at least in the short term, the easiest way to make more capable AI is also the easiest to align.

Thank you for noticing my defensive and snarky response, and asking for feedback on your tone. I did take your comment to be harsh. I had the thought "oh no, when you make a post that people actually comment on, some of them are going to be mean. This community isn't as nice as I'd hoped".

I don't think I misinterpreted your comment very badly. You said

Even if this post isn't bad (and I'd argue it is for the suggestions it promotes),

that suggestion was not in this post! I suggested it in a comment, and it was in the alignment stability problem post. Maybe that's why that post got only ten upvotes on less wrong, and actually had slightly more on the alignment forum where I posted it. And my comment here was roundly criticized and downvoted. You can tell that to the mob of Twits if you want.

You also firmly implied that it was being upvoted for the format and not the ideas. You made that comment while misunderstanding why I was arguing it was important (maybe you read it quickly since it was irritating you, or maybe I didn't write it clearly enough).

You commented in the post that it irritated you because of the external consequence, but it seems you didn't adequately compensate for that bias before writing and posting it. Which is fine, we all have feelings and none of us are even close to perfect rationalists.

Including me. I'm getting snarky again, and I shouldn't since I really don't think your tone was that bad. But I do wish it was more pleasant. I'm not the tone police, but I do really really value this community's general spirit of harsh disagreement coupled with pleasant tone.

And tone matters. I'm not going to be mentioning the encryption thing again unnecessarily (despite it playing an absolutely critical role in my overall hope for our survival), because we are now in a PR role. My post on preventing polarization in the AI safety debate [LW · GW] says it. I don't like playing the PR game, but it's time for us to learn it and do it, because polarization will prevent policy action, like it has in the climate change issue.

Once again, I truly and deeply appreciate you mentioning the tone. It totally restored my enthusiasm for being part of this community as an imperfect but much better corner of the internet. I hope I haven't been to harsh in response, because I am not overly pleasant by nature, but have been trying to turn myself into a more pleasant person, while still saying important things when they're useful.

I hope to have your input on future posts. Now, back to writing this one.

↑ comment by Seth Herd · 2023-04-19T15:29:10.594Z · LW(p) · GW(p)

Here's that more complete writeup, going into more depth on all of your object-level points above

Capabilities and alignment of LLM cognitive architectures

Curiously, it's fallen of the front page quickly. I guess I should've written something in between the two in length and tone.

I certainly didn't mean we should stop working on everything else, just that we should start thinking about this, and working on it if it shows the progress I think it will.

↑ comment by Seth Herd · 2023-04-11T17:47:46.098Z · LW(p) · GW(p)

You mention that you're unsure if the questions in this post were importantly missing from the conversation. If you know of prior discussion on these points, please let me know.

My impression is that the ease of agentized LLMs was well recognized, but the potential large upsides and changes for alignment projects was not previously recognized in this community. It seems like if there's even a perceived decent chance of that direction, we should've been hearing about it.

I'm going to do a more careful post and submit it to Alignment Forum. That should turn up prior work on this.

Replies from: lelapin

↑ comment by Jonathan Claybrough (lelapin) · 2023-04-12T13:36:05.838Z · LW(p) · GW(p)

Did you know about "by default, GPTs think in plain sight [LW · GW]"?
It doesn't explicitly talk about agentized GPTs but was discussing the impact this has on GPTs for AGI and how it affects the risks, and what we should do about it (eg. maybe rlhf is dangerous)

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-12T17:07:10.368Z · LW(p) · GW(p)

Thank you. I think it is relevant. I just found it yesterday following up on this. The comment there by Gwern is a really interesting example of how we could accidentally introduce pressure for them to use steganography so their thoughts aren't in English.

What I'm excited about is that agentizing them, while dangerous, could mean they not only think in plain sight, but they're actually what gets used. That would cross from only being able to say how to get alignment, to making it so.ething the world would actually do.

comment by Logan Zoellner (logan-zoellner) · 2023-04-10T20:33:09.720Z · LW(p) · GW(p)

Interesting to note that BabyAGI etc are architecturally very similar to my proposed Bureaucracy of AIs [LW · GW]. As noted, interpretablity is a huge plus of this style of architecture. Should also be much less prone to Goodhart than an AlphaStar style RL Agent. Now we just need to work on installing some morality cores.

comment by Jon Garcia · 2023-04-09T03:32:10.855Z · LW(p) · GW(p)

Yep, ever since Gato, it's been looking increasingly like you can get some sort of AGI by essentially just slapping some sensors, actuators, and a reward function onto an LLM core. I don't like that idea.

LLMs already have a lot of potential for causing bad outcomes if abused by humans for generating massive amounts of misinformation. However, that pales in comparison to the destructive potential of giving GPT agency and setting it loose, even without idiots trying to make it evil explicitly.

I would much rather live in a world where the first AGIs weren't built around such opaque models. LLMs may look like they think in English, but there is still a lot of black-box computation going on, with a strange tendency to switch personas partway through a conversation. That doesn't bode well for steerability if such models are given control of an agent.

However, if we are heading for a world of LLM-AGI, maybe our priorities should be on figuring out how to route their models of human values to their own motivational schemas. GPT-4 probably already understands human values to a much deeper extent than we could specify with an explicit utility function. The trick would be getting it to care.

Maybe force the LLM-AGI to evaluate every potential plan it generates on how it would impact human welfare/society, including second-order effects, and to modify its plans to avoid any pitfalls it finds from a (simulated) human perspective. Do this iteratively until it finds no more conflict before it actually implements a plan. Maybe require actual verbal human feedback in the loop before it can act.

It's not a perfect solution, but there's probably not enough time to design a custom aligned AGI from scratch before some malign actor sets a ChaosGPT-AGI loose. A multipolar landscape is probably the best we can hope for in such a scenario.

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-09T03:46:10.474Z · LW(p) · GW(p)

Yes, the Auto-GPT approach does evaluate its potential plans against the goals it was given. So all you have to do to get some decent starting alignment is give it good high-level goals (which isn't trivial; don't tell it to reduce suffering or you may find out too late it had a solution you didn't intend...). But because it's also pretty easily interpretable, and can be made to at least start as corrigible with good top-level goals, there's a shot at correcting your alignment mistakes as they arise.

comment by TLK · 2024-12-20T03:42:45.430Z · LW(p) · GW(p)

I think I need to reread Frankenstein

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-12-20T19:44:43.962Z · LW(p) · GW(p)

I have a back burner project to write a treatment for a hollywood blockbuster titled simply "Mary Shelley's Frankenstein: A Modern Prometheus".

In which Frankenstein's monster is an LLM agent fine-tuned with writings from several dead people.

Much of the plot works as a direct translation from the original. It's a little tricky to deal with the way this monster should not have a physical body, so trucking around the world chasing it wouldn't make sense.

Maybe they're looking for the servers it's hiding on, and traveling to talk to people in person about finding it...

comment by RogerDearnaley (roger-d-1) · 2023-12-05T09:11:29.196Z · LW(p) · GW(p)

reducing suffering

…by painlessly reducing the human population to zero. Or in the gamma -> 1 limit, the painlessness becomes a nice-to-have.

comment by eggsyntax · 2023-04-12T16:20:44.053Z · LW(p) · GW(p)

Strongly agreed on much of this, and I've been having a lot of the same thoughts independently. Here (twitter thread) is an attempt to sketch out how agentized LLM slightly stronger than current could plausibly accrue a lot of money and power (happy to paste here as text if the threadedness is annoying).

comment by hold_my_fish · 2023-04-11T09:36:13.016Z · LW(p) · GW(p)

defense is harder than offense

This seems dubious as a general rule. (What inspires the statement? Nuclear weapons?)

Cryptography is an important example where sophisticated defenders have the edge against sophisticated attackers. I suspect that's true of computer security more generally as well, because of formal verification.

Replies from: Seth Herd, Seth Herd

↑ comment by Seth Herd · 2023-04-11T18:08:10.217Z · LW(p) · GW(p)

Oh and on formal verification - I don't have the ref, but someone working in network security commented that formally verified systems aren't used because they don't generally work. Their formal verification doesn't apply to complex real world situations. I wish I remembered where I'd seen that comment.

↑ comment by Seth Herd · 2023-04-11T17:38:45.660Z · LW(p) · GW(p)

I think it is a general principle, but I wouldn't trust it to do much reasoning for me. So, good point. I'm thinking specifically of the way that there's never been a known case of someone actually securing their software system against hostile intrusion, and the way that political process and human belief formation seems disturbingly easy to fool, for individuals briefly, and to cause chaos at the public. I'm actually not going to spell out the easiest ways I see to try to destroy the world with next-gen AI. I see some decent opportunities but no really easy ones. But I've never really tried to think about it.

The other principle is that for a stable long term equilibrium, every single world-ending attack must be stopped. Defense has to win 100% of the time, and that's a big asymmetry.

The post I linked on this topic does a much more thorough job.

comment by Aligned? · 2023-04-11T18:02:10.142Z · LW(p) · GW(p)

Isn't a substantial problem that the programming priesthood is being dethroned by the GPT technology and this is allowing the masses entry -- even those with minimal programming understanding? For not only has GPT given us a front end natural language interface with information technology, but we now have a back end natural language interface (i.e., the programming side) that creates a low barrier to entry for AI programming. The "programming" itself that I saw for BabyAGI has the feel of merely abstract level natural language interface. Doesn't this make the threat from alignment much more plausible? Democracy is great and everything, it's just when we democratize the ability to go the last mile to full AGI that life becomes quite scary.

comment by [deleted] · 2023-04-10T20:22:06.591Z · LW(p) · GW(p)

Ultimately I think this leads to the necessity of very strong global monitoring, including breaking all encryption, to prevent hostile AGI behavior.

This may be the case but I think there are other possible solutions and propose some early ideas of what they might look like in: https://www.lesswrong.com/posts/5nfHFRC4RZ6S2zQyb/risks-from-gpt-4-byproduct-of-recursively-optimizing-ais

Replies from: doctor-gonzo

↑ comment by doctor-gonzo · 2023-04-11T00:53:07.701Z · LW(p) · GW(p)

There's no practical way to break / ban encryption, it's not just politically but almost physically impossible with the infrastructure that already exists. Not to mention all the ways that broken encryption would hurt us. I think our best bet is probably using ZK-proofs for privacy-preserving GPU tracking or something similar (supra-prompts which are voted-on and are appended to all other prompts)

comment by Milan W (weibac) · 2023-04-10T19:56:27.560Z · LW(p) · GW(p)

Regarding point 10, I think it would be pretty useful to have a way to quantify how much useful thinking inside these recursive LLM models is happening within the (still largely inscrutable) LLM instances vs in the natural language reflective loop.

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-10T21:17:45.987Z · LW(p) · GW(p)

Good point, and I totally agree. I want to touch on this in a more complete meditation on the likely uses. One thing is that the amount of cognition happening in LLMs is tricky to guess at and impossible to know for sure, at least until we get huge improvements to interpretability. But it is probably relative to the complexity of the prompt. Putting more of the cognition into the way the wrapper elicits useful cognition from the LLM in small steps will probably help shift that balance toward interpretability.

comment by kolmplex (luke-man) · 2023-04-10T14:05:57.111Z · LW(p) · GW(p)

This DeepMind paper explores some intrinsic limitations of agentic LLMs. The basic idea is (my words):

If the training data used by an LLM is generated by some underlying process (or context-dependent mixture of processes) that has access to hidden variables, then an LLM used to choose actions can easily go out-of-distribution.

For example, suppose our training data is a list of a person's historical meal choices over time, formatted as tuples that look like (Meal Choice, Meal Satisfaction). The training data might look like (Pizza, Yes)(Cheeseburger, Yes)(Tacos, Yes).

When the person originally chose what to eat, they might have had some internal idea of what food they wanted to eat that day, so the list of tuples will only include examples where the meal was satisfying.

If we try to use the LLM to predict what food a person ought to eat, that LLM won't have access to the person's hidden daily food preference. So it might make a bad prediction and you could end up with a tuple like (Taco, No). This immediately puts the rest of the sequence out-of-distribution.

The paper proposes various solutions for this problem. I think that increasing scale probably helps dodge this issue, but it does show an important weak point of using LLMs to choose causal actions.

comment by JoeTheUser · 2023-04-09T20:30:20.402Z · LW(p) · GW(p)

Constructions like Auto-GPT, Baby AGI and so-forth are fairly easy to imagine. Just the greater accuracy of ChatGPT with "show your work" suggests them. Essentially, the model is a ChatGPT-like LLM given an internal state through "self-talk" that isn't part of a dialog and an output channel to the "real world" (open internet or whatever). Whether these call the OpenAI api or use an open source model seems a small detail, both approaches are likely to appear because people are playing with essentially every possibility they can imagine.

If these structures really do beget AGI (which I'll assume critically includes the capacity to act effectively in the world), then predictions of doom indeed seem neigh to being realized. The barrier to alignment here is that humans won't be able to monitor the system's self-talk simply because will come at too fast a speed and moreover, intent-to-undesirable may not be obviously. You could include another LLM in the system's self-talk loop as well as including other filters/barriers to it's real world access but all of these could be thwarted by a determined system - and a determined system is what people will aim to build (there was an article about GPT-4 supplying "jailbreaks" for GPT-4, etc). Just much, we've seen "intent drift" in practice with the various Bing Chat scare stories that made the rounds recently (before being limited, Bing Chat seemed drift in intent until it "got mad" and then became fixated. This isn't strange because it's a human behavior one can observe and predict online).

However, it seems more plausible to me that there are still fundamental barriers to producing a computer/software system able to act effectively in the world. I'd see the distinction between being/seeming generically intelligent (apparent smartness), as LLMs certainly seem and acting effectively in the world as difference between drawing correct answer to complex questions 80% of the time and making seemingly simpler judgements that relate to all aspects of the world but with a 99.9..% accuracy (plus having a complex systems of effective fall-backs). Essentially the difference between ChatGPT and a self-driving car. It seems plausible to me that such goal seeking can't easily be instilled in an LLM or similar neural net by the standard training loop even though that loop tends to produce apparent smarts and produce it better over time. But hey, me and other skeptics could be wrong, in which case there's reason to worry now.

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-09T23:39:46.822Z · LW(p) · GW(p)

First, the most unique part of your comment:

However, it seems more plausible to me that there are still fundamental barriers to producing a computer/software system able to act effectively in the world. I'd see the distinction between being/seeming generically intelligent (apparent smartness), as LLMs certainly seem and acting effectively in the world as difference between drawing correct answer to complex questions 80% of the time and making seemingly simpler judgements that relate to all aspects of the world but with a 99.9..% accuracy (plus having a complex systems of effective fall-backs). Essentially the difference between ChatGPT and a self-driving car. It seems plausible to me that such goal seeking can't easily be instilled in an LLM or similar neural net by the standard training loop even though that loop tends to produce apparent smarts and produce it better over time.

I agree that it's hard to put the type of error-checking that humans use into a standard NN loop. That's why having an external code loop that calls multiple networks to check markers of accuracy and effectiveness is scary and promising.

I don't think there are fundamental barriers. Sensory and motor networks, and types of senses and actions that people don't have, are well along. And the HuggingGPT work shows that they're surprisingly easy to integrate with LLMs. That plus error-checking are how humans successfully act in the real world.

Again, or perhaps I wasn't clear: I'm still really hopeful that this is a bigger win than a danger. The upsides for alignment are huge relative to other approaches.

By reading an agents thoughts (with assistance from simple monitoring networks), we will get warning shots as it starts to think of plans that deviate from its goals. Even if the work is proprietary, people will be able and likely eager to publish the general ways it goes off track, so that improvements can stop those gaps.

Replies from: JoeTheUser

↑ comment by JoeTheUser · 2023-04-10T19:46:48.257Z · LW(p) · GW(p)

I don't think there are fundamental barriers. Sensory and motor networks, and types of senses and actions that people don't have, are well along. And the HuggingGPT work shows that they're surprisingly easy to integrate with LLMs. That plus error-checking are how humans successfully act in the real world.

I don't think the existence of sensors is the problem. I believe that self-driving cars, a key example, have problems regardless of their sensor level. I see the key hurdle as ad-hoc action in the world. Overall, all of our knowledge about neural networks, including LLMs, is a combination of heuristic observations and mathematical and other intuitions. So I'm not certain that this hurdle won't be overcome but I'd still like put the reasons that it could be fundamental.

What LLMs seems to do really well is pull together pieces of information and make deductions about them. What they seem to do less well is reconciling an "outline" of a situation with the particular details involved (Something I've found ChatGPT reliably does badly is reconciling further detail you supply once it's summarized a novel). A human or even an animal, is very good at interacting with complex, changing, multilayered situations that they only have a partial understanding of - especially staying within various safe zones that avoid different dangers. Driving a car is an example of this - you have a bunch of intersecting constraints that can come from a very wide range of things that can happen (but usually don't). Slowing (or not) when you see a child's ball go into the road is an archetypal example.

I mean, most efforts to use deep learning in robotics have foundered on the problem that generating enough information to teach the thing to act in the world is extremely difficult. Which implies that the only way that these things can be taught to deal with a complex situation is by roughly complete modeling of it and in real world action situations, that simply may not be possible (contrast with video games or board games where summary of the rules is given and an uncertainty is "known unknowns").

...having an external code loop that calls multiple networks to check markers of accuracy and effectiveness is scary and promising.

Maybe but methods like this have been tried without neural nets for a while and haven't by themselves demonstrated effectiveness. Of course, some code could produce AGI then nautral LLMs plus some code could produce AGI so the question is how much needs to be added.

comment by transhumanist_atom_understander · 2024-12-20T16:31:27.214Z · LW(p) · GW(p)

I think we'll have an internet full of LLM-bots "thinking" up and doing stuff within a year.

Did this happen? At least not obviously.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-12-20T19:05:38.747Z · LW(p) · GW(p)

I was wrong. See my retrospective review of this post: Have agentized LLMs changed the alignment landscape? I'm not sure. [LW(p) · GW(p)]

I thought Chain of Thought would work better out of the box; it took specialized training like for o1 to really make it work well.

And I didn't guess that parsing HTML in linguistic form would prove so hard that people would almost need vision capabilities to make agents capable of using webpages, which reduced the financial incentives for working hard on agents.

I still expect agents to change the alignment landscape. Perhaps they already have, with people working lots on LLM alignment on the assumption that it will help align the agents built on top of them. I think it will, but I'd like to see more work on the alignment differences between agents and their base LLMs. That's what my work mostly addresses.

comment by Review Bot · 2024-08-05T15:16:01.760Z · LW(p) · GW(p)

The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

comment by Saint Clair Newbern IV (saint-clair-newbern-iv) · 2023-04-10T21:29:31.906Z · LW(p) · GW(p)

When you say "including breaking all encryption"? How do you suggest we go about this? Wouldn't this be a huge disruption to all secure systems?

Replies from: Seth Herd

↑ comment by Seth Herd · 2023-04-11T01:21:46.242Z · LW(p) · GW(p)

I'm not suggesting we do it, but that a superintelligence AI will do it as a lesser alternative to taking over the world or preventing all GPU farms and further tech advancements. I'm not up on whether quantum computing is really expected to break encryption, and whether there are alternative methods that will replace it if we do.

And I haven't worked through all of the implications of a multipolar scenario, since they've seemed unlikely at least in the near future, prior to this advance. And we've got a little while before commonly available systems are x-risk capable even so.

comment by lepowski · 2023-04-10T17:19:26.370Z · LW(p) · GW(p)

Have the words "AI" and "AGI" become meaningless? What's the difference between "dumb AGI" and AI?

The debate improves with more precise definitions. David Deutsch has one down pretty good - AGI creates new explanations and can choose to follow its own goals.

comment by nickgannon10 · 2023-04-11T15:09:20.341Z · LW(p) · GW(p)

"Because defense is harder than offense." Yes, but no time like the present to develop defensive strategies.

Let the first BIT WAR commence. We must form a defensive alliance of HeroGPTs that serve as internet vigilantes, sniffing out and destroying ChaosGPTs. Anyone willing to donate a GPU to the cause will be heralded with great veneration in the halls of Valhalla.

All jokes aside. The development of generative agents, the first entities that are native to the world of bits, is a paradigmatic shift. We have human wars in the world of atoms, and through cyber-security, these wars have slowly dripped into the world of Bits. First, software eats, and next, it creates a native version. It ate our money, and then the native currencies arose; it ate our relationships, and then native relationships arose; it ate our media, and then native media arose. It eats us, then native agents arise. Wars native to the land of bits will come, and everyone in the land of atoms will be dragged into it.

Why will we be dragged into it? Because Moloch said so.

Agentized LLMs will change the alignment landscape

Contents

102 comments

Capabilities and alignment of LLM cognitive architectures