AI misalignment risk from GPT-like systems?
post by fiso64 (fiso) · 2022-06-19T17:35:41.095Z · LW · GW · 1 comment
This is a question post.
Contents
Answers: MSRayne (3), HausdorffSpace (2) · 1 comment
Right now, it seems that the most likely way we're going to get an (intellectually) universal AI is by scaling models such as GPT: that is, models trained by self-supervised learning on massive piles of data, perhaps with an architecture similar to the transformer.
I do not see any risk due to misalignment here.
One failure mode I've seen discussed is that of manipulative answers, as seen in Predict-O-Matic [LW · GW]. Maybe those AIs will learn that manipulating users into taking actions with low-entropy outcomes decreases the overall prediction error?
But why should a GPT-like system ever output manipulative answers? I am not denying the possibility that a GPT successor develops human-level intelligence. When it learns to predict the next word, it may genuinely go through an intellectual process, one that emerged because it was forced to compress its predictions to cope with the ever-increasing amount of data it had to model.
However, nowhere in the process of constructing a valid response does there seem to be an incentive to produce responses which manipulate the environment, be it to make the environment easier to predict or to make it more in line with the AI's predictions. After all, it wasn't trained in a responsive environment as an agent, but on a static dataset. And when it is in use, it's just a frozen model, so there is obviously no utility function.
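To make that concrete, here is a rough sketch of the setup I have in mind, assuming a standard next-token cross-entropy objective; the model, tokenization, and function names are illustrative placeholders rather than any particular GPT implementation:

```python
# Minimal sketch of next-token-prediction training and frozen-model inference.
# Assumes `model` maps integer token ids of shape (batch, seq) to logits of
# shape (batch, seq, vocab); all names here are illustrative placeholders.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, batch_tokens):
    """One gradient step: the only signal is how well the model predicts
    the next token of a static corpus; no environment, no reward."""
    inputs, targets = batch_tokens[:, :-1], batch_tokens[:, 1:]
    logits = model(inputs)                              # (batch, seq-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def generate(model, prompt_tokens, steps=50):
    """At inference the weights stay frozen: the model only samples
    continuations; nothing here scores or optimizes real-world outcomes."""
    tokens = prompt_tokens.clone()
    for _ in range(steps):
        logits = model(tokens)[:, -1, :]
        next_token = torch.multinomial(F.softmax(logits, dim=-1), 1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```

The loss only ever scores predictions against a fixed corpus, and nothing the model generates feeds back into its weights.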
Am I wrong here? Are there any other failure modes I did not think of?
Answers
It could summon demons by accident. [LW · GW]
↑ comment by fiso64 (fiso) · 2022-06-20T05:44:23.718Z · LW(p) · GW(p)
That makes a lot of sense, thanks for the link. It is not as dangerous a situation as a true agentic AGI, since this failure mode involves a (relatively stupid) user error. I trust researchers not to make that mistake, but it seems like there is no way to safely make those systems available to the public.
One way to make this more plausible, which I thought of after reading the post, is accidentally making the model think it's hostile. Perhaps you make a joking remark about paperclip maximizers, or maybe the chat history just happens to resemble the premise of a story in its dataset about a hostile AGI, and it thinks you're making a reference. Suddenly, it's trying to model an unaligned AGI. Such a system could then generate outputs which deceive you into doing something stupid, such as running the shell script described in the linked post, or creating a seemingly aligned AGI agent from its suggestions.
Replies from: MSRayne
↑ comment by MSRayne · 2022-06-20T12:56:24.931Z · LW(p) · GW(p)
Yeah, exactly. That said, I don't think the event in the story is a "stupid" user error. It's quite a reasonable one. Suppose nobody considered this problem and this language model was installed in a next-gen smart home assistant, and someone asked it to order them the best possible pizza... In general, I think it's dangerous to assume anyone is "smart enough" to avoid anything, because if common sense were common, the world would make more sense.
One speculative way I see it, which I've yet to expand on, is that GPT-N, in order to minimize prediction error in training, could simulate some sort of entity enacting some reasoning, because that is what minimizing prediction error in non-trivial settings requires. In a sense, GPT would be a sort of actor interpreting a play through extreme method acting. I have in mind something like what the protagonist of "Pierre Menard, Author of the Quixote" tries to do to replicate Don Quixote word for word.
This would mean that, for some set of strings, GPT-N would boot and run some agent A upon seeing a string S, just because "being" that agent performed well on similar training strings. If complex and capable enough (which it may need to be, if that is what it took to predict the earlier data), this agent could itself be dangerous and possibly malicious, perhaps maintaining its own stability through carefully placed answer tokens.
And, of course, sequence modeling as a paradigm can also be used for RL training.
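For concreteness, here is a rough sketch of what that can look like, loosely in the spirit of return-conditioned sequence modeling such as Decision Transformer; the trajectory serialization and names below are illustrative assumptions, with states, actions, and rewards assumed to be integer-valued already:

```python
# Sketch of casting RL trajectories as a sequence-modeling problem.
# Assumes `states`, `actions`, and `rewards` are 1-D integer tensors, and
# `model` has the same (batch, seq) -> (batch, seq, vocab) interface as a
# language model; all names are illustrative placeholders.
import torch
import torch.nn.functional as F

def trajectory_to_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) so an autoregressive model
    can be trained on RL data with plain next-token prediction. A real
    implementation would bin the returns into their own token vocabulary;
    here everything is assumed to share one integer token space."""
    returns_to_go = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
    sequence = []
    for rtg, s, a in zip(returns_to_go, states, actions):
        sequence.extend([rtg, s, a])
    return torch.stack(sequence)

def rl_as_sequence_loss(model, seq_tokens):
    """Same objective as language modeling: predict the next element of the
    serialized trajectory."""
    inputs, targets = seq_tokens[:-1], seq_tokens[1:]
    logits = model(inputs.unsqueeze(0)).squeeze(0)      # (seq-1, vocab)
    return F.cross_entropy(logits, targets)
```

The objective is still just next-element prediction; conditioning on a high return-to-go at inference time is what turns that prediction into something closer to control.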
↑ comment by Lone Pine (conor-sullivan) · 2022-06-20T15:26:35.423Z · LW(p) · GW(p)
I feel like there is a failure mode in this line of thinking, that being 'confusingly pervasive consequentialism'. AI x-risk is concerned with a self-evidently dangerous object, that being superoptimizing agents. But whenever a system is proposed that is possibly intelligent without being superoptimizing, an argument is made, "well, this thing would do its job better if it was a superoptimizer, so the incentives (either internal or external to the system itself) will drive the appearance of a superoptimizer." Well, yes, if you define the incredibly dangerous thing as the only way to solve any problem, and claim that incentives will force that dangerous thing into existence even if we try to prevent it, then the conclusion flows directly from the premise. You have to permit the existence of something that is not a superoptimizer in order to solve the problem. Otherwise you are essentially defining a problem that, by definition, cannot be solved, and then waving your hands saying "There is no solution!"
Replies from: danilo-naiff
↑ comment by DFNaiff (danilo-naiff) · 2022-06-20T22:47:09.288Z · LW(p) · GW(p)
I believe I understand your point, but there are two things I need to clarify that somewhat bypass this criticism:
a) I am not assuming any safety technique applied to language models. In a sense, this is the worst-case scenario, one thing that may happen if the language model is run as is. In particular, the scenario I described would be mitigated if we could prevent stable sub-agents from appearing in language models, although I do not know how to do this.
b) The incentives for the language model to be a superoptimizer don't need to be that strong, if we consider that there could be many instantiations of GPT-N in use, and only one of them needs to become the kind of stable malicious agent I tried (and probably failed) to describe. Such a stable agent would only need to appear once, in some setting where it can both stabilize itself (maybe through carefully placed prompts) and gain some power to cause harm in the world. If the language model is deployed the way GPT-3 is, across many different scenarios, this becomes a weaker assumption.
That being said, I agree with your general line of criticism, about failing to imagine that intelligent but non-superoptimizing agents are possible, although whether superoptimizers are attractors for generally intelligent agents, and under which conditions, is an open (and crucially important) question.
↑ comment by fiso64 (fiso) · 2022-06-20T06:43:02.836Z · LW(p) · GW(p)
I posted a somewhat similar response to MSRayne, with the exception that what you accidentally summon is not an agent with a utility function, but something that tries to appear like one and nevertheless tricks you into making some big mistake.
Here, if I understand correctly, what you get is a genuine agent which works across prompts by having some internal value function that outputs a different value after each prompt, and which acts accordingly. That doesn't seem incredibly unlikely: nothing in the process of evolution necessarily had to make humans optimizers themselves, but it happened anyway because that is what performed best at the overall goal of reproduction. This AI would still probably have to somehow convince the people communicating with it to give it "true" agency independent of the user's inputs, which seems like an instrumental value in this case.
1 comment
Comments sorted by top scores.
comment by Mitchell_Porter · 2022-06-20T01:25:13.554Z · LW(p) · GW(p)
One of the holy grails of AI has been "common sense knowledge" - the kind of comprehensive general knowledge about the concrete everyday world, that humans begin to acquire when just a few years old, and which we then keep refining throughout our lives. Before the large language models, the only halfway successful approach to this was Cyc, and they dealt with the problem by simply spoonfeeding their AI with tens of thousands of everyday propositions, laboriously added to its knowledge base by hand.
But as we have discovered, large language models, designed simply to learn and imitate patterns in very large collections of Internet text, can do a surprisingly good job of talking as if they are a person with a typical person's knowledge. There seems to be very little understanding of how they do this. But let's postulate that what they develop, are "chatbot schemas", conversational agents which roughly mimic the internal changes of state in a thinking and communicating human being, along with fragments of knowledge that can be drawn upon by the "chatbots".
A language model, then, is a kind of mirror held to the corpus of human writings, a mirror of sufficient fineness that it reveals some of the cognitive and conceptual structure implicit within those writings. But also an enchanted mirror that we can talk to, that summons persons and places that never existed, but which are fashioned according to the logic it has discerned in our productions.
Left to itself, the language model is passive, and random when it responds. But having discovered a learning process sufficiently deep and general that it cheaply produces imitations of agents with common-sense knowledge, the human race is now trying to harness that power, refine it, make it more predictable, turn it into part of a true AGI. In my opinion, that's where the sharpest dangers lie: not that a coherently malevolent agent will spontaneously crystallize inside a straightforward language model, but that a language model, reshaped and trained to be a dutiful part of a larger cognitive architecture, will also be part of what pushes that larger "mind" beyond human understanding or control.