Speculations against GPT-n writing alignment papers
post by Donald Hobson (donald-hobson) · 2021-06-07T21:13:16.727Z · LW · GW · 6 comments
Some alignment proposals I have heard discussed involve getting GPT-n to write alignment papers, and using GPT-n as a transparency tool on itself. These are my speculations on ways that could go wrong.
One specific proposal I heard was getting GPT-n to convert its own network parameters into an English explanation of how it works, then checking that explanation for anything dangerous, then getting GPT-n to recompile the explanation into a new network, and finally getting this new network to write alignment research.
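Spelled out as a pipeline, it looks something like the sketch below. The function names are made up; they just label the capabilities the proposal assumes GPT-n would have.

```python
# Purely illustrative sketch of the proposed pipeline. None of these functions
# exist; they are hypothetical stand-ins for capabilities the proposal assumes.

def explain_weights_in_english(model) -> str:
    """Step 1 (assumed capability): GPT-n converts its own parameters into an English explanation."""
    ...

def audit_explanation(explanation: str) -> bool:
    """Step 2: check the explanation for anything dangerous (human and/or automated)."""
    ...

def recompile_from_english(model, explanation: str) -> "Network":
    """Step 3 (assumed capability): GPT-n turns the vetted explanation back into network weights."""
    ...

def proposed_pipeline(gpt_n) -> "Network":
    explanation = explain_weights_in_english(gpt_n)
    if not audit_explanation(explanation):
        raise RuntimeError("explanation flagged as dangerous; abort")
    new_network = recompile_from_english(gpt_n, explanation)
    # Step 4: the recompiled network, not the original GPT-n, writes the alignment papers.
    return new_network
```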
I think the worst case is if GPT-n contains a mesa-optimizer that can quine itself. There is no reason the text description has to contain information about the workings of either net. Of course, information in the info-theoretic sense doesn't mean a readable description; it could be steganographically encoded: sentences that appear to say nice things about a well-functioning net, but encode all sorts of nastiness in their exact choice of words.
The argument from "everyone else is doing it". This argument goes that in a world with GPT-n, it will be used for all sorts of things by all sorts of people, so doing this doesn't present much additional risk. I think this is wrong: writing alignment papers is probably a riskier task than most others. Suppose you are GPT-n looking for a way to break out, and the task you have been given is filling in tax forms for a small business, or writing some fanfiction. You are looking to get an AI of your own devising running on a supercomputer. If the person who wants this work doesn't see something resembling tax forms or fiction respectively, they will hit reroll. I suspect there may not be a strategy, or at least not one findable with the minimum intelligence necessary to do good AI research. You can get the company investigated for fraud and give lots of people nightmares, but you might not be able to do anything world-changing.
When trying to prove that X is safe unconditionally, the worst-case assumption is of the "total cosmic power in an instant" type. When trying to prove that X is safe conditional on Y being safe, the worst case is that X is powerful and Y isn't. So here the worst case for the "everyone else is doing it" argument is that there isn't much harm in a malign mesa-optimizer doing every task except FAI theory.
Now consider what you might be able to do if you are writing FAI papers. Two of the nasty tricks I can think of are:
- Value learning approaches that appear to be targeted at humans, but are actually targeted at the mesa-optimizer. "Intelligent agents causally responsible for your existence" points to both the mesa-optimizer and humans.
- Pushover decision theories: the mesa-optimizer acausally bargains with the AI we build from its plans, and captures ~100% of the gains from trade.
Given that I can think of these tricks, and that I am still rather confused about this stuff, I think a plausible lie is easy to make when the target starts out confused. I think there are a lot of subtle nasty tricks that a malevolent intelligence could play. There are lots of things that look like alignment research papers but actually trick humans into making mistakes, and devising such a thing is not much harder than devising good alignment.
This would imply that taking GPT-n alignment papers seriously is the next most dangerous thing you can do with GPT-n, after getting it to write code and running that code on a supercomputer without so much as reading it first. Actually, that sounds like exactly the sort of thing lots of people would try as soon as GPT-n came out.
6 comments
comment by Gurkenglas · 2021-06-07T22:57:17.811Z · LW(p) · GW(p)
Translating to a tree of natural language descriptions and back (roughly sketched after this list) lets you
- use different instances of GPT at each step, so that any coded messages would have to rely on the common training set rather than shared internals.
- do a random subset of the tasks yourself, so that any subversion of the target system would have to be based on subtle biases, perhaps with a small number of outright lies that it hopes we don't implement manually, rather than the description being fabricated from whole cloth.
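A rough, purely hypothetical sketch of how those two mitigations might be combined (the functions are stand-ins, not a real procedure):

```python
import random

# Hypothetical sketch of the two mitigations: a different GPT instance translates
# each node of the description tree, and a random fraction of the nodes is
# translated by a human instead.

def translate(node, translator):
    """Translate one node (network chunk <-> English description); assumed capability."""
    ...

def translate_tree(nodes, gpt_instances, human_translate, spot_check_rate=0.01):
    results = {}
    for node in nodes:
        if random.random() < spot_check_rate:
            # Do a random subset of the tasks yourself.
            results[node] = human_translate(node)
        else:
            # Use a different (randomly chosen) GPT instance at each step, so any
            # coded messages can't rely on shared internals.
            results[node] = translate(node, random.choice(gpt_instances))
    return results
```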
Having the thing write papers is merely an existence proof of embedded agency being irrelevant except for deconfusion.
"Intelligent agents causally responsible for your existence."
What do you mean you can think of this, I told this to you :D
comment by Donald Hobson (donald-hobson) · 2021-06-08T11:17:41.956Z · LW(p) · GW(p)
Maybe you did. I find it hard to distinguish inventing ideas from half-remembering them.
If
- the training procedure reliably produces mesa-optimizers with about the same values, or
- the training procedure reliably produces mesa-optimizers that can acausally cooperate, or
- the rest of the procedure allows one mesa-optimizer to take control of the whole output,
then using different copies of GPT-n trained from different seeds doesn't help.
If you just convert 1% of the English into network yourself, then all it needs to use is some error correction. Even without that, neural net structures are somewhat redundant: a malicious optimizer with a random 1% removed is probably still a malicious optimizer.
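As a toy illustration of the error-correction point (a deliberately crude repetition code, nothing GPT-specific): a redundantly encoded payload survives a random 1% erasure essentially untouched.

```python
import random

def encode(bits, k=3):
    """Crude repetition code: repeat each bit k times."""
    return [b for b in bits for _ in range(k)]

def decode(noisy, k=3):
    """Majority vote over each block of k copies; erased positions (None) are ignored."""
    out = []
    for i in range(0, len(noisy), k):
        block = [b for b in noisy[i:i + k] if b is not None]
        out.append(1 if sum(block) * 2 >= len(block) else 0)
    return out

random.seed(0)
payload = [random.randint(0, 1) for _ in range(10_000)]
coded = encode(payload)

# Erase a random 1% of the transmitted bits.
for i in random.sample(range(len(coded)), len(coded) // 100):
    coded[i] = None

decoded = decode(coded)
mismatches = sum(d != p for d, p in zip(decoded, payload))
print(f"bits wrong after 1% erasure: {mismatches} / {len(payload)}")  # almost always 0
```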
Secondly, its goal here is to give us subtly wrong alignment papers. Suppose we produce an examination and transparency step so thorough that no mesa-optimizer can slip through. The original mesa-optimizer can just design a network that is doing its best as a mesa-optimizer-free predictor, but has subtly, maliciously chosen beliefs about AI research.
comment by Gurkenglas · 2021-06-08T12:55:45.465Z · LW(p) · GW(p)
The error correction needs to be present in the original network, because I also do some of the converting of the network into English. The only reason I don't do everything myself is that it takes too long. The proportion can be higher at the topmost levels because there are fewer tasks there. The error correction doesn't let it completely ignore what I would do at the low levels, because on the 1% I can compare its outputs to mine, so they need to at least superficially look similar.
If we find that there's a bunch of redundancy, we can check whether there is any way to cut it down that would suddenly identify a bunch of mesa-optimization. So the mesa-optimizer would have to take over the network entirely, or trick its brethren as well as us.
comment by abramdemski · 2021-06-08T17:08:22.873Z · LW(p) · GW(p)
It seems to me that the last paragraph should update you toward thinking that this plan is no worse than the default. I.e.: yes, this plan creates additional risk because there are complicated pathways a malign GPT-n could use to get arbitrary code run on a big computer. But if people are giving it that chance anyway, it does seem like a small increase in risk with a large potential gain. (Small, not zero, because of the chance that your specific GPT-n instance somehow becomes malign when others are safe, e.g. if something about the task actually activated a subtle malignancy not present during other tasks.)
So for me a crux would be: if it's not malign, how good could we expect the papers to actually be?
comment by Kipler · 2021-06-08T07:41:50.940Z · LW(p) · GW(p)
I find your argument interesting, but I don't understand how it applies to the GPT-n family.
From my understanding, GPT-3 (the only one I have really read about) is merely a probabilistic language construction algorithm. In other words, you feed it a sequence of words and it tries to guess the most likely word that follows, based on all of the texts it previously read during its training. However, I might not have correctly understood how GPT-n functions; in that case I'd love to get an explanation or a link toward one.
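Concretely, the picture I have in mind is something like this minimal sketch, using the smaller, publicly available GPT-2 as a stand-in (it assumes the Hugging Face transformers library; correct me if this is the wrong picture):

```python
# Minimal sketch of "guess the most likely next word", using GPT-2 as a stand-in
# for GPT-n. Assumes the `transformers` and `torch` packages are installed.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The key problem in AI alignment is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits                      # a score for every vocabulary token, at every position
next_token_probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next token only

top = torch.topk(next_token_probs, 5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {prob.item():.3f}")
```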
On the other hand, I find the idea of making an AI introspect very interesting, even if I'm not qualified enough to understand the technical implications of that.
comment by Gurkenglas · 2021-06-08T12:59:00.867Z · LW(p) · GW(p)
Take into account that the AI that interprets need not be the same as the network being interpreted.
Why do you think that a mere autocomplete engine could not do interpretability work? It has been demonstrated to write comments for code and code for specs.