A trick for Safer GPT-N 2020-08-23T00:39:31.197Z
What should an Einstein-like figure in Machine Learning do? 2020-08-05T23:52:14.539Z


Comment by razied on Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda · 2020-09-04T12:33:06.473Z · LW · GW

GPT-N is trained to predict the next word across human-readable text, so how do you convert the Modular GPT weights into an input that GPT-N can turn into human-readable descriptions? That sort of input is vanishingly rare on the internet; people aren't in the habit of looking at a list of deep learning weights and then outputting descriptions of what they do. I am very dubious that GPT-N will just naturally pick up that ability from human text.

Comment by razied on A trick for Safer GPT-N · 2020-08-24T23:14:25.786Z · LW · GW

Here are the problems I found (so far) with this approach and my attempt at solutions:

Problem 1: We won't think of all possibilities for bad stuff happening, and we might miss a crucial failure mode in our text corpus.

Solution 1: We might try to outsource the creation of some of those possibilities to GPT-N. This would be a much bigger problem if GPT were explicitly optimizing against us, but it isn't.

Problem 2: GPT-N might predict that the world ends in such a way that stuff like the headlines of newspapers don't change, or that things go badly rapidly enough that humanity doesn't have time to notice it.

Solution 1: we can ask GPT-N for the probabilities at fine-grained time slices, and for text modalities that respond quickly to events, like tweets or forum posts.

Solution 2: for the case where things go badly rapidly enough that we don't notice it at all, we can ask GPT-N for the probability of pictures (say a picture of downtown New York, still intact with living humans). Pictures respond faster to events in the real world. The only way to beat this is with something like a complete takeover of all forms of electronic communication at the same time, or with the complete destruction of everything at the same time. Ideally we would have one of the cameras up on a satellite, taking pictures of the Earth, so it wouldn't be immediately destroyed.

Solution 3: we need to sign all the items in our dataset with cryptographic keys stored on computers in Faraday cages. Organizations should set up Twitter accounts that publish signed messages/pictures every day or so, to serve as "dead man indicators". This ensures that a potential fast doom scenario cannot be purely electronic; it needs to be physical.
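A minimal sketch of the signing idea in Solution 3, using only the Python standard library. Everything here is an assumption for illustration: the key, the function names, and the choice of HMAC. HMAC is a symmetric scheme, so anyone who can verify a tag could also forge one; a real "dead man indicator" would more plausibly use public-key signatures, with the private key kept offline.

```python
import hashlib
import hmac

# Hypothetical secret; in the scenario above it would live on an
# air-gapped machine inside a Faraday cage.
SECRET_KEY = b"example-key-kept-offline"


def sign_item(item: bytes, key: bytes = SECRET_KEY) -> str:
    """Return a hex tag that binds the item to the secret key."""
    return hmac.new(key, item, hashlib.sha256).hexdigest()


def verify_item(item: bytes, tag: str, key: bytes = SECRET_KEY) -> bool:
    """Check a tag in constant time; False means tampered or unsigned."""
    return hmac.compare_digest(sign_item(item, key), tag)


daily_message = b"2020-08-24: still here"
tag = sign_item(daily_message)
print(verify_item(daily_message, tag))         # genuine message
print(verify_item(b"2020-08-24: forged", tag))  # altered message
```

The point of the sketch is only that an unsigned or altered item is detectable, so an adversary would need physical access to the key, not just control of the network.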

Problem 3: The optimization pressure scales with the number of questions we ask of GPT-N, especially if the questions we ask are informed by GPT's past refusals or answers. If we get to a situation where GPT refuses to answer all our questions, and we keep asking them, then we are optimizing against the safety conditions.

Solution 1: this means that pretty much the first thing we need to ask GPT-N is how to safely extend the number of questions we can ask.
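A toy way to see why the pressure in Problem 3 grows with the number of questions. The model below is my own simplifying assumption, not something from the post: each question independently has the same small chance of slipping an unsafe answer past the safety conditions. Questions informed by past refusals are not independent, so this probably understates the real effect.

```python
def prob_at_least_one_failure(p_fail_per_question: float, n_questions: int) -> float:
    """Chance that at least one of n independent questions defeats the filter."""
    return 1.0 - (1.0 - p_fail_per_question) ** n_questions


# Hypothetical per-question failure rate of 1%, chosen only for illustration.
for n in (1, 10, 100, 1000):
    print(n, prob_at_least_one_failure(0.01, n))
```

Even a small per-question failure rate compounds toward certainty as the question count grows, which is why extending the question budget safely has to come first.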

Comment by razied on Thoughts on the Feasibility of Prosaic AGI Alignment? · 2020-08-23T18:31:51.017Z · LW · GW

I am not sure at all about a specific probability for this exact chain of events. I think the secrecy part is quite likely (90%) to happen once a lab actually gets something human-level, no matter their commitment to openness; I think seeing their model become truly human-level would scare the shit out of them. Patching obvious security holes also seems 90% likely to me; even Yann LeCun would do that. The real uncertainties come from whether the lab would try to use the model to solve AI safety, or whether they would think their security patches are enough and push for monetizing the model directly. I'm pretty sure DeepMind and OpenAI would do something like that; I'm unsure about the others.

Regarding the probability of transformative AI being prosaic, I'm thinking 80%. GPT-3 has basically guaranteed that we will explore that particular approach as far as it can go. When I look at all the ways I can think of to make GPT better (training it faster, merging image and video understanding into it, giving it access to true metadata for each example, longer context length, etc.), I see just how easy it is to improve it.

I am completely unsure about timelines. I have a small project going on where I'll try to get a timeline probability estimate from estimates of the following factors:

  1. cheapness of compute (including next generation computing possibilities)

  2. data growth: text, video, images, games, VR interaction

  3. investment rate (application vs leading research)

  4. response of investment rate to increased progress

  5. response of compute availability to investment

  6. researcher numbers as a function of increased progress

  7. different approaches that could lead to AGI (Minecraft-style simulation; joint text comprehension with image and video understanding; generative stuff?)

  8. level of compute required for AGI

  9. effect of compute availability on speed of algorithm discovery (architecture search)

  10. discovery of new model architectures

  11. discovery of new training algorithms

  12. discovery of new approaches (like GANs, AlphaZero, etc.)

  13. switch to secrecy and impact on speed of progress

  14. impact of safety concerns on speed
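A rough sketch of how a few of the factors above could be combined into a timeline probability by Monte Carlo sampling. This is not the actual project; it collapses the list down to two quantities (compute required, factor 8, and growth in available compute, roughly factors 1, 3 and 5), and every number in it is a placeholder with no empirical basis.

```python
import random


def sample_years_to_threshold(rng: random.Random) -> float:
    """Draw one hypothetical world and return years until compute suffices."""
    # Placeholder ranges, NOT estimates: orders of magnitude of extra
    # compute needed, and orders of magnitude gained per year.
    required_ooms = rng.uniform(2.0, 8.0)
    ooms_per_year = rng.uniform(0.2, 1.0)
    return required_ooms / ooms_per_year


def prob_within(years: float, n_samples: int = 10_000, seed: int = 0) -> float:
    """Fraction of sampled worlds that reach the threshold within `years`."""
    rng = random.Random(seed)
    hits = sum(sample_years_to_threshold(rng) <= years for _ in range(n_samples))
    return hits / n_samples


for horizon in (5, 10, 20, 40):
    print(horizon, prob_within(horizon))
```

A fuller version would give each of the fourteen factors its own distribution and let them interact (for example, investment responding to progress), which is where most of the real uncertainty would come from.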

Comment by razied on Thoughts on the Feasibility of Prosaic AGI Alignment? · 2020-08-22T23:54:14.968Z · LW · GW

My estimate for the most likely Good Path is something like the following:

1- build a superhuman-level GPT-N

2- enforce absolute secrecy and very heavily restrict access to the model.

3- patch obvious security holes.

4- make it model future progress in AI safety by asking it to predict the contents of highly cited papers from the 2050s.

5- rigorously vet and prove the contents of those papers

6- build safe AGI from those papers

Comment by razied on The Fusion Power Generator Scenario · 2020-08-15T14:54:29.578Z · LW · GW

The practical problem with that is probably that you need to manually decide which papers go in which category. GPT needs such an enormous amount of data that any curation needs to be automated. So metadata like authors, subject, date, and website of provenance is quite easy to obtain for each example, but really high-level stuff like "paper is about applying the methods of field X in field Y" is really hard.

Comment by razied on How GPT-N will escape from its AI-box · 2020-08-13T17:16:05.168Z · LW · GW

Future versions of such models could well work in other ways than text continuation, but this would require new ideas not present in the current way these models are trained, which is literally by trying to maximise the probability they assign to the true next word in the dataset. I think the abstraction of "GPT-N" is useful if it refers to a simply scaled-up version of GPT-3: no clever additional tricks, no new paradigms, just the same thing with more parameters and more data. If you don't assume this, then "GPT-N" is no more specific than "Deep Learning-based AGI", and we can then only talk in very general terms.
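To make "maximise the probability assigned to the true next word" concrete, here is a toy version of that objective written as a loss to minimise (the average negative log-probability of the true next word). The function name and the dictionary-based "model output" are stand-ins I made up; a real model produces these probabilities from a neural network over a large vocabulary.

```python
import math


def next_token_loss(predicted_probs, true_tokens):
    """Average negative log-likelihood of the true next tokens.

    predicted_probs: one dict per position, mapping token -> probability.
    true_tokens: the token that actually came next at each position.
    """
    total = 0.0
    for probs, token in zip(predicted_probs, true_tokens):
        total -= math.log(probs[token])
    return total / len(true_tokens)


# Toy predictions for the continuation "the cat sat".
predictions = [
    {"the": 0.6, "a": 0.4},
    {"cat": 0.5, "dog": 0.5},
    {"sat": 0.9, "ran": 0.1},
]
print(next_token_loss(predictions, ["the", "cat", "sat"]))
```

Nothing in this objective rewards being correct about the world; it only rewards matching what the dataset says next.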

Regarding the exploits, you need to massage your question so that the answer you want is what GPT-N predicts a human would most likely write after your question. Over the whole internet, most of the time when someone asks someone else a really hard question, the human who writes the text immediately after that question will either a) be wrong or b) avoid the question. GPT-N isn't trying to be right; to it, avoiding your question or being wrong is perfectly fine, because that's what it was trained to output after hard questions.

To generate such an exploit, you need to convince GPT-N that the text it is being shown is actually coming from really competent humans, so you might try to frame your question as the beginning of a computer science paper, maybe written far in the future, and which has lots of citations, written by a collaboration of people GPT-N knows are competent. But then GPT-N might predict that those humans would not publish such a dangerous exploit, so it would yet again evade you. After a bit of trial and error, you might well corner GPT-N into producing what you want, but it will not be easy.

Comment by razied on How GPT-N will escape from its AI-box · 2020-08-12T23:45:09.851Z · LW · GW

How is this clever JavaScript code the most likely text continuation of the human's question? GPT-N outputs text continuations; unless the human input is "here is malicious JavaScript code, which hijacks the browser when displayed and takes over the world: ... ", GPT-N will not output anything like it. In fact, such code is quite hard to write, and would not really be what a human would write in response to that question, so you'd need to do some really hard work to get something like GPT-N (assuming the same training setup as GPT-3) to actually output malicious code. Of course, some idiot might in fact ask that question, and then we're screwed.

Comment by razied on The Fusion Power Generator Scenario · 2020-08-09T11:15:28.455Z · LW · GW

You can probably avoid the generation of crank works and fiction by training a new version of GPT in which every learning example is labeled with <year of publication> and <subject matter>, which GPT has access to when it predicts an example. If you then write a prompt and condition on something like <year: 2040> <subject matter: peer-reviewed physics publication>, you can easily tell GPT to avoid fiction and crank works, as well as make it model future scientific progress.
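A small sketch of what that labeling could look like in practice. The tag format simply copies the one written above; the function names are hypothetical, and how a real training pipeline would encode metadata is an open design choice.

```python
def tag_example(text: str, year: int, subject: str) -> str:
    """Prefix a training example with its metadata tags."""
    return f"<year: {year}> <subject matter: {subject}>\n{text}"


def build_prompt(year: int, subject: str, opening: str = "") -> str:
    """Build a generation prompt conditioned on the same tags."""
    return tag_example(opening, year, subject)


# At training time every example carries its true metadata.
print(tag_example("We measure the decay rate of...", 2019,
                  "peer-reviewed physics publication"))

# At generation time we ask for metadata that no training example has.
print(build_prompt(2040, "peer-reviewed physics publication", "Abstract: "))
```

Because the tags sit at the start of every example, the model sees them before predicting any of the text, which is what lets the prompt steer it away from fiction.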

Comment by razied on The Fusion Power Generator Scenario · 2020-08-09T02:36:17.433Z · LW · GW

Oh, I have no doubt that this is no guarantee of safety, but with the likelihood of AGI being something like GPT-N going up (and the solution to Alignment being nowhere in sight), I'm trying to think of purely practical solutions to push the risks as low as they will go: something like keeping the model parameters secret, maybe not even publicizing the fact of its existence, using it only by committee, and only to attempt to solve alignment problems, whose proposed solutions are then checked by the Alignment community. Really, the worst-case scenario is if we have something powerful enough to pose massive risks but not powerful enough to help solve alignment, though that doesn't seem too likely to me. Another bad case is that the solution to alignment the AI proposes turns out to be really hard to check.

Comment by razied on The Fusion Power Generator Scenario · 2020-08-09T01:54:57.794Z · LW · GW

It seems that with a tool-AI like GPT-N, the solution would probably be to dramatically restrict its use to its designers, who should immediately ask it how to solve alignment, which by assumption it can do. The real risk is in making the tool-AI public.

Comment by razied on From self to craving (three characteristics series) · 2020-05-23T14:50:08.661Z · LW · GW

Always great to see contemplative practice being represented here! If people want a quantitative estimate of the value of doing these practices, I've heard Shinzen Young say a few times that he'd rather have one day of his experience than 20 years of rich/famous/powerful life. Culadasa has agreed with this estimate in public, and Daniel Ingram agreed with it right away in a private video chat with me. This places the lower bound of the meditation-induced increase in life satisfaction somewhere around 4 orders of magnitude.
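The arithmetic behind the "4 orders of magnitude" figure, as a quick check. It assumes the claim is read literally as a ratio of value per day, with one day of one kind of experience worth at least 20 years of the other.

```python
import math

DAYS_PER_YEAR = 365.25

# One day judged to be worth at least 20 years' worth of days.
ratio = 20 * DAYS_PER_YEAR
orders_of_magnitude = math.log10(ratio)

print(ratio)                # about 7,300 days
print(orders_of_magnitude)  # a little under 4
```

So the ratio is roughly 7,300 to 1, which rounds to the 4 orders of magnitude quoted.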