AI-Based Code Generation Using GPT-J-6B

post by Tomás B. (Bjartur Tómas) · 2021-06-16T15:05:26.381Z · LW · GW · 13 comments

This is a link post for

Above is a link to an interesting post about synthetic code generation with a transformer model trained on The Pile, which contains a large chunk of GitHub and StackOverflow. Due to CommonCrawl's deficiency in this area, the much smaller GPT-J-6B outperforms OpenAI’s largest publicly available GPT-3 models at code generation. The performance is impressive enough that one wonders how capable a 100+ billion parameter model trained on The Pile will be, let alone what an AlphaGo-level engineering effort towards the end of synthetic code generation would achieve.

As The Pile was created to provide a dataset for models with 100 billion or more parameters, we may not have to wait long. The examples in the post are clearly trivial, but I personally take this to be something of a fire alarm. I was not previously aware of how poorly-optimized GPT-3 was for code generation, and I have updated toward surprising gains in this area in the next few years.

I no longer consider agents with superhuman performance in competitive programming to be a ridiculous thing to pursue. 

It is useful to remind myself of how shocked I would have been to see such things in 2012. In 2012 I would have taken this as a sign that AGI was near.

Scenario-based planning postulates that one should predict symptoms emblematic of a given scenario and then robotically assume you are in said scenario once a sufficient number of these symptoms come to pass. I am unsure whether there is wisdom in this approach, but I find it a discomfiting line of thought.


Comments sorted by top scores.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-06-16T21:05:51.550Z · LW(p) · GW(p)
Scenario-based planning postulates that one should predict symptoms emblematic of a given scenario and then robotically assume you are in said scenario once a sufficient number of these symptoms come to pass. I am unsure whether there is wisdom in this approach, but I find it a discomfiting line of thought.

Interesting, I don't know much about scenario-based planning but this seems like maybe a literature I should read up on? How strong is the evidence for this claim?

Replies from: Bjartur Tómas
comment by Tomás B. (Bjartur Tómas) · 2021-06-16T21:39:32.055Z · LW(p) · GW(p)

I don't know too much about it. But I do know it was used extensively by Shell; they credited it with allowing them to respond to the Oil Shock much quicker than their competitors. They had analyzed the symptoms of a similar scenario (which was considered extremely outlandish at the time of the scenario's creation) and began to notice eerie similarities between those symptoms and their present reality.

I see it as a sort of social technology that tries to assist an organization (and perhaps an individual) in resisting becoming the proverbial slowly-boiling frog. 

As to evidence of its efficacy, I am only aware of anecdotal evidence. There appears to be an extensive Wikipedia page on the topic but I have not read it - my knowledge comes mostly from hearing Vernor Vinge speak about the technique, as he assisted in scenario-creation for several companies.

Ever since I heard Vinge speak about this, I have occasionally tried to think about the present as if it were a scenario I developed in the past: what sort of scenario would it be, how surprised would my past self be, and so on. Seeing how much The Pile improved GPT-J's performance on this task triggered such thoughts.

comment by gwern · 2021-08-07T16:56:04.270Z · LW(p) · GW(p)

There's a finetuned model now too: "Genji is a transformer model finetuned on EleutherAI's GPT-J 6B model. This particular model is trained on Python-only code approaching 4GB in size."

Replies from: Bjartur Tómas
comment by Tomás B. (Bjartur Tómas) · 2021-08-10T17:57:08.741Z · LW(p) · GW(p)

Thanks! Any thoughts on Codex? Do you think insane progress in code generation will continue for at least a few years?

Replies from: gwern
comment by gwern · 2021-08-12T18:11:06.778Z · LW(p) · GW(p)

The approach of just generative self-supervised learning on existing source code corpuses is picking the low-hanging fruit. As impressive as it is to see Codex just knock out a web scraper, coding is very much a colonization wave sort of place: standalone code is fine, but the bulk of the work has always been maintenance and debugging of existing systems, not spitting out little self-contained scripts. Because of this asymmetry, Codex is a meaningful step towards UFAI, but is a smaller step in terms of automating programmers.

Replies from: gwern
comment by gwern · 2021-08-18T01:53:07.896Z · LW(p) · GW(p)

To elaborate on this a little more: maintenance is the kind of nasty field where '99% accurate' may still not be nearly good enough if you want to unlock big productivity gains of the sort you get by replacing humans entirely, rather than merely saving a few minutes here or there looking up API docs etc. Amdahl's law is not mocked: if a human has to manually review and step in, then it cannot deliver more than modest factor gains, any more than learning to type really fast will deliver life-changing productivity gains. Maintenance is almost by definition about the long tail of subtle bugs, system interactions, faulty assumptions, and business-driven requirement changes.* If you're a SWE at Google, you don't spend very much time writing little self-contained greenfield scripts of 100-500 lines. You'll spend a lot more time doing, say, code reviews of new pulls, which involve no writing of greenfield code at all. Something like Codex can help knock out the occasional script or help in learning a new system or be a very useful substrate for static analysis tools (like Coverity on steroids), but I can confidently predict that Codex is not going to make programmers even 10x more productive. Utility doesn't increase smoothly with accuracy: it plateaus and jumps. You don't want to use a voice transcription system which makes 10% errors, but at 5% it might suddenly become useful.
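The Amdahl's law point can be made concrete with a small calculation. This is only a sketch: if a tool speeds up a fraction p of a programmer's work by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The fractions used below are made-up illustrative numbers, not measurements of any real workflow.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is accelerated by a factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


# Hypothetical: 20% of the job is writing small greenfield scripts,
# and the tool makes that part effectively instantaneous.
script_only = amdahl_speedup(0.2, float("inf"))

# Hypothetical: the tool handles 90% of the job a billion times faster,
# but a human still does the remaining 10% at normal speed.
mostly_automated = amdahl_speedup(0.9, 1e9)

print(f"{script_only:.2f}x, {mostly_automated:.2f}x")
```

Under these assumed numbers, even an infinitely fast script-writer gives only a 1.25x gain, and a 10x gain requires automating more than 90% of the work.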

But ironically, in many ways, developing DL code is far simpler. Sometimes, solving a much harder problem is much easier. DL is much more self-contained and amenable to self-modification. The complexity of the learned tasks resides in the weights, not the seed algorithm which learns the NN; the seed algorithm may be extremely simple and short, a few hundred lines at most, including all the boilerplate and wrappers and ceremony. You can write backprop and CNNs in a few hundred lines for a self-contained CPU implementation. Available DL libraries let you create & train an arch like GPT in a few dozen lines (Karpathy does minGPT in <300 lines of bloated code). Rip Van Winkle is an interesting exercise in estimating complexity, in a Kolmogorov sort of way, of a formerly-SOTA CNN ResNet at 1,032 bits. Evolutionary search programs like AutoML-Zero can recapitulate backprop and other core algorithms in a few lines. We also see this in the breakthroughs themselves: why do MLPs suddenly work? Because you add like 1 line to re-normalize or gate intermediate activations, while 99.9% of the code remains the same. Why did resnets suddenly make 'deep' (>10) layer NNs work? Because you add like 1-3 lines to define a shortcut connection. Why did NNs suddenly start working around 2009? Because you added 1 line for the right initialization, and 1 line for a ReLU instead of sigmoid nonlinearity. Why did X work - we could go on all day. (Why is one person a genius and another person ordinary? Differences at a few thousand alleles which could be encoded in less than a kilobyte. Why did humans take over the world and chimpanzees are in zoos if our genomes are like 99.9% identical? Everything is fragile.) The space of all possible programs of a few hundred self-contained lines to bootstrap a general meta-learning agent is vast... but it's also exactly the sort of task where a self-supervised agent can acquire most of the necessary bits from the environment, solving basic problems like how to create valid ASTs (the sort of knowledge that isn't in AutoML-Zero-esque systems, and mostly accounts for their boil-the-ocean inefficiency), and then use the tiny bit of supervision from evolutionary RL losses to close the gap by selecting only plausible modifications to test, running a feasible number of iterations, and modifying the last handful of key lines.
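To illustrate how small the "shortcut connection" change is in code, here is a toy pure-Python sketch. It is not taken from any real library or paper implementation; the function names and the zero-weight setup are assumptions chosen only to show that the two blocks differ by one line, and that the difference matters once many blocks are stacked.

```python
def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def linear(w, v):
    """Matrix-vector product; w is a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, v)) for row in w]


def plain_block(w, x):
    return relu(linear(w, x))


def residual_block(w, x):
    # The one-line difference: add the input back onto the block's output.
    return [xi + hi for xi, hi in zip(x, relu(linear(w, x)))]


def stack(block, w, x, depth):
    """Apply the same block `depth` times in a row."""
    for _ in range(depth):
        x = block(w, x)
    return x


# With all-zero weights (an extreme, illustrative case) a plain stack
# destroys its input, while the residual stack passes it through unchanged.
zero_w = [[0.0, 0.0], [0.0, 0.0]]
x0 = [1.5, 2.0]
print(stack(plain_block, zero_w, x0, 20))
print(stack(residual_block, zero_w, x0, 20))
```

The zero-weight case is deliberately degenerate; it only shows that the shortcut gives every block an identity path, not why resnets train well in practice.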

Thus, an asymmetry in code-generating AIs. A code-generating AI could be almost completely useless for 'easy' maintenance tasks like fixing bugs in production code because it comes with so much overhead and unreliability that it isn't worth the hassle, but also still offer enormous exponential gains in ranking candidates for the 'hard' problem of rewriting a core DL algorithm. It is unfortunate that we live in a world where you can apparently be 99.9% of the way to a human or an AGI and the result be completely useless, rather than 99.9% as powerful, because it means you may get no warning signs before that last 1-line fix; but that looks like the world we live in, as opposed to a gradualist world where half-working AIs take over half the global economy or something.

* If you've paid attention to the popups on, you've probably noticed that they've changed a number of times; the Wikipedia popups, specifically, have now gone through 8 completely different implementations. The 8th iteration, ironically, is very similar to the 1st iteration: it requests from the Wikipedia APIs an article summary and displays it; that's all. I & Obormot have spent a breathtaking amount of time on this, not because the actual coding itself takes up substantial time (none of it is remotely impressive algorithmically), but because the hard part is understanding what even should be done in the first place and what tradeoff between static, dynamic, inlined vs external, popup vs popin etc works best, implementing and testing in the real world to see how it felt in practice and what users thought, how it scaled as I fixed bugs & found edge-cases... By the 8th iteration, what we'd learned was that static or inlined couldn't work at scale or provide recursion in any feasible way and were deadends, and the main motivation for those - displaying hyperlinked excerpts - was moot because we were using the wrong WP API in the first iteration, and there was a 'mobile' API which, I discovered after hours of docs reading, provided useful rather than butchered excerpts and worked fine all along. "Time is a circle."

comment by Brendan Long (korin43) · 2021-06-16T18:12:25.087Z · LW(p) · GW(p)

These responses look to me more like the AI is really good at searching StackOverflow and returning the code from a random answer, not really writing code itself. I guess being able to replace programmers who Google stuff and copy the first answer from StackOverflow without understanding it will save some time, but it doesn't really feel transformational to me.

Replies from: matthew-barnett
comment by Matthew Barnett (matthew-barnett) · 2021-06-17T04:16:26.304Z · LW(p) · GW(p)

Have you tried, y'know, testing your belief? ;)

You can Google its answers. I've been googling its answers and am not generally finding direct copy-pastes for each, though I'm also a bit confused about why Google is obtaining no results for short strings such as "(s[0:length] == s[length::-1])".

ETA: even if it's copying code but modifying it slightly so that the variable names match, it seems like (1) this is itself pretty impressive if it actually works reliably, and (2) I don't think the claim is that current tech is literally shovel ready to replace programmers. That would be a strawman. It's about noticing the potential of this tech before it reaches its destination.

comment by Matthew Barnett (matthew-barnett) · 2021-06-17T04:05:33.230Z · LW(p) · GW(p)

I no longer consider agents with superhuman performance in competitive programming to be a ridiculous thing to pursue. 

Dan Hendrycks and Steven Basart et al. recently released APPS, an ML benchmark for measuring the performance of ML models at the task of writing code. One part of their benchmark measures the performance of code on competitive programming questions. I wrote a Metaculus question on when people expect this benchmark to be solved -- operationalized as getting above 80% strict accuracy on the competitive programming section.

Initial results are encouraging. GPT-Neo 2.7B passes nearly 20% of test cases on average for introductory coding problems, when the model is allowed to give 5 attempts (see Table 4 in the paper). A fine-tuned GPT-J-6B is likely to be even better.
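The two figures mentioned above measure different things, which a small sketch may make clearer. As I understand the APPS metrics, the test case average is the fraction of test cases passed, averaged over problems, while strict accuracy counts a problem as solved only if every one of its test cases passes. The results below are invented for illustration and the function names are my own, not from the APPS codebase.

```python
def avg_test_case_pass_rate(results):
    """Mean over problems of the fraction of test cases passed."""
    return sum(sum(r) / len(r) for r in results) / len(results)


def strict_accuracy(results):
    """Fraction of problems on which every test case passed."""
    return sum(all(r) for r in results) / len(results)


# Hypothetical per-problem outcomes: one list of pass/fail flags per problem.
results = [
    [True, True, True, False],    # 3 of 4 cases pass: not solved under the strict metric
    [True, False, False, False],
    [False, False],
    [True, True],                 # the only problem that is fully solved
]

print(avg_test_case_pass_rate(results))
print(strict_accuracy(results))
```

With these made-up outcomes the pass rate is 0.5 but strict accuracy is only 0.25, which is why an 80% strict-accuracy threshold is a much higher bar than an 80% test case average.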

Replies from: p.b.
comment by p.b. · 2021-06-17T12:52:34.106Z · LW(p) · GW(p)

The APPS repository also gives the fine-tuned weights for GPT-Neo-2.7B and code to run it. Though without a GPU it takes roughly forever.

I asked Dan Hendrycks for the performance of GPT-J-6B on APPS on the Eleuther AI discord. He didn't say they were definitely going to test it, but my take-away was that it might happen. 

I could imagine test-driven automated programming evolving in the next ten to twenty years, where an LM-guided search tries to create functions, according to a description, that pass all the test cases.
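A minimal sketch of what such a generate-and-test loop might look like, under heavy assumptions: `propose_candidates` is a hypothetical stand-in for a language model and simply yields a few hard-coded source strings, and the task, test cases, and function names are all invented for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def propose_candidates(description: str) -> Iterable[str]:
    """Hypothetical stand-in for an LM: yields candidate source strings."""
    yield "def is_palindrome(s):\n    return True\n"           # wrong logic
    yield "def is_palindrome(s):\n    return s == s[::-1\n"    # syntax error
    yield "def is_palindrome(s):\n    return s == s[::-1]\n"   # correct


def passes_all(source: str, name: str, cases: List[Tuple[str, bool]]) -> bool:
    """Run a candidate against every test case; any exception counts as failure."""
    namespace: dict = {}
    try:
        # A real system would sandbox this; exec on untrusted code is unsafe.
        exec(source, namespace)
        fn = namespace[name]
        return all(fn(arg) == expected for arg, expected in cases)
    except Exception:
        return False


def search(description: str, name: str, cases: List[Tuple[str, bool]],
           budget: int = 10) -> Optional[str]:
    """Return the first candidate that passes all cases, or None if the budget runs out."""
    for i, source in enumerate(propose_candidates(description)):
        if i >= budget:
            break
        if passes_all(source, name, cases):
            return source
    return None


cases = [("abba", True), ("abc", False), ("", True)]
found = search("Return True if s reads the same backwards.", "is_palindrome", cases)
print(found)
```

The design choice worth noting is that the test cases, not the model, are the final judge: the model only has to rank plausible candidates highly enough that a correct one appears within the budget.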

comment by Tomás B. (Bjartur Tómas) · 2021-09-25T15:10:31.382Z · LW(p) · GW(p)

This post is obsolete now with Codex, but it is interesting how (even knowing little about ML myself) just hanging around ML people on Discord let me get a better sense of the future than I would have otherwise. Perhaps a post like The Power of Lurking might be worthwhile.

comment by movin nymu (movin-nymu) · 2021-08-07T16:01:47.863Z · LW(p) · GW(p)

GPT-J-6B was impressive, but still not human level

comment by mu-lippy · 2021-10-05T13:13:00.923Z · LW(p) · GW(p)