How will OpenAI + GitHub's Copilot affect programming?

post by smountjoy · 2021-06-29T16:42:35.880Z · LW · GW · 22 comments

This is a question post.

GitHub Copilot is powered by OpenAI Codex, a new AI system created by OpenAI. OpenAI Codex has broad knowledge of how people use code and is significantly more capable than GPT-3 in code generation, in part, because it was trained on a data set that includes a much larger concentration of public source code.

Will Copilot or similar systems become ubiquitous in the next few years? Will they increase the speed of software development or AI research? Will they change the skills necessary for software development?

Is this the first big commercial application of the techniques that produced GPT-3?

For anyone who's used Copilot, what was your experience like?


answer by jimrandomh · 2021-06-29T19:31:09.214Z · LW(p) · GW(p)

This will probably make the already-bad computer security/infosec situation significantly worse. In general, people pasting snippets they don't understand is bad news; but at least in the case of StackOverflow, there's voting and comments, which can catch the most egregious stuff. On the other hand, consider the headline snippet on Copilot's landing page:

// Determine whether the sentiment of text is positive
// Use a web service
async function isPositive(text: string): Promise<boolean> {
  const response = await fetch(``, {
    method: "POST",
    body: `text=${text}`,
    headers: {
      "Content-Type": "application/x-www-form-urlencoded",
    },
  });
  const json = await response.json();
  return json.label === "pos";
}
Everything below the function prototype was AI-generated. And this code has TWO security vulnerabilities! First, it's using an http URL rather than https. And second, because the input string is interpolated into the body without escaping, input containing reserved characters (or encoded newlines) can put values into fields other than the one intended. In this specific case, with the sentiment analysis API demo, there's not much to do with that (the only documented field available other than text is language), but if this is representative, then we are probably in for some bad times.
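To make the second issue concrete, here is a small sketch (my own illustration, not from the post) of how unescaped interpolation into a form-encoded body lets input spill into other fields, and how `URLSearchParams` avoids it:

```typescript
// Hypothetical attacker-controlled input (illustrative only):
const userInput = "great movie&language=dutch";

// Vulnerable construction, as in the generated snippet: reserved
// characters like "&" and "=" pass through unescaped, so the input
// injects an extra "language" form field.
const rawBody = `text=${userInput}`;
// rawBody === "text=great movie&language=dutch"

// Safer construction: URLSearchParams percent-encodes reserved
// characters, keeping the whole input inside the "text" field.
const safeBody = new URLSearchParams({ text: userInput }).toString();
// safeBody === "text=great+movie%26language%3Ddutch"
```

The same concern applies to any `application/x-www-form-urlencoded` body built by string concatenation.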

comment by Alex Ray (alex-ray) · 2021-06-30T06:16:16.092Z · LW(p) · GW(p)

(Disclaimer: I work at OpenAI, and I worked on the models/research behind copilot.  You should probably model me as a biased party)

This will probably make the already-bad computer security/infosec situation significantly worse.

I'll take the other side to that bet (the null hypothesis), provided the "significantly" unpacks to something reasonable.  I'll possibly even pay to hire the contractors to run the experiment.

I think a lot of people make a lot of claims about new tech that will have a significant impact that end up falling flat.  A new browser will revolutionize this or that; a new website programming library will make apps significantly easier, etc etc.

I think a good case in point is TypeScript.  JavaScript is the most common language on the internet.  TypeScript adds strong typing (and all sorts of other strong guarantees) and has been around for a while.  However I would not say that TypeScript has significantly impacted the security/infosec situation.

I think my prediction is that Copilot does not significantly affect the computer security/infosec situation.

It's worth separating out that this line of research -- in particular training large language models on code data -- probably has a lot more possible avenues of impact than a code completer in VS Code.  My prediction is not about the sum of all large language models trained on code data.

I also do think we agree that it would be good if models always produced the code-we-didnt-even-know-we-wanted, but for now I'm a little bit wary of models that can do things like optimize code outside of our ability to notice/perceive.

Replies from: vanessa-kosoy, Taran
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-06-30T19:56:26.899Z · LW(p) · GW(p)

Are you saying that (i) few people will use copilot, or (ii) many people will use copilot but it will have little effect on their outputs or (iii) many people will use copilot and it will boost their productivity a lot but will have little effect on infosec? Your examples sound more like supporting i or ii than supporting iii, but maybe I'm misinterpreting.

Replies from: alex-ray
comment by Alex Ray (alex-ray) · 2021-07-01T00:48:30.227Z · LW(p) · GW(p)

I think all of those points are evidence that updates me in the direction of the null hypothesis, but I don't think any of them is true to the exclusion of the others.

I think a moderate number of people will use copilot.  Cost, privacy, and the need for an internet connection will limit this.

I think copilot will have a moderate effect on users' outputs.  I think it's the best new programming tool I've used in the past year, but I'm not sure I'd trade it for, e.g., interactive debugging (a reference example of a very useful programming tool).

I think copilot will have no significant differential effect on infosec, at least at first.  The same way I think the null hypothesis should be a language model produces average language, I think the null hypothesis is a code model produces average code (average here meaning it doesn't improve or worsen the infosec situation that jim is pointing to).

In general these lead me to putting a lot of weight on 'no significant impact' in aggregate, though I think it is difficult for anything to have a significant impact on the state of computer security.

(Some examples come to mind: Snowden leaks (almost definitely), Let's Encrypt (maybe), HTTPS Everywhere (maybe), Domain Authentication (maybe))

Replies from: mtrazzi
comment by Michaël Trazzi (mtrazzi) · 2021-07-01T21:42:30.548Z · LW(p) · GW(p)

Summary of the debate

1. jim originally said that copilot produces code with vulnerabilities, which, if used extensively, could generate loads of vulnerabilities, giving more opportunities for exploits overall. jim says it would worsen infosec "significantly"

2. alex responds that, given that the model tries to reproduce the code it was trained on, it will (by def.) produce average-level code (with an average level of vulnerability), so it won't change the situation "significantly", as the % of vulnerabilities per line of code produced (in the world) won't change much

3. vanessa asks whether the predicted absence of change from copilot results from a) lack of use, b) no change in the speed or vulnerability of code production from using it (i.e. used as some fun help but without strong influence on the safety of the code, as people would still be rigorous), or c) a change in speed/productivity, but not in the % of vulnerabilities

4. alex answers that it indeed makes users more productive and helps him a lot, but that this doesn't affect overall infosec in terms of % of vulnerabilities (same argument as 2). He nuances his claim a bit, saying that a) it would moderately affect outputs, b) some things like cost will limit how much it affects those, and c) it won't change things substantially, at least at first (a conjunction of two conditions).

What I think is the implicit debate

i) I think jim kind of implicitly assumes that whenever someone writes code by himself, he is forced to have good security habits etc., and that whenever the code is automatically generated, people won't use their "security" muscles that much & will assume the AI produced clean work... which apparently (given jim's examples) it does not do by default. Like a Tesla not being safe enough at self-driving.

ii) I think what's missing from the debate is that the overall "infosec level" depends heavily on what a few key actors decide to do, those being the people in charge of safety-critical codebases for society-level tools (like nukes). So one argument could be that, although the masses might be more productive for prototyping etc., the actual infosec people might be just as careful / not use it, so the infosec that matters won't change, and thus the overall infosec won't change.

iii) I think vanessa's point kind of re-states i) and disagrees with ii) by saying that everyone will use this anyway? Because by definition, if it's useful, it will change their code/habits; otherwise it's not useful?

iv) I guess alex's implicit points are that code generation with language models producing average human code was going to happen anyway, & that calling it a significant change is an overstatement, & we should probably just assume no drastic change in the % vulnerability distribution, at least for now.

Replies from: Taran
comment by Taran · 2021-07-02T08:24:00.211Z · LW(p) · GW(p)

I think jim kind of implicitly assumes that whenever someone writes code by himself, he is forced to have good security habits etc.,

This part I think is not quite right.  The counterfactual jim gives for Copilot isn't manual programming, it's StackOverflow.  The argument is then: right now StackOverflow has better methods for promoting secure code than Copilot does, so Copilot will make the security situation worse insofar as it displaces SO.

comment by Taran · 2021-06-30T10:06:12.410Z · LW(p) · GW(p)

I think my prediction is that Copilot does not significantly affect the computer security/infosec situation.

This is my prediction too, but there are two strands to the argument that I think are worth teasing apart:

First, how many people will use Copilot?  The base rate for infosec impact of innovations is very low, because most innovations are taken up slowly or not at all.  TypeScript is typical: most people who could use TypeScript use JavaScript instead (see for example the TIOBE rankings), so even if TypeScript prevented all security problems it couldn't impact the overall security situation much.  Garbage collection is another classic example: it was in production systems in the late 60s, but didn't become mainstream until the 90s with the rise of Java and Perl.  There was a span of 20+ years where GC didn't much affect the use-after-free landscape, even though GC prevents 100% of use-after-free bugs.

(counterpoint: StackOverflow was also an innovation, it was taken up right away, and Copilot is more like StackOverflow than it is like a traditional technical innovation.  I don't really buy this because Copilot seems like it'll be much harder to get started with even once it's out of beta)

Second, are users of Copilot more or less likely to write security bugs?  Here my prediction points the other way: Copilot does generate security bugs, and users are unusually unlikely to catch them because they'll tend to use it in domains they're unfamiliar with.  Somewhat more weakly I think it'll be worse than the counterfactual where they don't have Copilot and have to use something else, for the reasons jimrandomh lists.

I'm curious whether you see the breakdown the same way, and if so, how you see the impact of Copilot conditional on its being widely adopted.

comment by paulfchristiano · 2021-06-30T00:27:39.393Z · LW(p) · GW(p)

(I can't speak to any details of Copilot or Codex and don't know much about computer security; this is me speaking as an outside alignment researcher.)

A first pass of improving the situation would be to fine-tune the model for quality, with particular attention to security (both using higher-quality demonstrations and eventually RL fine-tuning). This is an interesting domain from an alignment perspective because (i) I think it's reasonable to aim at narrowly superhuman performance [AF · GW] even with existing models, (ii) security is a domain where you care a lot about rare failures and could aim for error rates close to 0 for many kinds of severe failures.

comment by Dustin · 2021-06-30T21:28:26.788Z · LW(p) · GW(p)

My gut feeling is that the security vulnerabilities that Copilot introduces here are fairly commonly done by your average JS programmer without Copilot.

  1. Non-https is the URL that the service's docs tell you to use in their examples. And, as johnfuller notes, the site has a bad https certificate.
  2. I encounter the incorrect usage of body constantly.

I'm not exactly sure what this means for your prediction.

On the one hand, it's wrong code.  On the other hand, there's a lot of unknowns right now. 

  • Will people who would've otherwise written the right code get lulled into complacency and just accept whatever Copilot tells them?  
  • Maybe Copilot will on average write more secure code than the average developer? 
  • If so, maybe that outweighs the number of people lulled into complacency? 
  • Is that snippet suggested because that code base is already using it in other places, or is that just the generic snippet Copilot inserts for doing POSTs?
Replies from: paulfchristiano
comment by paulfchristiano · 2021-07-01T01:24:06.061Z · LW(p) · GW(p)

I do think that the quality of code output by ML systems will increase much more rapidly than the quality of code output by human engineers. It's just not that hard to do things like generating millions of completions and fine-tuning the model to avoid all the common security problems, whereas the analogous education project with human engineers is quite challenging.
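A crude stand-in for that screening step can be sketched as follows (purely illustrative: the rule names and patterns here are mine, not anything OpenAI has described):

```typescript
// Screen model completions with cheap lint rules before accepting them,
// as a lightweight stand-in for the fine-tuning loop described above.
const lintRules: Array<{ name: string; pattern: RegExp }> = [
  // Flags plain-http URLs inside string literals:
  { name: "insecure-http-url", pattern: /["'`]http:\/\// },
  // Flags raw template interpolation into a request body:
  { name: "raw-interpolation-in-body", pattern: /body:\s*`[^`]*\$\{/ },
];

function securityFlags(completion: string): string[] {
  return lintRules.filter((r) => r.pattern.test(completion)).map((r) => r.name);
}

const suggestion =
  'const r = await fetch("http://example.com", { body: `text=${text}` });';
// securityFlags(suggestion) trips both rules.
```

Real systems would of course use far richer signals than regexes, but even this shape shows why the ML loop scales better than retraining humans: the checker runs on millions of completions for free.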

comment by johnfuller · 2021-06-30T13:31:01.536Z · LW(p) · GW(p)

If you go to the site the URL points to, you'll see that the docs give you the address that was used in your example. I tried sending a request over https and it works, but the certificate is expired. My guess is that this site hasn't been getting much attention from its administrators.

 I don't know how the service determines what code to write. If you were telling me what you wanted to do, you might also need to tell me what sort of validation you would need. Are you placing this text in a database? Are you directly returning the text to the user to appear in a browser? Are you stuffing the text into a log file? Different uses may require different validation. 
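For instance (my own sketch; these helpers are hypothetical, not part of any service), the same untrusted string needs different treatment depending on its destination:

```typescript
// For rendering in a browser: neutralize markup characters so the
// text can't become live HTML.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// For a log file: strip newlines so input can't forge extra log lines.
function sanitizeForLog(text: string): string {
  return text.replace(/[\r\n]+/g, " ");
}

escapeHtml("<script>alert(1)</script>"); // "&lt;script&gt;alert(1)&lt;/script&gt;"
sanitizeForLog("line one\nFAKE ADMIN LOGIN"); // "line one FAKE ADMIN LOGIN"
```

A database destination, by contrast, would call for parameterized queries rather than string escaping.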

The service shouldn't stop people from reviewing the code. I suppose as with self driving tech, people will still shoot themselves in the foot.

To answer the original question: this is a trend which is only going to become more powerful and isn't going anywhere. Most programmers will likely need to set aside some time to see how this service works and how it may help.

At this point, I'm looking at this service as being another way to generate boilerplate code. But the caveat is that you'll have to spend more time reviewing code, as typical boilerplate is static. If you read the source of boilerplate generators, you'll know what you will get. Though you shouldn't trust any code you didn't write, you might be okay with somehow not looking closely at your generated code (validation and other security measures would likely still need to be added.)

I could see this service moving the line of code quality to a higher level. Most programmers make worse mistakes than those you'll see from GitHub Copilot. Many of these programmers would benefit from seeing a generated second opinion. At the top level of expertise, I doubt this would get much usage. Beginners use boilerplate. At the highest levels of expertise, boilerplate just gets in the way.

answer by jimrandomh · 2021-06-29T19:07:53.877Z · LW(p) · GW(p)

One of the major, surprising consequences of this is that it's likely to become infeasible to develop software in secret. If AI tools like this confer a large advantage, and the tools are exclusively cloud hosted, then maintaining the confidentiality of a code base will mean forgoing these tools, and if the impact of these tools is large enough, then forgoing them may not be feasible.

On the plus side, this will put malware developers at something of a disadvantage. It might increase the feasibility of preventing people from running dangerous AI experiments (which sure seem like they're getting closer by the day). On the minus side, this creates an additional path to internet takeover, and increases centralization of power in general.

comment by Jonathan_Graehl · 2021-06-29T20:14:47.190Z · LW(p) · GW(p)

surely private installations of the facility will be sold to trade-secret-protecting teams

comment by Dustin · 2021-06-29T20:44:33.822Z · LW(p) · GW(p)

I wonder how much closed-source software is hosted on non-public GH repos? GH Enterprise exists, and seems widely-used.

comment by capybaralet · 2021-06-30T20:08:11.194Z · LW(p) · GW(p)

One of the major, surprising consequences of this is that it's likely to become infeasible to develop software in secret. 

Or maybe developers will find good ways of hacking value out of cloud hosted AI systems while keeping their actual code base secret.  e.g. maybe you have a parallel code base that is developed in public, and a way of translating code-gen outputs there into something useful for your actual secret code base.


answer by Dustin · 2021-06-29T19:25:11.470Z · LW(p) · GW(p)

It entirely depends on how good it is.

I gave Kite a real go for a month and kept hoping it would get better.  It didn't and I stopped using it because it hindered more than it helped.


(Also, I bet the folks at Kite are not happy right now!)

answer by DanB · 2021-07-01T21:01:03.355Z · LW(p) · GW(p)

My prediction is that the main impact is to make it easier for people to throw together quick MVPs and prototypes. It might also make it easier for people to jump into new languages or frameworks.

I predict it won't impact mainstream corporate programming much. The dirty secret of most tech companies is that programmers don't actually spend that much time programming. If I only spend 5 hours per week writing code, cutting that time down to 4 hours while potentially reducing code quality isn't a trade anyone will really want to make.

answer by spawley · 2021-07-02T10:02:38.884Z · LW(p) · GW(p)

I expect its impact will mostly be in helping junior devs get around the "blank page" problem quickly. I don't see it having much impact beyond that, not with GPT-3 being the backend.

answer by Dustin · 2021-06-30T21:37:57.702Z · LW(p) · GW(p)

If Copilot improves programming significantly enough, it might be a huge blow to makers of other IDEs and text editors unless GitHub provides an API for others to use. I don't expect re-implementing Copilot-esque prediction is in the wheelhouse of places like JetBrains or any of the open source editors.

According to the FAQ they're focusing on VS Code which comes from the same parent company as Github (Microsoft).

answer by Razied · 2021-06-30T21:17:19.546Z · LW(p) · GW(p)

I expect that it will help with the tedious parts of programming that require you to know details about specific libraries. It will make the parts of programming that are "google -> stackoverflow -> find correct answer -> copy-paste into code" much faster, but I expect it to struggle on truly novel stuff, on scientific programming, and on anything research-level. If this thing can auto-complete a new finite-element PDE solver in a way that isn't useless, I will be both impressed and scared.


Comments sorted by top scores.

comment by ChristianKl · 2021-06-30T16:12:20.058Z · LW(p) · GW(p)

I don't really understand why code generation is the focus. I would expect that code review is an area where automated processes that aren't fully reliable would generate more value.

Replies from: smountjoy
comment by smountjoy · 2021-06-30T20:20:50.575Z · LW(p) · GW(p)

Agreed, but code generation is a more natural fit for a GPT-style language model. GPT-3 and Codex use massive training sets; I would guess that the corpus of human code reviews is not nearly so big.