Linkpost: GitHub Copilot productivity experiment

post by Daniel Kokotajlo (daniel-kokotajlo) · 2022-09-08T04:41:41.496Z · LW · GW · 4 comments

This is a link post for https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/

We recruited 95 professional developers, split them randomly into two groups, and timed how long it took them to write an HTTP server in JavaScript. One group used GitHub Copilot to complete the task, and the other one didn't. We tried to control as many factors as we could: all developers were already familiar with JavaScript, we gave everyone the same instructions, and we leveraged GitHub Classroom to automatically score submissions for correctness and completeness with a test suite. We're sharing a behind-the-scenes blog post soon about how we set up our experiment!

In the experiment, we measured—on average—how successful each group was in completing the task and how long each group took to finish.

  • The group that used GitHub Copilot had a higher rate of completing the task (78%, compared to 70% in the group without Copilot).
  • The striking difference was that developers who used GitHub Copilot completed the task significantly faster: 55% faster than the developers who didn't. Specifically, developers using GitHub Copilot took on average 1 hour and 11 minutes to complete the task, while those who didn't took on average 2 hours and 41 minutes.


My opinion: Because of the usual reasons (publication bias, the replication crisis, the task being "easy," etc.) I don't think we should take this particularly seriously until many more independent experiments have been run. However, it's worth knowing about at least.

Related: https://ai.googleblog.com/2022/07/ml-enhanced-code-completion-improves.html

We compare the hybrid semantic ML code completion of 10k+ Googlers (over three months across eight programming languages) to a control group and see a 6% reduction in coding iteration time (time between builds and tests) when exposed to single-line ML completion. These results demonstrate that the combination of ML and SEs can improve developer productivity. Currently, 3% of new code (measured in characters) is now generated from accepting ML completion suggestions.

4 comments

Comments sorted by top scores.

comment by Erich_Grunewald · 2022-09-08T18:30:50.450Z · LW(p) · GW(p)

This is interesting, though I expect it's an upper bound on Copilot productivity boosts:

  • Writing an HTTP server is a common, clearly defined task which has lots of examples online.
  • JavaScript is a popular language (meaning there's lots of training data for Copilot).
  • I imagine Copilot is better for building a thing from the ground up, whereas the programming most programmers do most days consists of extending, modifying, and fixing existing stuff, meaning more thinking and reading and less typing.
Replies from: Julian Bradshaw
comment by Julian Bradshaw · 2022-09-10T08:47:32.385Z · LW(p) · GW(p)

In my experience it's best at extending things, because it can predict from the context of the file you're working in. If I try to generate code from scratch, it goes off in weird directions I don't actually want pretty quickly, and I constantly have to course-correct with comments.

Honestly I think the whole "build from ground up"/"extending, modifying, and fixing" dichotomy here is a little confused though. What scale are we even talking?

A big part of Copilot's efficiency gains come from very small-scale suggestions, like filling out the rest of a for loop statement. It can usually guess immediately what you want to iterate over. I happened to be on a plane without internet access recently, decided to do a bit of coding anyway, needed to write a for loop, and then was seriously put off by the fact that I couldn't write the whole thing just by pressing Tab. I had to actually think about how a stupid for loop is written! What a waste of time!
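The sort of completion being described looks something like this: given the first line of a for loop, Copilot typically fills in the body from surrounding context. The variable names here are purely illustrative.

```javascript
// Suppose you've already typed the array and the accumulator:
const counts = [3, 1, 4];
let total = 0;

// After typing "for", a completion like the following is the kind of
// small-scale suggestion Copilot tends to offer from context:
for (let i = 0; i < counts.length; i++) {
  total += counts[i];
}
```

Trivial on its own, but as the comment notes, these tiny completions add up across a working day.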

Replies from: Erich_Grunewald
comment by Erich_Grunewald · 2022-09-10T11:41:31.288Z · LW(p) · GW(p)

Honestly I think the whole "build from ground up"/"extending, modifying, and fixing" dichotomy here is a little confused though. What scale are we even talking?

I meant to capture something like "lines of code added/modified per labour time spent", and to suggest that Copilot would reap more benefits the higher that number is (all else equal).

comment by Tao Lin (tao-lin) · 2022-09-08T16:01:34.726Z · LW(p) · GW(p)

This is in line with my experience. However, the fact that the task was an HTTP server is important: I get far more value from Copilot on JavaScript HTTP servers than on other programs, and HTTP servers are a class of software with many no-code options. How long would it have taken them if they were allowed to use pure SQL or a no-code solution?