↑ comment by Zac Hatfield Dodds (zac-hatfield-dodds) ·
2021-04-07T15:24:55.308Z
Testing in full generality is certainly AGI-complete (and a nice ingredient for recursive self-improvement!), but I think you're overestimating the difficulty of pattern-matching your way to decent tests. Chess used to be considered AGI-complete too; I'd guess testing is more like poetry+arithmetic in that if you can handle context, style, and some details it comes out pretty nicely.
I expect GPT-4 to be substantially better at this 'out of the box' due to
- the usual combination of being larger, generalising better, scaling laws, etc.
- super-linear performance gains on arithmetic-like tasks due to generalisation, with spillover to code-related tasks
- the extra GitHub (and blog, etc.) data is probably pretty helpful given steady adoption since ~2015 or so
Example outputs from Ghostwriter vs GPT-3:
$ hypothesis write gzip.compress
from hypothesis import given, strategies as st
def test_roundtrip_compress_decompress(compresslevel, data):
    value0 = gzip.compress(data=data, compresslevel=compresslevel)
    value1 = gzip.decompress(data=value0)
    assert data == value1, (data, value1)
while GPT-3 tends to produce examples like (first four that I generated just now):
expected = x
result = bytes(gzip(x))
assert bytes(result) == expected

compressed, uncompressed = gzip_and_unzip(xs)
assert is_equal(compressed, uncompressed)

assert gzip(xs) == xs

zipped_xs = gzip(xs)
uncompressed_xs = zlib.decompress(zipped_xs)
assert zipped_xs == uncompressed_xs
So it's clearly 'getting the right idea', even without any fine-tuning at all, but not there yet. It's also a lot worse at this without a natural-language description of the test we want in the prompt:
This document was written by a very smart AI programmer,
who is skilled in testing Python, friendly, and helpful.
Let's start by testing the two properties of sorting a list: the output
is a permutation of the input, and the elements of the output are
result = sorted(xs)
assert set(result) == set(xs)
Our second test demonstrates a round-trip test. Given any string
of bytes, if you use gzip to compress and then decompress, the result
will be equal to the input:
Of course translating natural language into Python code is pretty cool too!
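For comparison, here is a minimal sketch of the two properties the prompt describes (the gzip round-trip and the sorting properties), written so that it runs with only the standard library. This is neither Ghostwriter nor GPT-3 output: the helper names `check_gzip_roundtrip`, `check_sorted` and `run_checks` are hypothetical, and the fixed-seed random loop is a plain stand-in for Hypothesis's `@given` input generation, not a replacement for it.

```python
import gzip
import random
from collections import Counter


def check_gzip_roundtrip(data: bytes, compresslevel: int = 9) -> None:
    # Round-trip property: decompressing the compressed bytes gives back the input.
    compressed = gzip.compress(data, compresslevel=compresslevel)
    assert gzip.decompress(compressed) == data, (data, compresslevel)


def check_sorted(xs: list) -> None:
    result = sorted(xs)
    # Permutation: same elements with the same multiplicities.
    # (Comparing sets alone would not notice a dropped duplicate.)
    assert Counter(result) == Counter(xs)
    # Ordered: every element is <= its successor.
    assert all(a <= b for a, b in zip(result, result[1:]))


def run_checks(trials: int = 200, seed: int = 0) -> int:
    # Hypothetical driver (an assumption for illustration): a seeded random
    # loop standing in for Hypothesis strategies such as st.binary() and
    # st.lists(st.integers()).
    rng = random.Random(seed)
    for _ in range(trials):
        data = bytes(rng.getrandbits(8) for _ in range(rng.randint(0, 64)))
        check_gzip_roundtrip(data, compresslevel=rng.randint(0, 9))
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 10))]
        check_sorted(xs)
    return trials


if __name__ == "__main__":
    run_checks()
    # A property like `assert gzip(xs) == xs` already fails on the empty input,
    # because gzip output always carries a header and trailer.
    assert gzip.compress(b"") != b""
```

With Hypothesis installed, the loop in `run_checks` would instead be something like `@given(st.binary(), st.integers(min_value=0, max_value=9))` on the round-trip check, which also gets you shrinking of failing inputs.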
↑ comment by Ericf ·
2021-04-07T20:53:27.448Z
I object to the characterization that it is "getting the right idea." It seems to have latched on to "given a foo of bar" -> "@given(foo.bar)" and that "assert" should be used, but the rest is word salad, not code.