Comments

Comment by Hailey Collet (hailey-collet) on OpenAI's Sora is an agent · 2024-02-19T15:50:21.484Z · LW · GW

The thing about the simulation capability that worries me most isn't plugging it in as-is, but probing the model, finding where the simulator pieces live, and extracting them. This is obviously complicated, but take something as simple as a linear probe identifying which entire layers are most involved, then initializing a new model for training with those layers integrated, a model which doesn't have to output video. (Obviously your data/task/loss metric would have to ensure those layers get used/updated/not overwritten, but choosing tasks where the simulator would be useful should be enough.) I'm neither qualified to elaborate further nor inclined to do so, but the broad concern here is more efficient application of the simulation capability and integrating it into more diverse models.
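A minimal sketch of what I mean by the probing step, under made-up assumptions: the layer names, activation shapes, and the "simulator-relevant" target here are all hypothetical stand-ins, so this illustrates the idea rather than a real recipe.

```python
# Hypothetical sketch: score each layer by how well a linear probe on its
# activations predicts some simulator-relevant target (e.g. next-frame
# physical state). All shapes and data below are stand-ins for illustration.
import torch
import torch.nn as nn

def probe_score(acts: torch.Tensor, target: torch.Tensor, steps: int = 200) -> float:
    """Fit a linear probe acts -> target; return final fit quality (negative MSE)."""
    probe = nn.Linear(acts.shape[-1], target.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(probe(acts), target)
        loss.backward()
        opt.step()
    return -loss.item()

# activations: {layer_name: (batch, d_model) tensor}, e.g. collected with
# forward hooks on the video model; target: (batch, d_state) labels.
activations = {f"block_{i}": torch.randn(256, 512) for i in range(12)}  # stand-in data
target = torch.randn(256, 8)                                            # stand-in labels

scores = {name: probe_score(a, target) for name, a in activations.items()}
best_layers = sorted(scores, key=scores.get, reverse=True)[:4]
print("layers to transplant into the new model:", best_layers)
```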

Comment by Hailey Collet (hailey-collet) on PaLM-2 & GPT-4 in "Extrapolating GPT-N performance" · 2023-06-01T19:36:02.496Z · LW · GW

The 30,000-ft takeaway I got from this: we're less than ~2 OOM of training compute from 95% performance. Which passes the sniff test, and is also scary/exciting.

Comment by Hailey Collet (hailey-collet) on Forecasting ML Benchmarks in 2023 · 2023-03-28T22:12:11.751Z · LW · GW

93% in 2025 FEELS high, but... the median forecast for 2023 was already low: 83%, and GPT-4 scores 86.4%. If you plot the gap to 100% on MMLU against SOTA training compute in FLOPs (e.g. RoBERTa at 1.32*10^21 FLOPs scores 27.9%, a 72.1% gap; GPT-3.5 at 3.14*10^23, a 30% gap; GPT-4 at ~1.742*10^25, a 13.6% gap), it should take roughly 41x the training compute of GPT-4 to achieve 93.1%... so it totally checks out.
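For anyone who wants to check the arithmetic, the extrapolation is just a log-log fit. A minimal sketch: a three-point least-squares fit gives roughly 55-60x rather than exactly 41x (the multiple is sensitive to how you weight the points), but anything in that range lands inside the 20x-80x band from my note below.

```python
# Fit log10(100% - MMLU score) vs. log10(training FLOPs) and extrapolate
# to a 93.1% score. Data points are the ones quoted above; the GPT-4
# compute figure is my own rough estimate.
import numpy as np

flops = np.array([1.32e21, 3.14e23, 1.742e25])   # RoBERTa, GPT-3.5, GPT-4 (est.)
gap   = np.array([72.1, 30.0, 13.6])             # 100% - MMLU score, in points

slope, intercept = np.polyfit(np.log10(flops), np.log10(gap), 1)

target_gap = 100 - 93.1
flops_needed = 10 ** ((np.log10(target_gap) - intercept) / slope)
print(f"~{flops_needed / flops[-1]:.0f}x GPT-4's training compute for 93.1%")
```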

(My estimate for GPT-4 compute is based on the 1-trillion-parameter leak; the approximate number of V100 GPUs they had (they didn't have A100s, let alone H100s, in hand during the GPT-4 training interval); the possible range of training intervals; scaling laws and the training time ≈ 8TP/nX rule; etc. I ran some Squiggle models over those ranges. It should be taken with a grain of salt, but the final number doesn't change in very meaningful ways under any reasonable assumptions, e.g. it might take 20x or 80x, but it's not going to take 500x the training compute to get to 93%.)
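The deterministic back-of-envelope version of that estimate looks like this; the real thing sampled distributions in Squiggle, so every input below is a single assumed value standing in for a range, not a claim about OpenAI's actual setup.

```python
# Back-of-envelope GPT-4 training compute from fleet size and wall-clock time,
# cross-checked against the time ≈ 8*T*P/(n*X) rule. ALL inputs are assumptions.
n_gpus      = 20_000        # assumed V100 count
x_per_gpu   = 1.12e14       # assumed V100 tensor-core peak, ~112 TFLOPS
utilization = 0.30          # assumed effective utilization
days        = 150           # assumed training interval

total_flops = n_gpus * x_per_gpu * utilization * days * 86_400
print(f"estimated training compute: {total_flops:.3g} FLOPs")

# Cross-check: with total FLOPs ≈ 8*T*P (the 8TP/nX rule rearranged) and the
# leaked P = 1e12, how many training tokens would that budget imply?
P = 1e12
T = total_flops / (8 * P)
print(f"implied tokens at 8TP: {T:.3g}")
```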

Comment by Hailey Collet (hailey-collet) on Is your job replaceable by GPT-4? (as of March 2023) · 2023-03-24T07:59:28.613Z · LW · GW

In the short term, job loss will happen through compression of teams more than anything: some seniors taking on junior work. E.g., if you have a development team of multiple senior and junior engineers, I could see some juniors getting canned with how things are right now. But the restriction to GPT-4 exactly as it is right now is pretty severe; you don't have to train new models to vastly increase its real-world capabilities.

Comment by Hailey Collet (hailey-collet) on Exploring GPT4's world model · 2023-03-24T02:46:25.911Z · LW · GW

5thed, or whatever. I won't assume anything about the author, but the conclusions of this article are nonsense.

Comment by Hailey Collet (hailey-collet) on A chess game against GPT-4 · 2023-03-20T23:53:23.835Z · LW · GW

I had it play hundreds of games against Stockfish, mostly at the lowest skill level, using the API. After a lot of experimentation, I settled on giving it a fresh slate every prompt. The prompt was basically telling it that it was playing chess, what color it was, and the PGN (it did not do as well with the FEN, or with both in either order). If it made invalid moves, the next prompt(s) for that turn added a list of the invalid moves it had attempted. After a few tries I had it forfeit the game.
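For the curious, the per-turn loop looked roughly like this. This is a reconstruction-from-memory sketch, not my actual harness: `ask_model` is a stand-in for whatever chat-completion call you're using, and the retry count matches the setup described above.

```python
# Sketch of the per-turn protocol: fresh prompt each turn with the PGN so far;
# on an illegal move, reprompt with the rejected attempts listed; forfeit after
# a few failures. `ask_model` is a stand-in for the API call (prompt -> str).
import chess

MAX_TRIES = 3

def get_model_move(board: chess.Board, pgn_so_far: str, ask_model) -> chess.Move | None:
    """Return a legal move, or None to signal a forfeit."""
    invalid: list[str] = []
    for _ in range(MAX_TRIES):
        prompt = (
            "You are playing chess as "
            f"{'White' if board.turn == chess.WHITE else 'Black'}.\n"
            f"Game so far (PGN):\n{pgn_so_far}\n"
        )
        if invalid:
            prompt += f"These moves were invalid, do not repeat them: {', '.join(invalid)}\n"
        prompt += "Reply with your next move in SAN."
        reply = ask_model(prompt).strip()
        try:
            return board.parse_san(reply)  # raises ValueError on illegal/unparseable moves
        except ValueError:
            invalid.append(reply)
    return None  # forfeit this game
```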

I had a system set up to rate it, but it wasn't able to complete nearly enough games. As described, it finished maybe 1 in 40. I added a list of all legal moves on the second and third attempts for a turn; it was then able to complete about 1 in 10, and won about half of those. Counting the forfeits as losses and calling this a legal strategy, that's something like a 550 rating, IIRC. But it's MUCH worse in the late-middlegame and endgame, even with the fresh slate every turn. Until that point, including well past any opening book it could possibly have "lossless in its database" (not how it works), it plays much better, subjectively 1300-1400.
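Roughly where the ~550 comes from: treating forfeits as losses gives a score around 0.05 against the engine, and the standard performance-rating formula does the rest. The rating I'm assuming here for Stockfish's lowest skill level is itself a rough guess.

```python
# Rough performance-rating arithmetic behind the ~550 figure. The opponent
# rating for Stockfish's lowest skill level is an assumption.
import math

completed_frac = 1 / 10          # completed about 1 in 10 once legal moves were listed
win_frac_of_completed = 0.5      # won about half of those
score = completed_frac * win_frac_of_completed  # forfeits count as losses -> 0.05

stockfish_low_elo = 1050         # assumed rating of Stockfish at lowest skill
elo_diff = 400 * math.log10(score / (1 - score))
print(f"performance rating ~ {stockfish_low_elo + elo_diff:.0f}")
```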

Comment by Hailey Collet (hailey-collet) on GPT-4 · 2023-03-15T16:12:37.238Z · LW · GW

Ahh, I should have thought of having it repeat the history! Good prompt engineering, will try it out. The GPT-4 gameplay in your lichess study is not bad!

I tried just asking it to play using SAN. I had it explain its moves, which it did well, and it also commented on my (intentionally bad) play. It quickly made a mess of things, though, and clearly lost track of the board state (to the extent it's "tracking" it at all; it's really hard to say exactly how it's playing past the common openings), even though the history should've been in the context window.

Comment by Hailey Collet (hailey-collet) on GPT-4 · 2023-03-14T23:51:08.891Z · LW · GW

How did you play? Just SAN?

Comment by Hailey Collet (hailey-collet) on A proposed method for forecasting transformative AI · 2023-02-12T19:23:42.601Z · LW · GW

I used the first chart, the compute required for GPT-3, and my personal assessment that ChatGPT clearly meets the cutoff for tweet length, very probably meets it for a short blog post (but not by a wide margin), and clearly does not meet it for a research paper, to create my own 75th-percentile estimate for human slowdown of 25-75. It moves P(TAI <= year) = 50% from ~2041 to ~2042, and the 75% point from ~2060 to ~2061. Big changes! 😂

Comment by Hailey Collet (hailey-collet) on A proposed method for forecasting transformative AI · 2023-02-12T17:06:33.992Z · LW · GW

Your assertion that we don't have many ways left to reduce cost per transistor may be true, but it isn't supported by the rest of your comment or your links: reductions in transistor size and similar performance-improving measures are not the only way to improve cost-performance.