August 2020 newsletter 2020-09-01T21:04:58.299Z · score: 24 (7 votes)
July 2020 newsletter 2020-08-20T16:39:27.202Z · score: 27 (6 votes)
June 2020 newsletter 2020-07-02T14:19:08.696Z · score: 15 (3 votes)
GPT-3 Fiction Samples 2020-06-25T16:12:05.422Z · score: 61 (19 votes)
May newsletter (w/GPT-3 commentary) 2020-06-02T15:40:37.155Z · score: 32 (12 votes)
OpenAI announces GPT-3 2020-05-29T01:49:04.855Z · score: 65 (33 votes)
"AI and Efficiency", OA (44✕ improvement in CNNs since 2012) 2020-05-05T16:32:20.335Z · score: 48 (14 votes)
April 2020 newsletter 2020-05-01T20:47:44.867Z · score: 10 (1 votes)
March 2020 newsletter 2020-04-03T02:16:02.871Z · score: 12 (2 votes)
February 2020 newsletter 2020-03-04T19:05:16.079Z · score: 14 (3 votes)
January 2020 newsletter 2020-01-31T18:04:21.945Z · score: 19 (4 votes)
Subscripting Typographic Convention For Citations/Dates/Sources/Evidentials: A Proposal 2020-01-08T22:20:20.290Z · score: 59 (14 votes)
Dec 2019 newsletter 2020-01-04T20:48:48.788Z · score: 16 (6 votes)
Nov 2019 newsletter 2019-12-02T21:16:04.846Z · score: 14 (4 votes)
October 2019 newsletter 2019-11-14T20:26:34.236Z · score: 12 (3 votes)
September 2019 newsletter 2019-10-04T16:44:43.147Z · score: 22 (4 votes)
"AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence", Clune 2019 2019-09-10T21:33:08.837Z · score: 14 (4 votes)
August 2019 newsletter (popups.js demo) 2019-09-01T17:52:01.011Z · score: 12 (4 votes)
"Designing agent incentives to avoid reward tampering", DeepMind 2019-08-14T16:57:29.228Z · score: 29 (9 votes)
July 2019 newsletter 2019-08-01T16:19:59.893Z · score: 24 (5 votes)
How Should We Critique Research? A Decision Perspective 2019-07-14T22:51:59.285Z · score: 49 (12 votes)
June 2019 newsletter 2019-07-01T14:35:49.507Z · score: 30 (5 votes)
On Seeing Through 'On Seeing Through: A Unified Theory': A Unified Theory 2019-06-15T18:57:25.436Z · score: 27 (7 votes)
On Having Enough Socks 2019-06-13T15:15:21.946Z · score: 21 (6 votes)
May newsletter 2019-06-01T17:25:11.740Z · score: 17 (5 votes)
"One Man's Modus Ponens Is Another Man's Modus Tollens" 2019-05-17T22:03:59.458Z · score: 34 (5 votes)
April 2019 newsletter 2019-05-01T14:43:18.952Z · score: 11 (2 votes)
Recent updates to (2017–2019) 2019-04-28T20:18:27.083Z · score: 36 (8 votes)
"Everything is Correlated": An Anthology of the Psychology Debate 2019-04-27T13:48:05.240Z · score: 49 (7 votes)
March 2019 newsletter 2019-04-02T14:17:38.032Z · score: 19 (3 votes)
February newsletter 2019-03-02T22:42:09.490Z · score: 13 (3 votes)
'This Waifu Does Not Exist': 100,000 StyleGAN & GPT-2 samples 2019-03-01T04:29:16.529Z · score: 39 (12 votes)
January 2019 newsletter 2019-02-04T15:53:42.553Z · score: 15 (5 votes)
"Forecasting Transformative AI: An Expert Survey", Gruetzemacher et al 2019 2019-01-27T02:34:57.214Z · score: 17 (8 votes)
"AlphaStar: Mastering the Real-Time Strategy Game StarCraft II", DeepMind [won 10 of 11 games against human pros] 2019-01-24T20:49:01.350Z · score: 62 (23 votes)
Visualizing the power of multiple step selection processes in JS: Galton's bean machine 2019-01-12T17:58:34.584Z · score: 27 (8 votes)
Littlewood's Law and the Global Media 2019-01-12T17:46:09.753Z · score: 37 (8 votes)
Evolution as Backstop for Reinforcement Learning: multi-level paradigms 2019-01-12T17:45:35.485Z · score: 18 (4 votes)
December newsletter 2019-01-02T15:13:02.771Z · score: 20 (4 votes)
Internet Search Tips: how I use Google/Google Scholar/Libgen 2018-12-12T14:50:30.970Z · score: 54 (13 votes)
November 2018 newsletter 2018-12-01T13:57:00.661Z · score: 35 (8 votes)
October links 2018-11-01T01:11:28.763Z · score: 31 (8 votes)
Whole Brain Emulation & DL: imitation learning for faster AGI? 2018-10-22T15:07:54.585Z · score: 15 (5 votes)
New /r/gwern subreddit for link-sharing 2018-10-17T22:49:36.252Z · score: 46 (14 votes)
September links 2018-10-08T21:52:10.642Z · score: 18 (6 votes)
Genomic Prediction is now offering embryo selection 2018-10-07T21:27:54.071Z · score: 39 (14 votes)
August links 2018-09-25T15:57:20.808Z · score: 18 (5 votes)
July newsletter 2018-08-02T13:42:16.534Z · score: 24 (8 votes)
June newsletter 2018-07-04T22:59:00.205Z · score: 36 (8 votes)
May newsletter 2018-06-01T14:47:19.835Z · score: 73 (14 votes)


Comment by gwern on Where is human level on text prediction? (GPTs task) · 2020-09-21T15:17:22.252Z · score: 2 (1 votes) · LW · GW

It's in the figure.

Comment by gwern on Where is human level on text prediction? (GPTs task) · 2020-09-21T03:08:01.301Z · score: 7 (5 votes) · LW · GW

Might as well finish out this forecasting exercise...

If we assume compute follows the current trend of peak AI project compute doubling every 3.4 months, then 2.2e6× more compute would be log2(2.2e6) ≈ 22 doublings away - or 22*(3.4/12) ≈ 6.2 years, i.e. ~2027. (Seems a little unlikely.)
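A quick sanity check of that arithmetic (the 3.4-month doubling time and the 2.2e6× multiplier are the figures from this thread, not independent measurements):

```python
import math

compute_multiplier = 2.2e6     # estimated compute gap to human-level (from this thread)
doubling_time_months = 3.4     # peak-AI-project compute doubling trend

doublings = math.ceil(math.log2(compute_multiplier))   # log2(2.2e6) ~= 21.1 -> 22
years = doublings * doubling_time_months / 12          # ~6.2 years, i.e. ~2027
```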

Going the other direction, Hernandez & Brown 2020's estimate is that, net of hardware & algorithmic progress, the cost of a fixed level of performance halves every 16 months; so if GPT-3 cost ~$5m in early 2020, then it'll cost $2.5m around mid-2021, and so on. Similarly, a GPT-human requiring 2.2e6× more compute would presumably cost on the order of $10 trillion in 2020, but after 14 halvings (18 years) would cost $1b in 2038.
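The cost-decline side can be checked the same way (the ~$5m GPT-3 cost, the 16-month halving time, and the 2.2e6× multiplier are all figures quoted above):

```python
import math

gpt3_cost = 5e6               # ~$5m to train GPT-3, early 2020 (figure quoted above)
compute_multiplier = 2.2e6    # compute gap to a hypothetical GPT-human
halving_time_months = 16      # Hernandez & Brown 2020: fixed-performance cost halves every 16 months

gpt_human_cost = gpt3_cost * compute_multiplier         # ~$1.1e13, "on the order of $10 trillion"
halvings = math.ceil(math.log2(gpt_human_cost / 1e9))   # halvings until the cost reaches ~$1b
years = halvings * halving_time_months / 12             # ~18.7 years, i.e. ~2038
```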

Metaculus currently seems to be roughly in between 2027 and 2038 right now, incidentally.

Comment by gwern on Why GPT wants to mesa-optimize & how we might change this · 2020-09-21T00:57:28.696Z · score: 2 (1 votes) · LW · GW

It still is; it's just that beam search (and other search strategies) seems to be useful mostly for closed-ended short text generation. Translating a sentence apparently has enough right-or-wrong-ness to it that beam search triggers no pathologies there - but the pathologies get exposed in open-ended longform generation.

Comment by gwern on Where is human level on text prediction? (GPTs task) · 2020-09-20T19:35:00.580Z · score: 7 (5 votes) · LW · GW

It's probably a lower bound. These datasets tend to be fairly narrow by design. I'd guess it's more than 2x across all domains globally. And cutting the absolute loss by 50% will be quite difficult. Even increasing the compute by 1000x only gets you about half that under the best-case scenario... Let's see, to continue my WebText crossentropy example, 1000x reduces the loss by about a third, so if you want to halve it (we'll assume that's about the distance to human performance on WebText) from 1.73 to 0.86, you'd need (2.57 * (3.64 * (10^3 * x))^(-0.048)) = 0.86 where x = 2.2e6 or 2,200,000x the compute of GPT-3. Getting 2.2 million times more compute than GPT-3 is quite an ask over the next decade or two.
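Plugging the numbers into that WebText cross-entropy curve verifies the algebra (the constants 2.57, 3.64e3 petaflop/s-days, and the −0.048 exponent are as used in the comment above, nothing new):

```python
def webtext_loss(x):
    # WebText cross-entropy as a function of x = compute multiple of GPT-3
    # (GPT-3 ~= 3.64e3 petaflop/s-days; constants as used in the comment above)
    return 2.57 * (3.64e3 * x) ** -0.048

gpt3_loss = webtext_loss(1)         # ~1.73
loss_1000x = webtext_loss(1e3)      # ~1.24: 1000x compute cuts the loss by about a third
x_needed = ((0.86 / 2.57) ** (-1 / 0.048)) / 3.64e3    # ~2.2e6x to halve it to 0.86
```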

Comment by gwern on Where is human level on text prediction? (GPTs task) · 2020-09-20T18:22:25.349Z · score: 6 (4 votes) · LW · GW

Looking more into reported perplexities, the only benchmark which seems to allow direct comparison of human vs GPT-2 vs GPT-3 is LAMBADA.

LAMBADA was benchmarked at a GPT-2 perplexity of 8.6, and a GPT-3 perplexity of 3.0 (zero-shot) & 1.92 (few-shot). OA claims in their GPT-2 blog post (but not the paper) that human perplexity is 1-2, but provides no sources and I couldn't find any. (The authors might be guessing based on how LAMBADA was constructed: examples were filtered by whether two independent human raters provided the same right answer.) Since LAMBADA is a fairly restricted dialogue dataset, although constructed to be difficult, I'd suggest that humans are much closer to 1 than 2 on it.

So overall, it looks like the best guess is that GPT-3 continues to have somewhere around twice the absolute error of a human.

Comment by gwern on Mati_Roy's Shortform · 2020-09-20T17:35:14.399Z · score: 7 (4 votes) · LW · GW

and error and hyperparameter tuning that would probably increase the cost several-fold.

All of which was done on much smaller models and GPT-3 just scaled up existing settings/equations - they did their homework. That was the whole point of the scaling papers, to tell you how to train the largest cost-effective model without having to brute force it! I think OA may well have done a single run and people are substantially inflating the cost because they aren't paying any attention to the background research or how the GPT-3 paper pointedly omits any discussion of hyperparameter tuning and implies only one run (eg the dataset contamination issue).

Comment by gwern on Where is human level on text prediction? (GPTs task) · 2020-09-20T17:22:34.430Z · score: 8 (4 votes) · LW · GW

To simplify Daniel's point: the pretraining paradigm claims that language draws heavily on important domains like logic, causal reasoning, world knowledge, etc; to reach human absolute performance (as measured in prediction: perplexity/cross-entropy/bpc), a language model must learn all of those domains roughly as well as humans do. GPT-3 obviously has not learned those important domains to a human level; therefore, if GPT-3 had reached the same absolute performance as humans anyway, the pretraining paradigm would be falsified, because we'd have created a language model which succeeds at one but not the other. There may still be a way to do pretraining right, but one turns out not to necessarily follow from the other, so you can't just optimize for absolute performance and expect the rest of it to fall into place.

(It would have turned out that language models can model easier or inessential parts of human corpuses enough to make up for skipping the important domains; maybe if you memorize enough quotes or tropes or sayings, for example, you can predict really well while still failing completely at commonsense reasoning, and this would hold true no matter how much more data was added to the pile.)

As it happens, GPT-3 has not reached the same absolute performance; the apparent parity comes from comparing apples & oranges. I was only talking about WebText in my comment there, but Omohundro is talking about Penn Tree Bank & 1BW. As far as I can tell, GPT-3 is still substantially short of human performance.

Comment by gwern on Where is human level on text prediction? (GPTs task) · 2020-09-20T17:06:19.567Z · score: 12 (8 votes) · LW · GW

I think Omohundro is wrong here. His GPT-3 perplexity of 20.5 must be for Penn Tree Bank. However, his 'humans' perplexity of 12 is for a completely different dataset! Tracing his citations from his video leads to Shen et al 2017, which uses the 1 Billion Word Benchmark. 1BW was not reported in the GPT-3 paper because it was one of the datasets affected by contamination and dropped from evaluation.

I've never read the Penn Tree Bank or 1BW so I can't compare. At best, I'd guess that if 1BW is collected from "English newspapers", that's less diverse than the Brown Corpus which goes beyond newspapers, and so perplexities will be lower on 1BW than PTB. However, some searching turned up no estimates for human performance on either PTB or WebText, so I can't guess what the real human vs GPT-3 comparison might be. I'm also a little puzzled what the 'de-tokenizers' are that the Radford GPT paper mentions are necessary for doing the perplexity calculations at all...

(There are a lot of papers estimating English text entropy in terms of bits per character, but because of the BPEs and other differences, I don't know how to turn that into a perplexity which could be compared to the reported GPT-3 performance on Penn Tree Bank/WebText/LAMBADA/etc, which is why I didn't include a human baseline in my comment there - I just don't know.)
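For intuition on why those units are hard to compare, here is the mechanical conversion from bits-per-character to word-level perplexity; the 1.0 bpc and 5.7 characters-per-word figures are illustrative Shannon-style assumptions, not measurements, and BPE tokenization complicates things further:

```python
bits_per_char = 1.0       # assumed human entropy estimate for English (illustrative only)
chars_per_word = 5.7      # average English word length, counting the trailing space

word_perplexity = 2 ** (bits_per_char * chars_per_word)   # ~52 per word
# but GPT perplexities are per-BPE-token, not per-word or per-character,
# so this still isn't directly comparable without knowing tokens-per-word.
```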

So, am I right in thinking that if someone took random internet text and fed it to me word by word and asked me to predict the next word, I'd do about as well as GPT-2 and significantly worse than GPT-3?


Comment by gwern on Why GPT wants to mesa-optimize & how we might change this · 2020-09-20T01:44:04.861Z · score: 6 (3 votes) · LW · GW

Why is beam search missing? One possibility is that GPT-3 already does internal lookahead. OpenAI tried beam search, found it didn't improve text generation, and didn't bother adding it as an option. In other words, GPT-3 is already mesa-optimizing 😲

Beam search has never worked for likelihood-trained NNs, since at least char-RNNs back in 2015. Beam search does trigger repetition and other pathologies in GPT, see "The Curious Case of Neural Text Degeneration", Holtzman et al 2019. And while unlikelihood training seems to help, it's not a silver bullet, and is a bit ad hoc (especially if you think of it in terms of reinforcement learning).

Comment by gwern on Superintelligence via whole brain emulation · 2020-09-20T00:25:23.327Z · score: 5 (2 votes) · LW · GW

Thus, it is probably important to be careful about not accelerating non-WBE neuromorphic AI while attempting to accelerate whole brain emulation. For instance, it seems plausible to me that getting better models of neurons would be useful for creating neuromorphic AIs while better brain scanning would not, and both technologies are necessary for brain uploading, so if that is true, it may make sense to work on improving brain scanning but not on improving neural models.

But what research improves brain imaging but not DL... One thing to point out about whole brain emulation vs 'de novo' AI is that it may be, in practice, nearly impossible to get WBEs without having already, much earlier, kickstarted 'de novo' AI.

If you can successfully scan and run a single whole brain, you got there by extensive brain imaging and brain scanning of much smaller chunks of many brains, and it seems like there is a lot of very transferable knowledge from the structure and activities of a human brain to artificial neural networks, which I dub "brain imitation learning". Not only do ANNs turn out to have fairly similar activation patterns to human brains in some respects (primarily visual cortex stuff), the human brain's activation patterns encode a lot of knowledge about how visual representations work which can be used to learn & generalize. (A particularly interesting example from this month is "Self-Supervised Natural Image Reconstruction and Rich Semantic Classification from Brain Activity", Gaziv et al 2020.) You might consider this a version of the pretraining paradigm or lexical hypothesis - the algorithms of general intelligence, and world knowledge, are encoded in the connectivity and activation patterns of a human brain, and so training on large corpuses of such data to imitate the connectivity & activation patterns will provide an extremely powerful prior/initialization à la GPT-3 pretraining on large text datasets.

So, it is entirely possible that by the time you get to BCIs or whole-brain scanning apparatuses, these are providing high-volume data embeddings or structural/architectural constraints which help push deep learning approaches over the finish line to AGI by providing informative priors & meta-learning capabilities by conditioning on <100% data from many brains. (In fact, if you believe this won't happen, you have to explain what on earth is being done with all this extremely expensive data for decades on end, as it slowly ramps up from scanning insect-sized chunks to full monkey brains before finally an entire human brain is scanned 100% & they flip the giant red switch to make Mr John Smith, test subject #1918, wake up inside a computer. What is everyone doing before that?)

Whatever these DL systems may be, they won't be a single specific person, and they won't come with whatever safety guarantees people think an upload of Mr John Smith would come with, but they will come years or decades before.

Comment by gwern on ricraz's Shortform · 2020-09-18T00:33:50.179Z · score: 3 (2 votes) · LW · GW

Yes, there's something to that, but you have to be careful if you want to use that as an objection. Maybe you wouldn't easily think of it, but that doesn't exclude the possibility of you doing it: you can come up with algorithms you can execute which would spit out Egan-like ideas, like 'emulate Egan's brain neuron by neuron'. (If nothing else, there's always the ol' dovetail-every-possible-Turing-machine hammer.) Most of these run into computational complexity problems, but that's the escape hatch Egan leaves himself (Scott Aaronson has made a similar argument) with caveats like 'given enough patience, and a very large notebook'. Said patience might require billions of years, and the notebook might be the size of the Milky Way galaxy, but those are all finite numbers, so technically Egan is correct as far as that goes.

Comment by gwern on ricraz's Shortform · 2020-09-17T21:56:05.791Z · score: 15 (6 votes) · LW · GW

Equivocation. "Who's 'we', flesh man?" Even granting the necessary millions or billions of years for a human to sit down and emulate a superintelligence step by step, it is still not the human who understands, but the Chinese room.

Comment by gwern on on “learning to summarize” · 2020-09-13T22:03:35.593Z · score: 6 (3 votes) · LW · GW

The latter. I didn't notice it was a link to a different paper, but I think my point stands: the better results in this paper compared to the previous finetuning paper can't be due to adding the KL constraint because they already had one. It has to be something else they changed, like more/better labels or bigger models.

Comment by gwern on on “learning to summarize” · 2020-09-13T17:41:02.620Z · score: 4 (2 votes) · LW · GW

The original paper & codebase definitely had KL penalties on the PPO policy. I spent a fair bit of time fiddling with it and letting it go high to see what adversarial ABC music examples it found in the hopes that it would train the reward model better when I labeled them. Didn't seem to work, it would just find similar and only slightly different examples.

Comment by gwern on on “learning to summarize” · 2020-09-12T18:01:05.221Z · score: 4 (2 votes) · LW · GW

No, not yet. (IMO, the power of differentiability is greatly underused. Everyone is locked into an 'optimize parameters based on data & loss' mindset, and few ever use the alternatives like 'optimize data/trajectory based on parameters & loss' or 'optimize loss based on data/parameters'.)

Comment by gwern on on “learning to summarize” · 2020-09-12T15:01:38.306Z · score: 4 (2 votes) · LW · GW

FWIW, Gwern reports trying OpenAI’s approach and finding the RL side specifically frustrating and unstable; this is pretty normal with RL, and compatible with the reward-model part being very successful in its own domain. It’s not clear whether OpenAI got the RL part to work well because they did something right, or because they have lots of resources and can keep trying over and over until it works.

At the time, I figured that it was probably a sample-efficiency problem: the reward model just wasn't picking up on the subtle esthetics I wanted it to. I see this as supported by their new results: large models are more sample-efficient, so unsurprisingly, it works a lot better - the reward model can finally manage to understand what the preferences are, so it can provide a real signal to the RL training.

They seem to think it has more to do with label quality / better raters, which I didn't think was my problem (who better than me to rate my preferred ABC samples?), but better label quality is sort of like better sample-efficiency; I haven't read the paper in enough detail to see if they ablated model size vs label n vs label quality to get an idea of where the improvement is coming from.

Again, wouldn’t it be nice if we could avoid the need for this thing and just train on the preferences directly

Accept no substitutes! Gradient ascent directly on the differentiable reward/environment model!

Some new links on that topic:

Comment by gwern on Building brain-inspired AGI is infinitely easier than understanding the brain · 2020-09-10T17:18:02.178Z · score: 4 (2 votes) · LW · GW

There are also some links in

Comment by gwern on [AN #116]: How to make explanations of neurons compositional · 2020-09-09T20:49:20.426Z · score: 10 (6 votes) · LW · GW

The composition paper seems to exemplify what I talk about as my intuition for how NNs work. The models are both very small and trained on little data, but image classification seems to be much easier than NLP (which is why the DL revolution came to image classification many years before NLP), so it's enough to train the CNN to have fairly meaningful disentangled representations of the kind we expect; their RNN model, however, continues to grope through relatively superficial associations and tricks, as the text database is relatively tiny. I'd predict that if they analyze much larger networks, like BiT or GPT-3, they'd find much more composition, and much less reliance on polysemanticity, and less vulnerability to easy 'copy-paste' adversarial examples.

Comment by gwern on August 2020 newsletter · 2020-09-08T00:59:51.696Z · score: 2 (1 votes) · LW · GW

Fortunately, it's not as if the IEEE have any special immunity to Sci-Hub.

Comment by gwern on August 2020 newsletter · 2020-09-07T21:48:13.814Z · score: 2 (1 votes) · LW · GW

A lot of the links are spillover from continued editing of my GPT-3 pages, which is probably why.

And yes, it is intentional. The IEEE is remarkably humorless about people hosting PDFs.

Comment by gwern on [AN #115]: AI safety research problems in the AI-GA framework · 2020-09-02T17:30:06.262Z · score: 3 (2 votes) · LW · GW

Is this page completely unreadable for anyone else?

Comment by gwern on interpreting GPT: the logit lens · 2020-09-02T00:05:28.867Z · score: 6 (3 votes) · LW · GW

Related layer visualizations: "Looking for Grammar in All The Right Places".

Comment by gwern on interpreting GPT: the logit lens · 2020-09-02T00:02:46.452Z · score: 4 (3 votes) · LW · GW

Doing it with GPT-3 would be quite challenging just for compute requirements like RAM. You'd want to test this out on GPT-2-117M first, definitely. If the approach works at all, it should work well for the smallest models too.

Comment by gwern on From GPT to AGI · 2020-09-01T15:59:14.240Z · score: 4 (2 votes) · LW · GW

If you tweaked GPT-3 (let's assume the total parameter count remained the same so layers were made a little narrower or somesuch) to have a 30k BPE context, I think the RAM requirements would explode to the point where even the small layers couldn't fit their forward pass onto a single GPU. You can forget about training it too.

Comment by gwern on From GPT to AGI · 2020-08-31T21:31:39.035Z · score: 2 (1 votes) · LW · GW

As explained in the link, that is the size of the context window; past 30k, even TPU pod RAM is too small to run 117M with wider context windows as the RAM usage continues to explode quadratically.

I'm not sure what your calculation is supposed to be.

Comment by gwern on Zibbaldone With It All · 2020-08-31T17:48:08.500Z · score: 2 (1 votes) · LW · GW

Why can't any individual 'item' be shown with context like a dozen lines before/after (eg fading out)?

Comment by gwern on From GPT to AGI · 2020-08-31T16:43:33.884Z · score: 9 (5 votes) · LW · GW

I would expect that with increased model size it will be possible to increase the attention field by a lot without much need for additional AI insight.

It's not model size/parameters; it's the cost of the self-attention at runtime. The number of parameters needed to expand self-attention grows linearly, but the runtime memory consumption goes up quadratically. Even GPT-2-117M can use up to something like 300GB of RAM if you increase the window to 30k. You need more efficient attention mechanisms or alternative architectures.
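The quadratic blowup is easy to see with a back-of-the-envelope calculation; the layer/head counts below are GPT-2-117M's (12 layers, 12 heads), and fp32 plus all layers' attention matrices being resident at once (as in training) are assumptions, so treat the absolute number as an order-of-magnitude sketch:

```python
def attention_bytes(context, n_layers=12, n_heads=12, bytes_per_float=4):
    # one context x context attention matrix per head per layer
    return context ** 2 * n_layers * n_heads * bytes_per_float

mem_1k = attention_bytes(1_024)    # ~0.6 GB
mem_30k = attention_bytes(30_000)  # ~518 GB at fp32: same ballpark as the ~300GB figure above
growth = mem_30k / mem_1k          # ~858x memory from a ~29x longer window: quadratic
```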

Comment by gwern on interpreting GPT: the logit lens · 2020-08-31T16:05:15.293Z · score: 14 (10 votes) · LW · GW

I think this might suggest there is some fundamentally better way to do sampling from GPT models? I'm having trouble writing out the intuition clearly, so I'll leave it for later posts.

Unroll the sampling process: hook up all the individual GPT instances into a single long model, bypass the discretizing/embedding layers to make it differentiable end-to-end, and do gradient ascent to find the sequence which maximizes likelihood conditional on the fixed input.
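As a toy illustration of that idea (emphatically not GPT itself): fix a tiny random "model", relax the discrete sequence into softmax distributions so everything is differentiable, and run gradient ascent on the sequence while the parameters stay frozen. The bigram model and the finite-difference gradient here are stand-ins chosen for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
V, T = 5, 6                    # tiny vocab and sequence length
W = rng.normal(size=(V, V))    # stand-in "language model": fixed bigram transition logits

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def log_likelihood(Z):
    # relax discrete tokens into distributions so the objective is differentiable
    X = softmax(Z)
    total = 0.0
    for t in range(T - 1):
        pred = softmax(W.T @ X[t])            # model's soft next-token distribution
        total += np.log(pred @ X[t + 1] + 1e-9)
    return total

Z = rng.normal(size=(T, V))    # free logits standing in for the generated sequence
initial = log_likelihood(Z)
eps, lr = 1e-4, 0.2
for _ in range(300):           # gradient ascent on the *sequence*; parameters W never change
    G = np.zeros_like(Z)       # (finite differences for brevity; autodiff in practice)
    base = log_likelihood(Z)
    for i in range(T):
        for j in range(V):
            Zp = Z.copy()
            Zp[i, j] += eps
            G[i, j] = (log_likelihood(Zp) - base) / eps
    Z += lr * G
final = log_likelihood(Z)      # higher than `initial`: the sequence itself was optimized
```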

Comment by gwern on Zibbaldone With It All · 2020-08-29T00:20:11.713Z · score: 2 (1 votes) · LW · GW

I call this "anti-spaced repetition": the benefit is from surfacing connections for material you've forgotten (as opposed to reviewing material you still remember so as to strengthen retention). You can optimize time spent reviewing older material by using the spacing effect to estimate things which have been forgotten for the longest - same equation, just optimizing for something else.

Comment by gwern on Zibbaldone With It All · 2020-08-29T00:16:49.215Z · score: 5 (3 votes) · LW · GW

I call this "anti-spaced repetition": the benefit is from surfacing connections for material you've forgotten (as opposed to reviewing material you still remember so as to strengthen retention). You can optimize time spent reviewing older material by using the spacing effect to estimate which things have been forgotten for the longest - same equation, just optimizing for something else.
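A minimal sketch of that selection rule, assuming a simple exponential forgetting curve P(recall) = exp(−t/s); the stability values are invented for illustration:

```python
import math

def recall_prob(days_since_review, stability_days):
    # exponential forgetting curve: P(recall) = exp(-t / s)
    return math.exp(-days_since_review / stability_days)

# (days since last review, memory stability in days) -- invented numbers for illustration
notes = {"axiom": (3, 10.0), "lemma": (40, 10.0), "proof": (40, 80.0)}

# spaced repetition reviews items *before* they're forgotten;
# anti-spaced repetition surfaces the item most likely already forgotten:
most_forgotten = min(notes, key=lambda k: recall_prob(*notes[k]))
```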

Comment by gwern on How hard would it be to change GPT-3 in a way that allows audio? · 2020-08-28T15:40:32.220Z · score: 23 (10 votes) · LW · GW

The stock GPT model isn't suitable for any kind of raw audio, because its dense attention works best at lengths of hundreds to thousands of tokens, while raw audio involves extremely long sequences of millions of tokens at the millisecond level. (A WAV may be scores of megabytes long; even a highly optimized lossy encoding like MP3 or Vorbis is still megabytes for most music.) If you tried, it'd be a failure: 1024 or 2048 tokens would encode at best a few milliseconds of audio, and it's impossible to meaningfully predict based on a few milliseconds; most sounds or phonemes or musical notes are far longer than that! You can use it for very high-level encodings like ABC notation or, if you brute-force it a bit, generate MIDI via ABC; this will let you generate folk- or instrumental-style music with a few instruments in a simple style. (Note the hack that iGPT resorts to, with pixel-encoding, to make even tiny 64px images workable with enormous compute: a 64×64 RGB image is a 'sequence' of length 64×64×3 = 12,288, which is well into the painful territory for dense GPT.)
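The sequence-length arithmetic behind that claim (44.1kHz CD-quality sampling is an assumption here; lower sample rates change the constants but not the conclusion):

```python
sample_rate = 44_100     # CD-quality samples/second (an assumption; telephone-quality is 8kHz)
context = 2_048          # a stock GPT-style dense-attention window

song_tokens = 3 * 60 * sample_rate       # a 3-minute song: ~7.9 million raw-audio "tokens"
window_seconds = context / sample_rate   # well under a tenth of a second of audio per window
```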

MuseNet goes one level below ABC by operating on a MIDI encoding of music. This requires shifting from dense attention to a more scalable attention, in its case, Sparse Transformers, which can handle lengths of tens of thousands with acceptable compute requirements & quality. MuseNet was better but still fairly limited. (Not raw audio, a few instruments, definitely no voices etc.)

Jukebox operates at the raw audio level, and it does this by using much larger models scaled up (<10b parameters), conditioned on lyrics/artist metadata (from n~1m songs, IIRC), and a hybrid architecture: not just Sparse Transformers, but VAE-style codebooks providing discrete embeddings of the music style for more global consistency compared to a pure autoregressive token-by-token approach like GPT/MuseNet. Jukebox is extremely impressive: it generates raw audio, for most genres of music, in the style of specific artists, and it even learns to synthesize singing voices (!). It doesn't quite have the global coherency that GPT or MuseNet samples can achieve, like choruses, because I think its attention window is still de facto limited to something like 20 seconds, which limits learning & long-range coherency; but I think fixing that is just a matter of adding on another layer in the hierarchy and maybe another order of magnitude of parameters, and that would fix much of the remaining quality gap.

Jukebox suggests that if you created a large enough model, you could probably dispense with the VAE part and just use pure Transformers.

Comment by gwern on Are We Right about How Effective Mockery Is? · 2020-08-27T23:43:10.232Z · score: 2 (1 votes) · LW · GW

Almost certainly. And of course, Scott doesn't mention "Richard Dawkins" (the example was /r/atheism) so no wonder my keyword search didn't turn it up... Oh well.

Comment by gwern on Are We Right about How Effective Mockery Is? · 2020-08-27T18:37:05.919Z · score: 19 (10 votes) · LW · GW

I vaguely remember, somewhere on LW or SSC probably, a discussion of Richard Dawkins and the New Atheists - as absurd and risible as their screeds and juvenile mockery of Judeochristian religion were, one could point out that it seemed to work on a lot of people. Apparently there were a lot of people who saw the mockery, saw that there was even semi-plausible arguments against religion, saw that you could be a perfectly successful normal person like Dawkins without being smote by lightning (or ostracization) for saying that God doesn't exist, and that was enough to trigger deconversion; theism was taken so for granted in their local environment that they hadn't really thought about it meaningfully, or they had had growing doubts but no awareness of the debate, or thought they were alone etc, and seeing the New Atheists be a thing mocking religion was enough.

Comment by gwern on On Suddenly Not Being Able to Work · 2020-08-26T15:52:20.558Z · score: 5 (3 votes) · LW · GW

Looks like a ceiling effect: a large fraction of turns just have easy or obvious moves for a player, which everyone is more than capable of solving near-perfectly (they hit the ceiling), so bad conditions don't much affect blunders there (the conditions aren't bad enough to pull an appreciable number of moves/players down below the ceiling into huge blunders); but the bad conditions do still affect the hard moves, and increase the errors on those.

(Imagine drawing a curve with a vertical line at the minimum skill necessary to compete in these. Everything to the left of it is an 'easy' move which all players solve, while to the right are 'hard' moves where players are increasingly likely to make more expensive mistakes. Bad conditions move the curve diagonally up-right: the vertical line remains the same since the players don't change, and the number of moves which flip from 'easy' to 'hard' changes by a relatively small %, as only a few moves cross the line, but all the moves to the right of it become harder and the mistakes increasingly expensive.)

"Assessing Human Error Against a Benchmark of Perfection", Anderson et al 2016, indicates that human GMs match the chess engine's predicted best move about half the time. This suggests that a lot of moves are 'solved' in the sense that either the move is very obvious (the opening book, forced moves), or the baseline of competency at GM level easily handles them - leaving only the other half of hard moves as critical moves which contribute to victory or defeat. Table A.1 seems to imply that ~55% of moves are classified as errors (15k/27k), so seems similar.

Comment by gwern on Forecasting Thread: AI Timelines · 2020-08-25T01:18:30.730Z · score: 26 (11 votes) · LW · GW

I was looking at the NIPS growth numbers last June and I made a joke:

AI researcher anthropics: 'researchers [should] tend to think AI is ~20 years away because given exponential growth of researchers & careers of ~30 years, the final generation of researchers will make up a majority of all researchers, hence, by SSA+Outside View, one must assume 20 years.'

(Of course, I'm making a rather carbon-chauvinistic assumption here that it's only human researchers/researcher-years which matter.)

Comment by gwern on SDM's Shortform · 2020-08-23T15:39:10.050Z · score: 7 (4 votes) · LW · GW

I'm not sure what's going on here - is it the initial prompt saying it was 'testing physical and common sense reasoning'? Was that all it took?

Entirely possible. Other people have mentioned that using any prompt (rather than just plopping the stories in) solves a lot of them, and Summers-stay says that Marcus & Davis did zero prompt programming and had no interest in the question of what prompt to use (quite aside from the lack of BO). I think they found the same thing, which is why they provide the preemptive excuse in the TR writeup:

Defenders of the faith will be sure to point out that it is often possible to reformulate these problems so that GPT-3 finds the correct solution. For instance, you can get GPT-3 to give the correct answer to the cranberry/grape juice problem if you give it the following long-winded frame as a prompt:

Comment by gwern on Will OpenAI's work unintentionally increase existential risks related to AI? · 2020-08-21T19:42:59.146Z · score: 16 (6 votes) · LW · GW

The scaling hypothesis implies that it'll happen eventually, yes: but the details matter a lot. One way to think of it is Eliezer's quip: the IQ necessary to destroy the world drops by 1 point per year. Similarly, to do scaling or bitter-lesson-style research, you need resources * fanaticism < a constant. This constant seems to be very small, which is why compute had to drop all the way to ~$1k before any researchers worldwide were fanatical enough to bother trying CNNs and create AlexNet. Countless entities and companies could have used this 'obvious way to do better than everyone else, for market competitiveness' for years - or decades - beforehand. But they didn't.

For the question of who gets there first, 'a handful of years' is decisive. So this is pretty important if you want to think about the current plausible AGI trajectories, which for many people (even excluding individuals like Moravec, or Shane Legg who has projected out to ~2028 for a long time now), have shrunk rapidly to timescales on which 'a handful of years' represents a large fraction of the outstanding timeline!

Incidentally, it has now been 86 days since the GPT-3 paper was uploaded, or a quarter of a year. Excluding GShard (which as a sparse model is not at all comparable parameter-wise), as far as I know no one has announced any new (dense) models which are even as large as Turing-NLG - much less larger than GPT-3.

Comment by gwern on Matt Botvinick on the spontaneous emergence of learning algorithms · 2020-08-20T23:58:40.867Z · score: 23 (7 votes) · LW · GW

The argument that these and other meta-RL researchers usually make is that (as indicated by the various neurons which fluctuate, and I think based on some other parts of their experiments which I would have to reread to list) what these RNNs are learning is not just a simple play-the-winner heuristic (which is suboptimal, and your suggestion would require only 1 neuron to track the winning arm) but amortized Bayesian inference, where the internal dynamics are learning the sufficient statistics of the Bayes-optimal solution to the POMDP (where you're unsure which of a large family of MDPs you're in): "Meta-learning of Sequential Strategies", Ortega et al 2019; "Reinforcement Learning, Fast and Slow", Botvinick et al 2019; "Meta-learners' learning dynamics are unlike learners'", Rabinowitz 2019; "Bayesian Reinforcement Learning: A Survey", Ghavamzadeh et al 2016, are some of the papers that come to mind. Then you can have a fairly simple decision rule using that as the input (eg Figure 4 of Ortega on a coin-flipping example, which is a setup near & dear to my heart).
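For concreteness, here is a minimal toy sketch (my own illustration, not taken from the cited papers) of why sufficient statistics plus a simple decision rule suffice: for a Bernoulli bandit, the Bayes posterior collapses to per-arm (successes, pulls) counts, over which something like Thompson sampling is exactly such a 'fairly simple decision rule':

```python
import random

def thompson_step(counts, rng=random):
    # counts[arm] = [successes, pulls]: the sufficient statistics of
    # the Beta(1+s, 1+f) posterior over each arm's payout probability.
    samples = [rng.betavariate(1 + s, 1 + (n - s)) for s, n in counts]
    return max(range(len(samples)), key=samples.__getitem__)

def run(true_probs, steps=2000, seed=0):
    rng = random.Random(seed)
    counts = [[0, 0] for _ in true_probs]
    for _ in range(steps):
        arm = thompson_step(counts, rng)
        reward = rng.random() < true_probs[arm]
        counts[arm][0] += reward
        counts[arm][1] += 1
    return counts

counts = run([0.3, 0.7])
# The genuinely better arm ends up pulled far more often, with no
# explicit play-the-winner heuristic anywhere in the code.
```

A meta-trained RNN's hidden state can then be read as learning to track exactly this kind of statistic, with the decision rule learned on top of it.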

To reuse a quote from my backstop essay: as Duff 2002 puts it,

"One way of thinking about the computational procedures that I later propose is that they perform an offline computation of an online, adaptive machine. One may regard the process of approximating an optimal policy for the Markov decision process defined over hyper-states as 'compiling' an optimal learning strategy, which can then be 'loaded' into an agent."

Comment by gwern on Radical Probabilism · 2020-08-20T14:42:49.011Z · score: 21 (7 votes) · LW · GW

Jeffrey's thesis is at

Comment by gwern on Mesa-Search vs Mesa-Control · 2020-08-19T18:58:35.589Z · score: 3 (2 votes) · LW · GW

And the Transformer can recompute whatever function the RNN is computing over its history, no, as I said? Whatever a RNN can do with its potentially limited access to history, a Transformer can recompute with its full access to history as if it were the unrolled RNN. It can recompute that for every bit, generate the next one, and then recompute on the next step with that as the newest part of its history being conditioned on.

Comment by gwern on Which way should we update about uncertain events, over time? · 2020-08-19T18:55:27.053Z · score: 3 (2 votes) · LW · GW

This is one of those questions where there are a lot of possible answers depending on what model you have. It's a kind of order statistics/extreme values and model uncertainty question. So you could use Laplace's law of succession, you could use the hope function, you could try to fit some underlying continuous trend to predict binary outcomes or rates...
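Laplace's law of succession, for instance, is a one-liner: after s successes in n trials, estimate the next-trial success probability as (s+1)/(n+2):

```python
from fractions import Fraction

# Laplace's rule of succession: after s successes in n trials, the
# estimated probability of success on the next trial is (s+1)/(n+2).
def laplace(successes, trials):
    return Fraction(successes + 1, trials + 2)

# The sun has risen every day for 10,000 days:
p = laplace(10_000, 10_000)   # = 10001/10002, high but not certain
# An event never observed in 100 trials still gets nonzero
# probability, unlike the naive 0/100 estimate:
q = laplace(0, 100)           # = 1/102
```

Which answer you get from it versus the hope function or a fitted trend can differ a lot, which is the model-uncertainty point above.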

Comment by gwern on Mesa-Search vs Mesa-Control · 2020-08-19T18:30:12.062Z · score: 3 (2 votes) · LW · GW

Er, maybe your notation is obscuring this for me, but how does that follow? Where is the RNN getting this special randomness from? Why aren't the internal activations of a many-layer Transformer perfectly adequate to first encode, 'storing z', and then transform?

Comment by gwern on Mesa-Search vs Mesa-Control · 2020-08-19T14:25:00.794Z · score: 6 (3 votes) · LW · GW

I'm a little confused as to why there's any question here. Every algorithm lies on a spectrum of tradeoffs from general to narrow. The narrower a class of solved problems, the more efficient (in any way you care to name) an algorithm can be: a Tic-Tac-Toe solver is going to be a lot more efficient than AIXI.

Meta-learning works because the inner algorithm can be far more specialized, and thus, more performant or sample-efficient than the highly general outer algorithm which learned the inner algorithm.

For example, in Dactyl, PPO trains an RNN to adapt to many possible robot hands on the fly in as few samples as possible; it's probably several orders of magnitude faster than online training of an RNN by PPO directly. "Why not just use that RNN for DoTA2, if it's so much better than PPO?" Well, because DoTA2 has little or nothing to do with robotic hands rotating cubes; an algorithm that excels at robot hands will not transfer to DoTA2. PPO will still work, though.

Comment by gwern on Open & Welcome Thread - August 2020 · 2020-08-19T01:54:06.529Z · score: 10 (5 votes) · LW · GW

Many entities have sanity-checking tools. They fail. Many have careful developers. They fail. Many have automated tests. They fail. And so on. Disasters happen because all of those will fail to work every time and therefore all will fail some time. If any of that sounds improbable, as if there would have to be a veritable malevolent demon arranging to make every single safeguard fail or backfire (literally, sometimes, like the recent warehouse explosion - triggered by welders trying to safeguard it!), you should probably read more about complex systems and their failures to understand how normal it all is.

Comment by gwern on Open & Welcome Thread - August 2020 · 2020-08-19T01:13:30.380Z · score: 7 (3 votes) · LW · GW

Even if you disclaim configuration errors or updates (despite this accounting for most of a system's operating lifespan, and human/configuration errors accounting for a large fraction of all major errors at cloud providers etc according to postmortems), an error may still happen too fast to notice. Recall that in the preference learning case, the bug manifested after Christiano et al went to sleep, and they woke up to the maximally-NSFW AI. AlphaZero trained in ~2 hours wallclock, IIRC. Someone working on an even larger cluster commits a change and takes a quick bathroom break...

Comment by gwern on Mesa-Search vs Mesa-Control · 2020-08-18T23:10:29.575Z · score: 29 (10 votes) · LW · GW

But if GPT-3 can accomplish the same things empirically, who cares? GPT-3 is entirely reconstructing the “learned information” from the history, at every step. If it can accomplish so much this way, should we count its lack of recurrence against it?

I think that's exactly it. There's no real difference between a history, and a recurrence. A recurrence is a (lossy) function of a history, so anything a recurrent hidden state can encode, a sufficiently large/deep feedforward model given access to the full history should be able to internally represent as well.

GPT with a context window of 1 token would be unable to do any kind of meta-learning, in much the same way that a RNN with no hidden state (or at its first step with a default hidden state) working one step at a time would be unable to do anything. Whether you compute your meta-learning 'horizontally' by repeated application to a hidden state, updating token by token, or 'vertically' inside a deep Transformer (an unrolled RNN?) conditioned on the entire history, makes no difference aside from issues of perhaps computational efficiency (a RNN is probably faster to run but slower to train) and needing more or less parameters or layers to achieve the same effective amount of pondering time (although see Universal Transformers there).
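A toy illustration of this equivalence (the update function is an arbitrary stand-in, not anything GPT actually computes): a recurrence is a fold over the history, so the final hidden state can always be recomputed from scratch given the full history:

```python
from functools import reduce

def rnn_step(hidden, token):
    # Arbitrary stand-in for a recurrent update h_t = f(h_{t-1}, x_t).
    return (hidden * 31 + token) % 1_000_003

def run_recurrently(history, h0=0):
    h = h0
    for tok in history:            # 'horizontal': update token by token
        h = rnn_step(h, tok)
    return h

def recompute_from_history(history, h0=0):
    # 'vertical': reprocess the entire history in one pass, as a
    # Transformer conditioned on its full context effectively can.
    return reduce(rnn_step, history, h0)

history = [3, 1, 4, 1, 5, 9, 2, 6]
assert run_recurrently(history) == recompute_from_history(history)
```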

Comment by gwern on GPT-3, belief, and consistency · 2020-08-18T17:29:19.449Z · score: 2 (1 votes) · LW · GW

What would be a test? Pulling up a dialogue with Thomas Jefferson and asking his views on the growth of the American federal government?

Comment by gwern on Open & Welcome Thread - August 2020 · 2020-08-15T23:36:44.070Z · score: 4 (2 votes) · LW · GW

Yes. For example: lots of applications use online learning. A programmer flips the meaning of a boolean flag in a database somewhere while not updating all downstream callers, and suddenly an online learner is now actively pessimizing their target metric.
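A hypothetical minimal repro of that failure mode (invented names and numbers, not any real system): an online learner nudges its preference toward whichever action the feedback signal marks as 'good', so if a downstream boolean flag silently flips meaning, the identical update rule starts actively pessimizing the metric.

```python
import random

def online_learner(feedback_is_good, steps=5000, seed=0):
    rng = random.Random(seed)
    prefs = [0.0, 0.0]          # learned preference per action
    true_value = [0.2, 0.8]     # action 1 is genuinely better
    for _ in range(steps):
        a = 0 if prefs[0] >= prefs[1] else 1
        if rng.random() < 0.1:  # occasional exploration
            a = rng.randrange(2)
        success = rng.random() < true_value[a]
        # The only thing that changed downstream is the flag's meaning:
        prefs[a] += 1.0 if feedback_is_good(success) else -1.0
    return prefs

good = online_learner(lambda s: s)         # learns to prefer action 1
flipped = online_learner(lambda s: not s)  # flipped flag: same code now
                                           # learns to prefer the worse action
```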

Comment by gwern on Are we in an AI overhang? · 2020-08-15T22:54:03.945Z · score: 5 (3 votes) · LW · GW

The likelihood loss intersection point is very vague, as they point out: it only weakly locates, for that specific architecture/training method/dataset, a crossover to a slower-scaling curve requiring increasingly more data, anywhere between 10^4 and 10^6 or so. As GPT-3 hits 10^3 and is still dead on the scaling curve, it seems that any crossover will happen toward the higher end rather than the lower. (I suspect part of what's going on there is the doubled context window: as Nostalgebraist notes, their experiments with 1024 ctx strongly suggest that the more context window you have, the more you can learn profitably, so doubling to 2048 ctx probably pushed off the crossover quite a bit. Obviously, they have a long way to go there.) So the crossover itself, much less negative profitability of scaling, may be outside the current 100-1000x being mooted. (I'd also note that I don't see why they are so baffled at the suggestion that a model could overfit in a single epoch. Have they looked at the Internet lately? It is not remotely a clean, stationary, minimal, or i.i.d. dataset, even after cleaning & deduplication.)
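To illustrate the kind of crossover being discussed (coefficients invented for illustration, not the paper's actual fits): two power-law loss curves intersect where a * N^-p = b * N^-q, i.e. N = (a/b)^(1/(p-q)), which is why small uncertainty in the fitted exponents translates into orders of magnitude of uncertainty in the crossover point.

```python
import math

# Invented coefficients: a fast-scaling regime and a hypothetical
# slower-scaling, data-limited regime, as functions of scale N.
a, p = 5.0, 0.10   # fast-scaling regime
b, q = 2.0, 0.05   # slower-scaling regime

def loss_fast(n): return a * n ** -p
def loss_slow(n): return b * n ** -q

# Crossover: solve a * N**-p == b * N**-q for N.
n_cross = (a / b) ** (1 / (p - q))
# Tiny changes to p or q move n_cross by orders of magnitude, so a
# fitted crossover can only be located very loosely.
```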

I also think that given everything we've learned about prompt programming and the large increases in benchmarks like arithmetic or WiC, making arguments from pseudo-lack-of-scaling in the paper's benchmarks is somewhere between foolish and misleading, at least until we have an equivalent set of finetuning benchmarks which should cut through the problem of good prompting (however bad the default prompt is, biasing performance downwards, some finetuning should quickly fix that regardless of meta-learning) and show what GPT-3 can really do.

Comment by gwern on Open & Welcome Thread - August 2020 · 2020-08-15T20:57:32.755Z · score: 9 (5 votes) · LW · GW

It is definitely awkward when that happens. Reward functions are hard.