October 2020 newsletter 2020-11-01T21:38:46.795Z
/r/MLScaling: new subreddit for NN scaling research/discussion 2020-10-30T20:50:25.973Z
"Scaling Laws for Autoregressive Generative Modeling", Henighan et al 2020 {OA} 2020-10-29T01:45:30.666Z
September 2020 newsletter 2020-10-26T13:38:51.107Z
August 2020 newsletter 2020-09-01T21:04:58.299Z
July 2020 newsletter 2020-08-20T16:39:27.202Z
June 2020 newsletter 2020-07-02T14:19:08.696Z
GPT-3 Fiction Samples 2020-06-25T16:12:05.422Z
May newsletter (w/GPT-3 commentary) 2020-06-02T15:40:37.155Z
OpenAI announces GPT-3 2020-05-29T01:49:04.855Z
"AI and Efficiency", OA (44✕ improvement in CNNs since 2012) 2020-05-05T16:32:20.335Z
April 2020 newsletter 2020-05-01T20:47:44.867Z
March 2020 newsletter 2020-04-03T02:16:02.871Z
February 2020 newsletter 2020-03-04T19:05:16.079Z
January 2020 newsletter 2020-01-31T18:04:21.945Z
Subscripting Typographic Convention For Citations/Dates/Sources/Evidentials: A Proposal 2020-01-08T22:20:20.290Z
Dec 2019 newsletter 2020-01-04T20:48:48.788Z
Nov 2019 newsletter 2019-12-02T21:16:04.846Z
October 2019 newsletter 2019-11-14T20:26:34.236Z
September 2019 newsletter 2019-10-04T16:44:43.147Z
"AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence", Clune 2019 2019-09-10T21:33:08.837Z
August 2019 newsletter (popups.js demo) 2019-09-01T17:52:01.011Z
"Designing agent incentives to avoid reward tampering", DeepMind 2019-08-14T16:57:29.228Z
July 2019 newsletter 2019-08-01T16:19:59.893Z
How Should We Critique Research? A Decision Perspective 2019-07-14T22:51:59.285Z
June 2019 newsletter 2019-07-01T14:35:49.507Z
On Seeing Through 'On Seeing Through: A Unified Theory': A Unified Theory 2019-06-15T18:57:25.436Z
On Having Enough Socks 2019-06-13T15:15:21.946Z
May newsletter 2019-06-01T17:25:11.740Z
"One Man's Modus Ponens Is Another Man's Modus Tollens" 2019-05-17T22:03:59.458Z
April 2019 newsletter 2019-05-01T14:43:18.952Z
Recent updates to (2017–2019) 2019-04-28T20:18:27.083Z
"Everything is Correlated": An Anthology of the Psychology Debate 2019-04-27T13:48:05.240Z
March 2019 newsletter 2019-04-02T14:17:38.032Z
February newsletter 2019-03-02T22:42:09.490Z
'This Waifu Does Not Exist': 100,000 StyleGAN & GPT-2 samples 2019-03-01T04:29:16.529Z
January 2019 newsletter 2019-02-04T15:53:42.553Z
"Forecasting Transformative AI: An Expert Survey", Gruetzemacher et al 2019 2019-01-27T02:34:57.214Z
"AlphaStar: Mastering the Real-Time Strategy Game StarCraft II", DeepMind [won 10 of 11 games against human pros] 2019-01-24T20:49:01.350Z
Visualizing the power of multiple step selection processes in JS: Galton's bean machine 2019-01-12T17:58:34.584Z
Littlewood's Law and the Global Media 2019-01-12T17:46:09.753Z
Evolution as Backstop for Reinforcement Learning: multi-level paradigms 2019-01-12T17:45:35.485Z
December newsletter 2019-01-02T15:13:02.771Z
Internet Search Tips: how I use Google/Google Scholar/Libgen 2018-12-12T14:50:30.970Z
November 2018 newsletter 2018-12-01T13:57:00.661Z
October links 2018-11-01T01:11:28.763Z
Whole Brain Emulation & DL: imitation learning for faster AGI? 2018-10-22T15:07:54.585Z
New /r/gwern subreddit for link-sharing 2018-10-17T22:49:36.252Z
September links 2018-10-08T21:52:10.642Z
Genomic Prediction is now offering embryo selection 2018-10-07T21:27:54.071Z


Comment by gwern on Predictions made by Mati Roy in early 2020 · 2020-11-22T15:26:35.287Z · LW · GW

You missed the third baby announced as part of the trial:

Dr. He, 35, sent the scientific world into an uproar last year when he announced at a conference in Hong Kong that he had created the world’s first genetically edited babies — twin girls. On Monday, China’s state media said his work had resulted in a third genetically edited baby, who had been previously undisclosed...“He violated medical ethics, disrespected life and let three poor children bear the consequences, all for his fame and fortune,” one user wrote. “I think this punishment is too light.”

Comment by gwern on Spend twice as much effort every time you attempt to solve a problem · 2020-11-16T18:27:38.179Z · LW · GW

The doubling strategy also has counterparts in array allocation strategies and algorithmic analysis (it's common to double an array's size each time it gets too large, to amortize the copying). Successive Halving, a racing algorithm, is also of interest here if you think of a portfolio of tasks instead of a single task.
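To make the amortization point concrete, here is a minimal sketch that counts how many element copies a growable array performs under two growth rules. The function names are made up for illustration and do not correspond to any particular library's allocator.

```python
def copies_with_doubling(n_appends: int) -> int:
    """Count element copies made by a growable array that doubles when full."""
    capacity, size, copies = 1, 0, 0
    for _ in range(n_appends):
        if size == capacity:
            copies += size   # every existing element moves to the new buffer
            capacity *= 2
        size += 1
    return copies


def copies_with_fixed_step(n_appends: int, step: int = 1) -> int:
    """Same bookkeeping, but the array grows by a constant `step` when full."""
    capacity, size, copies = step, 0, 0
    for _ in range(n_appends):
        if size == capacity:
            copies += size
            capacity += step
        size += 1
    return copies


if __name__ == "__main__":
    n = 1024
    print(copies_with_doubling(n))    # stays below 2 * n
    print(copies_with_fixed_step(n))  # grows quadratically in n
```

Doubling keeps the total copying proportional to the number of appends, whereas fixed-step growth makes it quadratic, which is the same reason doubling effort bounds the total wasted on failed attempts.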

Comment by gwern on When Hindsight Isn't 20/20: Incentive Design With Imperfect Credit Allocation · 2020-11-10T14:46:19.884Z · LW · GW

So this is the 2-of-2 exploding Nash equilibrium technique applied to multiple parties/transactions? What's this generalized kind called?

(On a side note, it now strikes me that there's a parallel to RL blackbox optimization: by setting up a large penalty for any divergence from the golden path, it creates an unbiased, but high variance estimator of credit assignment. When pirates participate in enough rollouts with enough different assortments of pirates, they receive their approximate honesty-weighted return. You can try to pry open the blackbox and reduce variance by taking into account pirate baselines etc, but at the risk of losing unbiasedness if you do it wrong.)

Comment by gwern on the scaling “inconsistency”: openAI’s new insight · 2020-11-09T02:48:53.856Z · LW · GW

I think I see 'domain-specific datasets' as broader than you do. You highlight Github, and yet, when I think of Github, I think of thousands of natural and artificial languages, tackling everything related to software in the world (which is increasingly 'everything'), by millions of people, doing things like uploading banned books for evading the Great Firewall, filing bugs and discussing things back and forth, often adversarially, all reliant on common sense and world knowledge. A GPT trained on Github at hundreds of gigabytes I would expect to induce meta-learning, reasoning, and everything else, for exactly the same reasons CC/books1/books2/WP do; yes, it would know 'source code' well (not a trivial thing in its own right), but that is a mirror of the real world. I see plenty of broad domain coverage from 'just' Github, or 'just' Arxiv. (Literotica, I'm less sure about.) I don't see Github as having much of a disadvantage over CC in terms of broadness or what a model could learn from it. Indeed, given what we know about CC's general quality and how default preprocessing can screw it up (I see a lot of artifacts in GPT-3's output I think are due to bad preprocessing), I expect Github to be more useful than an equivalent amount of CC!

The big advantage of Common Crawl over a Github scrape is that, well, CC already exists. Someone has to invest the effort at some point for all datasets, after all. You can go download pre-cleaned versions of it - aside from EleutherAI's version (which they expect to be substantially better than CC on a byte for byte basis), Facebook and Google recently released big multilingual CC. But of course, now that they've done it and added it to the Pile, that's no longer a problem.

Comment by gwern on the scaling “inconsistency”: openAI’s new insight · 2020-11-09T00:01:12.780Z · LW · GW

Yes, my hypothesis is that active learning should have a different asymptotic because in a lot of admittedly-simple scenarios like logistic regression, active learning has a much nicer asymptotic. Right now, it's not too hard to run in <=1 epoch, and GPT-3 did, and that's using just CC and books1/2/WP. There's loads of other text datasets. (I think someone in EleutherAI was saying that Literotica alone was 500GB...?) Even if active learning 'runs out' of useful data before available compute, that will (a) save a whole lot of compute/time, and (b) tell us explicitly that we've 'used up' the default data and need to revise our approaches.

The Pile is an interesting experiment, but they're mostly adding large quantities of single-domain text like Github, which is great for those domains but won't help outside them.

I disagree. Transfer learning is practically the entire point. 'Blessings of scale' etc.

Filtering for difficulty like that is tricky. In particular the most difficult samples are random noise or Chinese or something that the model can't begin to comprehend.

I would point out that GPT-2 nontrivially, and GPT-3 surprisingly well, understand Chinese. And see my link: GPT-2 is able to filter out garbage really well. It doesn't have to be perfect. Even a ratio of, say, only 99:1 garbage:good data deleted is a big win. You're trying to filter out really egregious horrible nonsense data of the sort that you can't even imagine exists until you've actually waded through the sewer of Common Crawl and understood what garbage data really is out there. (Another fun example is: when you go looking for rare languages in Common Crawl, given the base rate, what do even really good natural-language identifier models pull up for rare languages? Mostly garbage/natural adversarial examples...)

Comment by gwern on the scaling “inconsistency”: openAI’s new insight · 2020-11-07T18:25:33.892Z · LW · GW

This makes sense to me and is what I've been considering as the implication of sample-efficiency (one of the blessings of scale), coming at it from another direction of meta-learning/Bayesian RL: if your model gets more sample-efficient as it gets larger & n gets larger, it's because it's increasingly approaching a Bayes-optimal learner and so it gets more out of the more data, but then when you hit the Bayes-limit, how are you going to learn more from each datapoint? You have to switch over to a different and inferior scaling law. You can't squeeze blood from a stone; once you approach the intrinsic entropy, there's not much to learn. Steeply diminishing returns is built into compiling large text datasets and just training on random samples. It looks like the former is the regime we've been in up to GPT-3 and beyond, and the latter is when the slower data-only scaling kicks in.

Aside from multimodal approaches, the crossover raises the question of whether it becomes time to invest in improvements like active learning. Bayesian RL is so sample-efficient because it actively optimizes choice of data points to acquire, it doesn't just passively acquire ever-more-redundant i.i.d. samples. Active learning is the supervised equivalent, and active learning has different and much better asymptotics than random sampling.

What scaling curve in L(D)/L(C) could we get with even a simple active learning approach like running a small GPT over Common Crawl and throwing out datapoints which are too easily predicted? (A forward pass of something like GPT-2-1.5b will cost <<1% of the cost of forwards+backwards GPT-3, and so on, and is practically free if we consider scaling a GPT-4 to where the crossover used to be. I've suggested this to EleutherAI to optimize their Pile dataset, and even as simple an approach as looking at gzip compression ratios to throw out extremely poorly/highly-compressed data to trim the outliers seems to work fairly well in throwing away spam but not ham; however, they've been too busy getting the first version working to experiment with any real refinements.)

One interesting example from my 'data' category is "Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study", Bahri et al 2020:

Large generative language models such as GPT-2 are well-known for their ability to generate text as well as their utility in supervised downstream tasks via fine-tuning. Our work is twofold: firstly we demonstrate via human evaluation that classifiers trained to discriminate between human and machine-generated text emerge as unsupervised predictors of "page quality", able to detect low quality content without any training. This enables fast bootstrapping of quality indicators in a low-resource setting. Secondly, curious to understand the prevalence and nature of low quality pages in the wild, we conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.

Comment by gwern on Generalize Kelly to Account for # Iterations? · 2020-11-06T00:45:16.439Z · LW · GW

Yes, for some classes of games in some sense... MDP/POMDPs are a very general setting so I don't expect any helpful simple exact answers (although to my surprise there were for this specific game), so I just have qualitative observations that it seems like when you have quasi-investment-like games like the coin-flip game, the longer they run and the higher the cap is, the more the exact optimal policy looks like the Kelly policy because the less you worry about bankruptcy & the glide-in period gets relatively smaller.

I suspect that if the winnings were not end-loaded and you could earn utility in each period, it might look somewhat less Kelly, but I have not tried that in the coin-flip game.

Comment by gwern on Generalize Kelly to Account for # Iterations? · 2020-11-02T18:28:36.368Z · LW · GW

I suggest explicitly stepping outside of an expected-utility framework here.

EV seems fine. You just need to treat it as the multi-stage decision problem it is, and solve the MDP/POMDP. One of the points of my Kelly coin-flip exercises is that the longer the horizon, and the closer you are to the median path, the more Kelly-like optimal decisions look, but the optimal choices look very unKelly-like as you approach boundaries like the winnings cap (you 'coast in', betting much less than the naive Kelly calculation would suggest, to 'lock in' hitting the cap) or you are far behind when you start to run out of turns (since you won't lose much if you go bankrupt and the opportunity cost decreases the closer you get to the end of the game, the more greedy +EV maximization is optimal so you can extract as much as possible, so you engage in wild 'overbetting' from the KC perspective, which is unaware the game is about to end).
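A minimal sketch of what 'solve the MDP' means here, by backward induction over (rounds left, wealth). The win probability, cap, horizon, integer bet sizes, and the objective (expected final wealth, clipped at the cap) are all assumptions picked to keep the example small; they are not the exact settings of the coin-flip exercises.

```python
def solve_coin_flip(p: float = 0.6, cap: int = 100, rounds: int = 20):
    """Backward induction for a capped, even-money betting game.

    State: (rounds left, integer wealth). Action: integer bet 0..wealth.
    Returns (values, policy), each indexed as [rounds_left][wealth].
    """
    values = [[float(w) for w in range(cap + 1)]]  # 0 rounds left: keep wealth
    policy = [[0] * (cap + 1)]
    for _ in range(rounds):
        prev = values[-1]
        cur_values, cur_bets = [], []
        for w in range(cap + 1):
            best_value, best_bet = prev[w], 0      # betting nothing is allowed
            for bet in range(1, w + 1):
                win = prev[min(w + bet, cap)]      # winnings above cap are lost
                lose = prev[w - bet]
                value = p * win + (1 - p) * lose
                if value > best_value:
                    best_value, best_bet = value, bet
            cur_values.append(best_value)
            cur_bets.append(best_bet)
        values.append(cur_values)
        policy.append(cur_bets)
    return values, policy


if __name__ == "__main__":
    values, policy = solve_coin_flip()
    kelly_fraction = 2 * 0.6 - 1   # naive Kelly for an even-money bet, p = 0.6
    for w in (10, 50, 90, 100):
        print(w, "kelly:", round(kelly_fraction * w),
              "last round:", policy[1][w], "20 rounds left:", policy[20][w])
```

With one round left the solver bets everything it can without overshooting the cap, and at the cap it bets nothing, which are the two boundary behaviors described above; comparing the printed columns against the naive Kelly bet shows where the policies diverge.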

Comment by gwern on Living Forever is Hard, part 3: the state of life extension research · 2020-11-01T02:28:05.995Z · LW · GW

As expected, the C60 result has failed to replicate (and comes with bonus toxicity and serious purity concerns in the commercially-available C60). Just another day in the animal-studies Replication Crisis, and another result that was too good to be true.

Comment by gwern on Mesa-Search vs Mesa-Control · 2020-10-31T23:18:22.708Z · LW · GW

Because you don't train the inputs, you're trying to train parameters, but the gradients stop cold there if you just treat them as blackboxes, and this seems like it's abusing the term 'stochastic' (what does the size of minibatches being smaller than the full dataset have to do with this?). I still don't understand what you think Transformers are doing differently vs RNNs in terms of what kind of processing of history they are doing and why Transformers can't meta-learn in the same way as RNNs internally.

Comment by gwern on Mesa-Search vs Mesa-Control · 2020-10-31T18:50:25.343Z · LW · GW

An RNN is deterministic, usually (how else are you going to backprop through it to train it? not too easily), and even if it's not, I don't see why that would make a difference, or why a Transformer couldn't be 'not deterministic' in the same sense given access to random bits (talking about stochastic units merely smuggles in bits by the back door) nor why it can't learn 'Monte Carlo iterations' internally (say, one per head).

Comment by gwern on "Scaling Laws for Autoregressive Generative Modeling", Henighan et al 2020 {OA} · 2020-10-30T14:18:16.210Z · LW · GW

Up to a certain limit; Kaplan covers this in the talk a bit with reference to the RNN scaling curves in Kaplan et al 2020 - RNNs scale similarly to Transformers, with a worse constant in terms of compute, but they make bad use of context. After a few hundred tokens, the history has vanished. This is the usual RNN problem: theoretically, the history is unlimited, but as has been observed long before, the history is de facto limited to a few hundred tokens, while Transformers make effective use of history from thousands of timesteps before.

So I interpret this as meaning that NN architectures are all 'universal' in a sense (they all scale similarly, and I'm told that CNNs do too), but what makes Transformers superior is that they are more compute-efficient on current hardware and they optimize much better because, as 'unrolled RNNs', they are equivalently powerful but they have much more direct access to the history (pace residual layers) which makes the credit assignment/learning much easier than RNNs which must squeeze it all into a hidden state rather than recalculating a function with the entire raw history available.

(Lots of potential followup questions here: can you usefully distill a trained Transformer into a parameter & compute-efficient RNN? Can that provide a training signal to meta-learn RNN algorithms which do fix their history/optimization problems? If Transformers work so well because of raw long-range access to history, are RNNs just missing some 'external memory' module which would serve the same purpose? Do RNNs likewise have general scaling curves over all modalities? Where do Mixture-of-Experts flatline and what is the MoE scaling exponent?)

Comment by gwern on "Scaling Laws for Autoregressive Generative Modeling", Henighan et al 2020 {OA} · 2020-10-30T00:10:00.798Z · LW · GW

My short summary so far:

GPT-3 was not a fluke nor language-specific: all modalities tested---math, video, images, text, combined---scale cleanly and in the same way where bigger models = better; the unsupervised/pretrained models then transfer to supervised learning, like image classification. GPT-3 all the things!

Comment by gwern on Reply to Jebari and Lundborg on Artificial Superintelligence · 2020-10-25T17:11:04.344Z · LW · GW

I skimmed the paper when they announced it on Twitter. It seemed like it fundamentally ignores every possibility vaguely like mesa-optimization or imitation learning, and can't deal with things like, say, GPT-3 meta-learning agency to better predict data derived from agents (ie. humans). They leave themselves an out by handwaving away all such inconveniences as 'iron ore agents', but then it's thoroughly useless and circular; "what's an iron ore agent?" "It's one which has dangerous outcomes due to hidden agency." "OK, which agents are those, how can you tell AlphaZero from GPT-3 from AGI?" "Well, try them and see!"

Comment by gwern on The bads of ads · 2020-10-23T17:12:52.454Z · LW · GW

Market failure

One particularly important market failure in practice might be the lack of awareness by publishers about the loss from advertising; anyone can calculate the immediate revenue from carrying advertising, but the loss in readership is far harder to measure, creating a strong bias toward carrying too much advertising. It's rarely measured, but does seem to be large enough that for many publishers, it would substantially decrease the estimated profit, or even turn it into a loss. (I am pretty sure the latter was the case for me.) Across an entire society, that might translate to quite a lot of deadweight losses due to excess ads.

Comment by gwern on A prior for technological discontinuities · 2020-10-18T16:07:38.525Z · LW · GW

Looking at these examples of "continuities" makes me think that the idea of continuity vs discontinuity is almost completely useless for AI risk, or tech/economics risk in general. The more people dig into developments and come out with some way to show that "actually, it was 'continuous' if you dig through enough archival materials and apply sufficient levels of hindsight and look at enough variables", the less it means anything. By the same arguments, stuff like the Industrial Revolution or coronavirus is 'continuous', because after all, some metric like number of cases follows a smooth continuous line on a graph; but does it feel like it was 'continuous' in the ill-defined sense that people are trying to use 'continuous' to stand for in AI risk discussions? It's hard to see how the fact that some tiny niche of researchers watched a number ramp up over a few weeks or years, without knowing the importance or long-term outcomes, thereby rendering it 'continuous' on some graph years later by third parties, is really all that useful or helpful.

A better question might be, were the effects discontinuous from the perspective of other large groups and particularly their decision-making functions (OODA loops). Like, say, was progress in large ships 'continuous' from the perspective of Tokugawa Japan or the Aztecs?

Comment by gwern on Industrial literacy · 2020-10-14T18:04:30.699Z · LW · GW

By coincidence, my hot water heater recently broke. I expected the cold showers to be the worst part, but it was actually the difficulty cleaning dishes: grease, oil, and fat just wouldn't come off the dishes (or the scrubby doodle, for that matter), despite ample application of soap. Since most of my meals involve those things, I eventually resorted to cleaning what could be cleaned with running cold tap water & soap, and setting an electric kettle to boil to do a second pass to try to melt off the remnants.

It did take longer.

Comment by gwern on Philosophy of Therapy · 2020-10-13T00:22:47.766Z · LW · GW

What I was expecting from the first paragraph was a discussion of whether therapy works. I think people should know that when it's been studied, there's little evidence that talk therapy works better than getting support from a friend, family member, or other trusted person.

Better known as the 'dodo bird verdict'...

Comment by gwern on If GPT-6 is human-level AGI but costs $200 per page of output, what would happen? · 2020-10-09T19:44:43.402Z · LW · GW

(As an aside, in the linked footnote I couldn't easily spot any paper that actually evaluated humans on predicting the next word.)

Third paragraph:

GPT-2 was benchmarked at 43 perplexity on the 1 Billion Word (1BW) benchmark vs a (highly extrapolated) human perplexity of 12

The LAMBADA dataset was also constructed using humans to predict the missing words, but GPT-3 falls far short of perfection there, so while I can't numerically answer it (unless you trust OA's reasoning there), it is still very clear that GPT-3 does not match or surpass humans at text prediction.

Comment by gwern on If GPT-6 is human-level AGI but costs $200 per page of output, what would happen? · 2020-10-09T16:35:38.990Z · LW · GW

I wouldn't use this metric. I don't see how to map between it and anything we care about.

Nevertheless, it works. That's how self-supervised training/pretraining works.

If it's defined in terms of accuracy when predicting the next word, I won't be surprised if existing language models already outperform humans.

They don't. GPT-3 is still, as far as I can tell, about twice as bad in an absolute sense as humans in text prediction:

Comment by gwern on [AN #120]: Tracing the intellectual roots of AI and AI alignment · 2020-10-07T19:32:51.420Z · LW · GW

The authors then develop their own method, Maia. They talk about it as a “modification of the AlphaZero architecture”, but as far as I can tell it is simply behavior cloning using the neural net architecture used by Leela. As you might expect, this does significantly better, and finally satisfies the property we would intuitively want: the best predictive model for a human of some skill level is the one that was trained on the data from humans at that skill level.

Yeah, I think that's all they mean: the CNN and input/output are the same as Leela's, which are the same as AlphaZero's. But it does differ from behavior cloning in that they stratify the samples - typically, behavior cloning dumps in all available expert samples (perhaps with a minimum cutoff rating, which is how AlphaGo filtered its KGS pretraining) and trains on them all equally.

Personally, I would've trained a single conditional model with a specified player-Elo for each move, instead of arbitrarily bucketing into 9 levels of Elo ranges, but perhaps they have so many games that each bucket is enough (12m each as they emphasize) and they preferred to keep it simple and spend data/compute instead of making the training & runtime more complicated.

Comment by gwern on Is there any work on incorporating aleatoric uncertainty and/or inherent randomness into AIXI? · 2020-10-04T18:09:19.648Z · LW · GW

Couldn't you just treat any 'stochastic' environment as hidden-variable theories - actually being a deterministic program with a PRNG appended whose seed you don't know?
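A toy sketch of the hidden-variable view, with every name invented for the illustration: the environment is an ordinary deterministic program once its seed is fixed, and the agent's uncertainty is just uncertainty over which seed it is facing.

```python
import random


def make_environment(seed: int):
    """A 'stochastic' environment as a deterministic program plus a PRNG seed."""
    rng = random.Random(seed)

    def step(action: int) -> int:
        # The observation looks noisy, but is fully determined by seed + history.
        return action + rng.choice([-1, 0, 1])

    return step


def rollout(seed: int, actions):
    """Replay a fixed action sequence against the environment for one seed."""
    env = make_environment(seed)
    return [env(a) for a in actions]


def consistent_seeds(actions, observations, candidate_seeds):
    """Hypotheses (seeds) that reproduce the observed history exactly."""
    return [s for s in candidate_seeds if rollout(s, actions) == observations]


if __name__ == "__main__":
    actions = [0] * 20
    observed = rollout(7, actions)
    print(consistent_seeds(actions, observed, range(50)))
```

Each additional observation prunes the set of surviving seeds, which is the deterministic-hypothesis analogue of updating on noisy data.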

Comment by gwern on Distributed public goods provision · 2020-09-27T01:04:04.464Z · LW · GW

Is there any connection to quadratic funding?

Comment by gwern on Environments as a bottleneck in AGI development · 2020-09-23T18:48:37.345Z · LW · GW

"Blessings of scale" observations aside, it seems like right now, environments are not the bottleneck to DL/DRL work. No one failed to solve Go because gosh darn it, they just lacked a good Go simulator which correctly implemented the rules of the game; the limits to solving ALE-57 (like Montezuma's Revenge) in general or as a single multi-task agent do not seem to be lack of Atari games where what we really need is ALE-526*; Procgen performance is not weak because of insufficient variation in levels; OpenAI Universe failed not for lack of tasks, to say the least; the challenge in creating or replicating GPT-3 is not in scraping the text (and GPT-3 didn't even run 1 epoch!). Datasets/environments sometimes unlock new performance, like ImageNet, but even when one saturates, there's typically more datasets which are not yet solved and cannot be solved simultaneously (JFT-300M, for example), and in the case of RL of course compute=data. If you went to any DRL researcher, I don't think many of them would name "we've solved all the existing environments to superhuman level and have unemployed ourselves!" as their biggest bottleneck.

Is it really the case that at some point we will be drowning in so many GPUs and petaflops that our main problem will become coming up with ever more difficult tasks to give them something useful to train on? Or is this specifically a claim about friendly AGI, where we lack any kind of environment which would seem to force alignment for maximum score?

* Apparently the existing ALE suite was chosen pretty haphazardly:

Our testing set was constructed by choosing semi-randomly from the 381 games listed on Wikipedia at the time of writing. Of these games, 123 games have their own Wikipedia page, have a single player mode, are not adult-themed or prototypes, and can be emulated in ALE. From this list, 50 games were chosen at random to form the test set.

I wonder how the history of DRL would've changed if they had happened to select from the other 73, or if Pitfall & Montezuma's Revenge had been omitted? I don't, however, think it would've been a good use of their time in 2013 to work on adding more ALE games rather than, say, debugging GPU libraries to make it easier to run NNs at all...

Comment by gwern on The Haters Gonna Hate Fallacy · 2020-09-22T18:16:45.382Z · LW · GW

Communication is hard and – importantly – contextual. Most of your readers will be reasonable people

You think this partially because you are not famous or a popular writer.

By the 1% rule of Internet participation, you hear mostly from an extremely self-selected group of critics. You don't hear from the reasonable people, you hear from the unreasonable people. The more popular you get, the more this is true. And there is a lizardman constant going on: there is a fringe of crazy, stubborn readers who will fail to read the most plain and straightforward writing, misinterpret it in the wackiest way, hate you more the better you write, and amplify the craziest things they can find. (At my level of relative obscurity, it's petty stuff: sneers, doxing, death/swatting threats, ML researchers trying to get me fired, FBI visits, that sort of thing. Scott seems to have similar issues, just more so. But by the time you reach Tim Ferriss numbers of readers, this will have escalated to 'attempted kidnappings by organized crime' levels of risk, and he notes that it escalates still further to attempted murder of popular YouTubers etc.)

Combine this with the asymmetry of loss and reward, where criticism hurts a lot more than praise helps, and the more popular you get, the worse you will feel about everything you write or do, regardless of quality.

...Unless you constantly keep in mind: "haters gonna hate". If a criticism doesn't immediately make sense to you or you felt you dealt with it adequately, and it comes from someone you don't already know or trust, then oh well - haters gonna hate. If you're genuinely unsure, run a poll or A/B test or something to hear from a less self-selected sample - but do anything other than naively listening to and believing your critics! That's a luxury permitted only the most obscure or heavily filter-bubbled.

Comment by gwern on Has anyone written stories happening in Hanson's em world? · 2020-09-21T21:55:10.608Z · LW · GW

What about your "I am a contract-drafting em" poem?

Comment by gwern on Where is human level on text prediction? (GPTs task) · 2020-09-21T15:17:22.252Z · LW · GW

It's in the figure.

Comment by gwern on Where is human level on text prediction? (GPTs task) · 2020-09-21T03:08:01.301Z · LW · GW

Might as well finish out this forecasting exercise...

If we assume compute follows the current trend of peak AI project compute doubling every 3.4 months, then 2.2e6× more compute would be log2(2.2e6) ≈ 21 doublings away - or 21*(3.4/12) ≈ 6 years, or roughly 2026-2027. (Seems a little unlikely.)

Going the other direction, Hernandez & Brown 2020's estimate is that, net of hardware & algorithmic progress, the cost of a fixed level of performance halves every 16 months; so if GPT-3 cost ~$5m in early 2020, then it'll cost $2.5m around mid-2021, and so on. Similarly, a GPT-human requiring 2.2e6× more compute would presumably cost on the order of $10 trillion in 2020, but after 14 halvings (18 years) would cost $1b in 2038.
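The arithmetic in this forecast can be rerun in a few lines. The constants are the ones quoted in the comment (2.2e6× compute, 3.4-month doubling, 16-month halving, ~$5m for GPT-3); the function names are illustrative, and the unrounded outputs may differ slightly from the rounded figures in the text.

```python
import math

COMPUTE_MULTIPLIER = 2.2e6   # extra compute relative to GPT-3, from the comment
DOUBLING_MONTHS = 3.4        # peak-project compute doubling time
HALVING_MONTHS = 16          # cost halving time for a fixed level of performance
GPT3_COST_USD = 5e6          # rough early-2020 cost assumed in the comment


def years_until_compute(multiplier: float = COMPUTE_MULTIPLIER) -> float:
    """Years for peak compute to grow by `multiplier` at the doubling trend."""
    return math.log2(multiplier) * DOUBLING_MONTHS / 12


def cost_after_halvings(halvings: int) -> float:
    """Cost of the scaled-up run after a number of cost halvings."""
    return GPT3_COST_USD * COMPUTE_MULTIPLIER / 2 ** halvings


if __name__ == "__main__":
    print("doublings needed:", math.log2(COMPUTE_MULTIPLIER))
    print("years at the doubling trend:", years_until_compute())
    print("cost in 2020 dollars:", cost_after_halvings(0))
    print("cost after 14 halvings:", cost_after_halvings(14))
    print("years for 14 halvings:", 14 * HALVING_MONTHS / 12)
```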

Metaculus currently seems to be roughly in between 2027 and 2038, incidentally.

Comment by gwern on Why GPT wants to mesa-optimize & how we might change this · 2020-09-21T00:57:28.696Z · LW · GW

It still is; it's just that beam search (or other search strategies) seems to be mostly useful for closed-ended short text generation; translating a sentence is apparently a task with enough right-or-wrong-ness to it that beam search taps into no pathologies. But they get exposed in open-ended longform generation.

Comment by gwern on Where is human level on text prediction? (GPTs task) · 2020-09-20T19:35:00.580Z · LW · GW

It's probably a lower bound. These datasets tend to be fairly narrow by design. I'd guess it's more than 2x across all domains globally. And cutting the absolute loss by 50% will be quite difficult. Even increasing the compute by 1000x only gets you about half that under the best-case scenario... Let's see, to continue my WebText crossentropy example, 1000x reduces the loss by about a third, so if you want to halve it (we'll assume that's about the distance to human performance on WebText) from 1.73 to 0.86, you'd need (2.57 * (3.64 * (10^3 * x))^(-0.048)) = 0.86 where x = 2.2e6 or 2,200,000x the compute of GPT-3. Getting 2.2 million times more compute than GPT-3 is quite an ask over the next decade or two.
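The extrapolation above can be checked numerically. This sketch takes the power law exactly as written in the comment, L(C) = 2.57 * C^(-0.048), and assumes that 3.64e3 is GPT-3's training compute in the units the law expects (petaflop/s-days, as far as I can tell); the function names are illustrative.

```python
SCALE = 2.57            # coefficient of the compute scaling law, from the comment
EXPONENT = -0.048       # exponent of the compute scaling law, from the comment
GPT3_COMPUTE = 3.64e3   # assumed to be GPT-3's compute in petaflop/s-days


def loss(compute: float) -> float:
    """Predicted cross-entropy loss at a given compute budget."""
    return SCALE * compute ** EXPONENT


def compute_multiplier_for(target_loss: float) -> float:
    """How many times GPT-3's compute is needed to reach `target_loss`."""
    needed = (target_loss / SCALE) ** (1 / EXPONENT)
    return needed / GPT3_COMPUTE


if __name__ == "__main__":
    print("GPT-3 loss:", loss(GPT3_COMPUTE))
    print("loss ratio after 1000x compute:",
          loss(1000 * GPT3_COMPUTE) / loss(GPT3_COMPUTE))
    print("multiplier to reach 0.86:", compute_multiplier_for(0.86))
```

Because the exponent is so small, each halving of the loss requires multiplying compute by 2^(1/0.048), roughly 1.9 million, before accounting for the rounding of 1.73/2 to 0.86.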

Comment by gwern on Where is human level on text prediction? (GPTs task) · 2020-09-20T18:22:25.349Z · LW · GW

Looking more into reported perplexities, the only benchmark which seems to allow direct comparison of human vs GPT-2 vs GPT-3 is LAMBADA.

LAMBADA was benchmarked at a GPT-2 perplexity of 8.6, and a GPT-3 perplexity of 3.0 (zero-shot) & 1.92 (few-shot). OA claims in their GPT-2 blog post (but not the paper) that human perplexity is 1-2, but provides no sources and I couldn't find any. (The authors might be guessing based on how LAMBADA was constructed: examples were filtered by whether two independent human raters provided the same right answer.) Since LAMBADA is a fairly restricted dialogue dataset, although constructed to be difficult, I'd suggest that humans are much closer to 1 than 2 on it.

So overall, it looks like the best guess is that GPT-3 continues to have somewhere around twice the absolute error of a human.

Comment by gwern on Mati_Roy's Shortform · 2020-09-20T17:35:14.399Z · LW · GW

and error and hyperparameter tuning that would probably increase the cost several-fold.

All of which was done on much smaller models and GPT-3 just scaled up existing settings/equations - they did their homework. That was the whole point of the scaling papers, to tell you how to train the largest cost-effective model without having to brute force it! I think OA may well have done a single run and people are substantially inflating the cost because they aren't paying any attention to the background research or how the GPT-3 paper pointedly omits any discussion of hyperparameter tuning and implies only one run (eg the dataset contamination issue).

Comment by gwern on Where is human level on text prediction? (GPTs task) · 2020-09-20T17:22:34.430Z · LW · GW

To simplify Daniel's point: the pretraining paradigm claims that language draws heavily on important domains like logic, causal reasoning, world knowledge, etc; to reach human absolute performance (as measured in prediction: perplexity/cross-entropy/bpc), a language model must learn all of those domains roughly as well as humans do; GPT-3 obviously has not learned those important domains to a human level; therefore, if GPT-3 had the same absolute performance as humans but not the same important domains, the pretraining paradigm must be false because we've created a language model which succeeds at one but not the other. There may be a way to do pretraining right, but one turns out to not necessarily follow from the other and so you can't just optimize for absolute performance and expect the rest of it to fall into place.

(It would have turned out that language models can model easier or inessential parts of human corpuses enough to make up for skipping the important domains; maybe if you memorize enough quotes or tropes or sayings, for example, you can predict really well while still failing completely at commonsense reasoning, and this would hold true no matter how much more data was added to the pile.)

As it happens, GPT-3 has not reached the same absolute performance because we're just comparing apples & oranges. I was only talking about WebText in my comment there, but Omohundro is talking about Penn Tree Bank & 1BW. As far as I can tell, GPT-3 is still substantially short of human performance.

Comment by gwern on Where is human level on text prediction? (GPTs task) · 2020-09-20T17:06:19.567Z · LW · GW

I think Omohundro is wrong here. His GPT-3 perplexity of 20.5 must be for Penn Tree Bank. However, his 'humans' perplexity of 12 is for a completely different dataset! Tracing his citations from his video leads to Shen et al 2017, which uses the 1 Billion Word Benchmark. 1BW was not reported in the GPT-3 paper because it was one of the datasets affected by contamination and dropped from evaluation.

I've never read the Penn Tree Bank or 1BW so I can't compare. At best, I'd guess that if 1BW is collected from "English newspapers", that's less diverse than the Brown Corpus which goes beyond newspapers, and so perplexities will be lower on 1BW than PTB. However, some searching turned up no estimates for human performance on either PTB or WebText, so I can't guess what the real human vs GPT-3 comparison might be. I'm also a little puzzled about what the 'de-tokenizers' are that the Radford GPT paper says are necessary for doing the perplexity calculations at all...

(There are a lot of papers estimating English text entropy in terms of bits per character, but because of the BPEs and other differences, I don't know how to turn that into a perplexity which could be compared to the reported GPT-3 performance on Penn Tree Bank/WebText/LAMBADA/etc, which is why I didn't include a human baseline in my comment there - I just don't know.)
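For what it's worth, the arithmetic of the conversion is simple; the hard part is the missing input. A minimal sketch, assuming you somehow knew the average number of characters per token (or per word, whichever unit the reported perplexity is normalized by) for the same text and tokenization. That ratio is exactly what is not available here, and the numbers in the example are hypothetical, not measurements.

```python
import math

def perplexity_from_bpc(bits_per_char, chars_per_token):
    """Per-token perplexity implied by a bits-per-character entropy estimate."""
    bits_per_token = bits_per_char * chars_per_token
    return 2 ** bits_per_token

def bpc_from_perplexity(perplexity, chars_per_token):
    """Inverse direction: bits per character implied by a per-token perplexity."""
    return math.log2(perplexity) / chars_per_token

# Hypothetical inputs, for illustration only: 1.0 bits/char, 4.0 chars/token.
example_ppl = perplexity_from_bpc(1.0, 4.0)
print(example_ppl)
```

The exponential makes the result very sensitive to the characters-per-token ratio, so a guess at that ratio would not give a trustworthy human baseline.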

So, am I right in thinking that if someone took random internet text and fed it to me word by word and asked me to predict the next word, I'd do about as well as GPT-2 and significantly worse than GPT-3?


Comment by gwern on Why GPT wants to mesa-optimize & how we might change this · 2020-09-20T01:44:04.861Z · LW · GW

Why is beam search missing? One possibility is that GPT-3 already does internal lookahead. OpenAI tried beam search, found it didn't improve text generation, and didn't bother adding it as an option. In other words, GPT-3 is already mesa-optimizing 😲

Beam search has never worked for likelihood-trained NNs, since at least char-RNNs back in 2015. Beam search does trigger repetition and other pathologies in GPT, see "The Curious Case of Neural Text Degeneration", Holtzman et al 2019. And while unlikelihood training seems to help, it's not a silver bullet, and is a bit ad hoc (especially if you think of it in terms of reinforcement learning).
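For readers who haven't implemented it, here is a minimal sketch of what beam search does: keep the k highest-likelihood partial sequences at each step instead of sampling. The toy bigram table is hypothetical and has nothing to do with GPT; it only shows the mechanism, namely that a wider beam finds a higher-likelihood sequence than greedy decoding, which is the objective Holtzman et al argue goes wrong for open-ended text.

```python
import math

# Hypothetical toy bigram model: P(next token | previous token).
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "the": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(model, beam_width, max_len):
    """Return finished sequences as (log_prob, tokens), best first."""
    beams = [(0.0, ["<s>"])]
    finished = []
    for _ in range(max_len):
        # Expand every live beam by every possible next token.
        candidates = []
        for logp, seq in beams:
            for tok, p in model[seq[-1]].items():
                candidates.append((logp + math.log(p), seq + [tok]))
        # Keep only the beam_width most likely candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, seq in candidates[:beam_width]:
            (finished if seq[-1] == "</s>" else beams).append((logp, seq))
        if not beams:
            break
    # Sequences still unfinished after max_len steps are dropped.
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished

greedy = beam_search(TOY_MODEL, beam_width=1, max_len=5)[0]
wide = beam_search(TOY_MODEL, beam_width=2, max_len=5)[0]
print(" ".join(greedy[1]), math.exp(greedy[0]))
print(" ".join(wide[1]), math.exp(wide[0]))
```

Greedy decoding commits to "the" because it is the likeliest first token, while a beam of 2 keeps "a" alive and ends up with the more probable full sequence.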

Comment by gwern on Superintelligence via whole brain emulation · 2020-09-20T00:25:23.327Z · LW · GW

Thus, it is probably important to be careful about not accelerating non-WBE neuromorphic AI while attempting to accelerate whole brain emulation. For instance, it seems plausible to me that getting better models of neurons would be useful for creating neuromorphic AIs while better brain scanning would not, and both technologies are necessary for brain uploading, so if that is true, it may make sense to work on improving brain scanning but not on improving neural models.

But what research improves brain imaging but not DL... One thing to point out about whole brain emulation vs 'de novo' AI is that it may be, in practice, nearly impossible to get WBEs without having already, much earlier, kickstarted 'de novo' AI.

If you can successfully scan and run a single whole brain, you got there by extensive brain imaging and brain scanning of much smaller chunks of many brains, and it seems like there is a lot of very transferable knowledge from the structure and activities of a human brain to artificial neural networks, which I dub "brain imitation learning". Not only do ANNs turn out to have fairly similar activation patterns to human brains in some respects (primarily visual cortex stuff), the human brain's activation patterns encode a lot of knowledge about how visual representations work which can be used to learn & generalize. (A particularly interesting example from this month is "Self-Supervised Natural Image Reconstruction and Rich Semantic Classification from Brain Activity", Gaziv et al 2020.) You might consider this a version of the pretraining paradigm or lexical hypothesis - the algorithms of general intelligence, and world knowledge, are encoded in the connectivity and activation patterns of a human brain and so training on large corpuses of such data to imitate the connectivity & activation patterns will provide an extremely powerful prior/initialization à la GPT-3 pretraining on large text datasets.

So, it is entirely possible that by the time you get to BCIs or whole-brain scanning apparatuses, these are providing high-volume data embeddings or structural/architectural constraints which help push deep learning approaches over the finish line to AGI by providing informative priors & meta-learning capabilities by conditioning on <100% data from many brains. (In fact, if you believe this won't happen, you have to explain what on earth is being done with all this extremely expensive data for decades on end, as it slowly ramps up from scanning insect-sized chunks to full monkey brains before finally an entire human brain is scanned 100% & they flip the giant red switch to make Mr John Smith, test subject #1918, wake up inside a computer. What is everyone doing before that?)

Whatever these DL systems may be, they won't be a single specific person, and they won't come with whatever safety guarantees people think an upload of Mr John Smith would come with, but they will come years or decades before.

Comment by gwern on ricraz's Shortform · 2020-09-18T00:33:50.179Z · LW · GW

Yes, there's something to that, but you have to be careful if you want to use that as an objection. Maybe you wouldn't easily think of it, but that doesn't exclude the possibility of you doing it: you can come up with algorithms you can execute which would spit out Egan-like ideas, like 'emulate Egan's brain neuron by neuron'. (If nothing else, there's always the ol' dovetail-every-possible-Turing-machine hammer.) Most of these run into computational complexity problems, but that's the escape hatch Egan (and Scott Aaronson has made a similar argument) leaves himself by caveats like 'given enough patience, and a very large notebook'. Said patience might require billions of years, and the notebook might be the size of the Milky Way galaxy, but those are all finite numbers, so technically Egan is correct as far as that goes.

Comment by gwern on ricraz's Shortform · 2020-09-17T21:56:05.791Z · LW · GW

Equivocation. "Who's 'we', flesh man?" Even granting the necessary millions or billions of years for a human to sit down and emulate a superintelligence step by step, it is still not the human who understands, but the Chinese room.

Comment by gwern on on “learning to summarize” · 2020-09-13T22:03:35.593Z · LW · GW

The latter. I didn't notice it was a link to a different paper, but I think my point stands: the better results in this paper compared to the previous finetuning paper can't be due to adding the KL constraint because they already had one. It has to be something else they changed, like more/better labels or bigger models.

Comment by gwern on on “learning to summarize” · 2020-09-13T17:41:02.620Z · LW · GW

The original paper & codebase definitely had KL penalties on the PPO policy. I spent a fair bit of time fiddling with it and letting it go high to see what adversarial ABC music examples it found in the hopes that it would train the reward model better when I labeled them. Didn't seem to work, it would just find similar and only slightly different examples.

Comment by gwern on on “learning to summarize” · 2020-09-12T18:01:05.221Z · LW · GW

No, not yet. (IMO, the power of differentiability is greatly underused. Everyone is locked into an 'optimize parameters based on data & loss' mindset, and few ever use the alternatives like 'optimize data/trajectory based on parameters & loss' or 'optimize loss based on data/parameters'.)
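To make 'optimize data based on parameters & loss' concrete: freeze the model's parameters and take gradient steps on the input instead. A minimal sketch in plain Python, where the "reward model" is a hypothetical three-weight logistic unit standing in for a trained network (its weights and the step size are made up), and the gradient with respect to the input is written out by hand.

```python
import math

# Hypothetical frozen "reward model": r(x) = sigmoid(w . x + b).
W = [1.5, -2.0, 0.5]
B = 0.1

def reward(x):
    z = sum(wi * xi for wi, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def reward_grad_wrt_input(x):
    """d r / d x = r * (1 - r) * w; the parameters W, B are never updated."""
    r = reward(x)
    return [r * (1.0 - r) * wi for wi in W]

def optimize_input(x0, steps=100, lr=0.5):
    """Gradient ascent on the input x, holding the model fixed."""
    x = list(x0)
    for _ in range(steps):
        g = reward_grad_wrt_input(x)
        x = [xi + lr * gi for xi, gi in zip(x, g)]
    return x

x_start = [0.0, 0.0, 0.0]
x_opt = optimize_input(x_start)
print(reward(x_start), reward(x_opt))
```

With a real network the hand-written gradient would be replaced by automatic differentiation, and the same loop is what finds adversarial examples, so some constraint on the input is usually needed.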

Comment by gwern on on “learning to summarize” · 2020-09-12T15:01:38.306Z · LW · GW

FWIW, Gwern reports trying OpenAI’s approach and finding the RL side specifically frustrating and unstable; this is pretty normal with RL, and compatible with the reward-model part being very successful in its own domain. It’s not clear whether OpenAI got the RL part to work well because they did something right, or because they have lots of resources and can keep trying over and over until it works.

At the time, I figured that it was probably a sample-efficiency problem: the reward model just wasn't picking up on the subtle esthetics I wanted it to. I see this as supported by their new results: large models are more sample-efficient, so unsurprisingly, it works a lot better - the reward model can finally manage to understand what the preferences are, so it can provide a real signal to the RL training.

They seem to think it has more to do with label quality / better raters, which I didn't think was my problem (who better than me to rate my preferred ABC samples?), but better label quality is sort of like better sample-efficiency; I haven't read the paper in enough detail to see if they ablated model size vs label n vs label quality to get an idea of where the improvement is coming from.

Again, wouldn’t it be nice if we could avoid the need for this thing and just train on the preferences directly

Accept no substitutes! Gradient ascent directly on the differentiable reward/environment model!

Some new links on that topic:

Comment by gwern on Building brain-inspired AGI is infinitely easier than understanding the brain · 2020-09-10T17:18:02.178Z · LW · GW

There are also some links in

Comment by gwern on [AN #116]: How to make explanations of neurons compositional · 2020-09-09T20:49:20.426Z · LW · GW

The composition paper seems to exemplify what I talk about as my intuition for how NNs work. The models are both very small and trained on little data, but image classification seems to be much easier than NLP (which is why the DL revolution came to image classification many years before NLP), so it's enough to train the CNN to have fairly meaningful disentangled representations of the kind we expect; their RNN model, however, continues to grope through relatively superficial associations and tricks, as the text database is relatively tiny. I'd predict that if they analyze much larger networks, like BiT or GPT-3, they'd find much more composition, and much less reliance on polysemanticity, and less vulnerability to easy 'copy-paste' adversarial examples.

Comment by gwern on August 2020 newsletter · 2020-09-08T00:59:51.696Z · LW · GW

Fortunately, it's not as if the IEEE have any special immunity to Sci-Hub.

Comment by gwern on August 2020 newsletter · 2020-09-07T21:48:13.814Z · LW · GW

A lot of the links are spillover from continued editing of my GPT-3 pages, which is probably why.

And yes, it is intentional. The IEEE is remarkably humorless about people hosting PDFs.

Comment by gwern on [AN #115]: AI safety research problems in the AI-GA framework · 2020-09-02T17:30:06.262Z · LW · GW

Is this page completely unreadable for anyone else?

Comment by gwern on interpreting GPT: the logit lens · 2020-09-02T00:05:28.867Z · LW · GW

Related layer visualizations: "Looking for Grammar in All The Right Places".

Comment by gwern on interpreting GPT: the logit lens · 2020-09-02T00:02:46.452Z · LW · GW

Doing it with GPT-3 would be quite challenging just for compute requirements like RAM. You'd want to test this out on GPT-2-117M first, definitely. If the approach works at all, it should work well for the smallest models too.