Posts

June 2020 gwern.net newsletter 2020-07-02T14:19:08.696Z · score: 15 (3 votes)
GPT-3 Fiction Samples 2020-06-25T16:12:05.422Z · score: 61 (19 votes)
May Gwern.net newsletter (w/GPT-3 commentary) 2020-06-02T15:40:37.155Z · score: 32 (12 votes)
OpenAI announces GPT-3 2020-05-29T01:49:04.855Z · score: 64 (32 votes)
"AI and Efficiency", OA (44✕ improvement in CNNs since 2012) 2020-05-05T16:32:20.335Z · score: 47 (13 votes)
April 2020 gwern.net newsletter 2020-05-01T20:47:44.867Z · score: 10 (1 votes)
March 2020 gwern.net newsletter 2020-04-03T02:16:02.871Z · score: 12 (2 votes)
February 2020 gwern.net newsletter 2020-03-04T19:05:16.079Z · score: 14 (3 votes)
January 2020 gwern.net newsletter 2020-01-31T18:04:21.945Z · score: 19 (4 votes)
Subscripting Typographic Convention For Citations/Dates/Sources/Evidentials: A Proposal 2020-01-08T22:20:20.290Z · score: 59 (14 votes)
Dec 2019 gwern.net newsletter 2020-01-04T20:48:48.788Z · score: 16 (6 votes)
Nov 2019 gwern.net newsletter 2019-12-02T21:16:04.846Z · score: 14 (4 votes)
October 2019 gwern.net newsletter 2019-11-14T20:26:34.236Z · score: 12 (3 votes)
September 2019 gwern.net newsletter 2019-10-04T16:44:43.147Z · score: 22 (4 votes)
"AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence", Clune 2019 2019-09-10T21:33:08.837Z · score: 14 (4 votes)
August 2019 gwern.net newsletter (popups.js demo) 2019-09-01T17:52:01.011Z · score: 12 (4 votes)
"Designing agent incentives to avoid reward tampering", DeepMind 2019-08-14T16:57:29.228Z · score: 29 (9 votes)
July 2019 gwern.net newsletter 2019-08-01T16:19:59.893Z · score: 24 (5 votes)
How Should We Critique Research? A Decision Perspective 2019-07-14T22:51:59.285Z · score: 49 (12 votes)
June 2019 gwern.net newsletter 2019-07-01T14:35:49.507Z · score: 30 (5 votes)
On Seeing Through 'On Seeing Through: A Unified Theory': A Unified Theory 2019-06-15T18:57:25.436Z · score: 27 (7 votes)
On Having Enough Socks 2019-06-13T15:15:21.946Z · score: 21 (6 votes)
May gwern.net newsletter 2019-06-01T17:25:11.740Z · score: 17 (5 votes)
"One Man's Modus Ponens Is Another Man's Modus Tollens" 2019-05-17T22:03:59.458Z · score: 34 (5 votes)
April 2019 gwern.net newsletter 2019-05-01T14:43:18.952Z · score: 11 (2 votes)
Recent updates to gwern.net (2017–2019) 2019-04-28T20:18:27.083Z · score: 36 (8 votes)
"Everything is Correlated": An Anthology of the Psychology Debate 2019-04-27T13:48:05.240Z · score: 49 (7 votes)
March 2019 gwern.net newsletter 2019-04-02T14:17:38.032Z · score: 19 (3 votes)
February gwern.net newsletter 2019-03-02T22:42:09.490Z · score: 13 (3 votes)
'This Waifu Does Not Exist': 100,000 StyleGAN & GPT-2 samples 2019-03-01T04:29:16.529Z · score: 39 (12 votes)
January 2019 gwern.net newsletter 2019-02-04T15:53:42.553Z · score: 15 (5 votes)
"Forecasting Transformative AI: An Expert Survey", Gruetzemacher et al 2019 2019-01-27T02:34:57.214Z · score: 17 (8 votes)
"AlphaStar: Mastering the Real-Time Strategy Game StarCraft II", DeepMind [won 10 of 11 games against human pros] 2019-01-24T20:49:01.350Z · score: 62 (23 votes)
Visualizing the power of multiple step selection processes in JS: Galton's bean machine 2019-01-12T17:58:34.584Z · score: 27 (8 votes)
Littlewood's Law and the Global Media 2019-01-12T17:46:09.753Z · score: 37 (8 votes)
Evolution as Backstop for Reinforcement Learning: multi-level paradigms 2019-01-12T17:45:35.485Z · score: 18 (4 votes)
December gwern.net newsletter 2019-01-02T15:13:02.771Z · score: 20 (4 votes)
Internet Search Tips: how I use Google/Google Scholar/Libgen 2018-12-12T14:50:30.970Z · score: 54 (13 votes)
November 2018 gwern.net newsletter 2018-12-01T13:57:00.661Z · score: 35 (8 votes)
October gwern.net links 2018-11-01T01:11:28.763Z · score: 31 (8 votes)
Whole Brain Emulation & DL: imitation learning for faster AGI? 2018-10-22T15:07:54.585Z · score: 15 (5 votes)
New /r/gwern subreddit for link-sharing 2018-10-17T22:49:36.252Z · score: 46 (14 votes)
September links 2018-10-08T21:52:10.642Z · score: 18 (6 votes)
Genomic Prediction is now offering embryo selection 2018-10-07T21:27:54.071Z · score: 39 (14 votes)
August gwern.net links 2018-09-25T15:57:20.808Z · score: 18 (5 votes)
July gwern.net newsletter 2018-08-02T13:42:16.534Z · score: 24 (8 votes)
June gwern.net newsletter 2018-07-04T22:59:00.205Z · score: 36 (8 votes)
May gwern.net newsletter 2018-06-01T14:47:19.835Z · score: 73 (14 votes)
$5m cryptocurrency donation to Alcor by Brad Armstrong in memory of LWer Hal Finney 2018-05-17T20:31:07.942Z · score: 48 (12 votes)
Tech economics pattern: "Commoditize Your Complement" 2018-05-10T18:54:42.191Z · score: 98 (28 votes)

Comments

Comment by gwern on How "honest" is GPT-3? · 2020-07-08T22:32:38.084Z · score: 17 (6 votes) · LW · GW

I think my Tom Swifty and other experiments show pretty conclusively that GPT-3 just doesn't get puns (likely due to BPEs), so of course it wouldn't be able to explain them either (likewise). I for one am pleased that puns are what separate us from the machines, and have spent a lifetime preparing for this moment...

But to test it out:

Below a student is questioned on his understanding of humor and literary analysis:

Q. Explain the pun in the following Tom Swifty: '"I got a great deal on my new Raybans", Tom said shadily.'
A. The pun is on "shadily": Raybans are a sunglass brand, which make things look shady, but Tom is implying he purchased unusually cheap, and thus probably counterfeit, sunglasses, which is a 'shady' or dark or criminal or unethical thing to do. The pun conflates blocking light with economic crimes.

Q. Explain the pun in the following joke: "Why do so many bakeries have cats? Because cats are always kneading."
A. The joke implies that bakeries use cats to knead their dough, because cats regularly make kneading motions on furniture or humans, which imitates actions they take as kittens to get milk from their mother; the pun uses 'knead' in two senses, of kneading dough to bake bread and kneading human laps for attention. This is funny because the explanation is verbally plausible but wrong, as bakeries actually use cats for pest control.

Q. Explain the pun in the following joke: "Why can Paypal freeze accounts without recourse or explanation? Because long ago they decided users must accept their Terms of Service and the clause of Paypal infallibility."
A. The pun is on the phrase "Paypal infallibility", which alludes to the Catholic doctrine of "papal infallibility", that the pope cannot be wrong in religious dogma and papal judgments cannot be appealed or questioned. This was attacked as religious tyranny and an overreach of power. Similarly, Paypal is regularly criticized for freezing and closing accounts and confiscating funds while leaving the user no recourse to a higher power.

Q. Explain the pun in the following joke: "What does Adult Swim call their physical retail stores? Brick and Morty."
A.

|The pun "Brick and Morty" alludes to the cable television network "Adult Swim", which broadcasts a cartoon series called "Rick and Morty", a humorous parody of "Back to the Future" and other science fiction films. "Brick and Morty" refers not to the characters in the cartoon, but to physical stores that sell DVDs and merchandise based on the cartoon. The pun conflates two meanings of "Brick and Mortar", a brick-and-mortar store and a brick which is part of a building.

|The joke plays on the similarity of words "Brick and Mortar" (referring to physical stores instead of online) and the characters called "Brick and Mortar" on Adult Swim shows "Rick and Morty".

Q. Explain the pun in the following joke: "What do you call a generic unicorn? An Uniporn"
A. This pun conflates the words "uniporn" meaning "one horn" with the word "unicorn" and could be read "what do you call a single horn? one horn" or "A single horn, what does it do? It gives you one horn".

Q. Explain the pun in the following joke: "What did one melon say to the other? Cantaloupe".

A. The puns in this joke are on "cantaloupe" and "cantaloupes".

Q. Explain the pun in the following joke: "Why is it good to have a lawyer? Because pea ple (people)"

A. The puns in this joke are on "pea ple (people)" and a vegetable called "peas".

|This pun plays on the phrase "Brick and mortar stores" which refers to physical retail stores that exist in the real world where you can purchase goods and products. In the case of Adult Swim, they have numerous "Brick and Mortar Stores" or physical retail stores where you can purchase merchandise from their show "Brickle and Morty".

Q. Explain the pun in the following joke: "Why did the chicken cross the Möbius strip? To get to the other siiiiide."
A. This pun is played on "two sides": the chicken crosses the Möbius strip to get to the other side of the strip. A Möbius Strip has only one side.

Q. Explain the pun in the following joke: "How many hipsters does it take to change a lightbulb? Two: one to Instagram how cool it is at that moment, and one to change it."
A. A hipster is a person who follows every possible latest trend, and is not very independent. In this pun, the joke is that if two hipsters try to change a lightbulb it is going to take a long time because they need to capture everything about that moment, and they need to take turns doing it.

Q. Explain the pun in the following joke: I really want the new Jay-Z Album but I don't have a "Green Light" to buy it
A. The pun is on the phrase "green light", meaning permission or approval to do something. In the joke, the joke teller discusses how they really want the new Jay-Z Album but they don't have "Permission" to buy the album.

Q. Explain the pun in the following joke: "When is a door not a door? When it is ajar."
A. A "jar" is a container, in this case for liquid, such as milk. If a door is ajar, it is open. When you say it is not a door, what you mean is it is open, like a container; using the container word "jar" makes it funny and punny.

Q. Explain the pun in the following joke: "What did the hurricane say to the coconut palm tree? Hold on to your nuts, this is no ordinary blow job."
A. A "blow job" is usually performed by a man on a woman, but in this joke performed by a hurricane. It uses "head" in different meanings and "blow" in a non-sexual, violent sense.

|The joke implies that the retail chain has a store named "Brick" and another named "Morty", which is a reference to the television shows "Rick and Morty" and "Brickleberry", two animated comedy shows that are inappropriate for children. Adult Swim is a late night programming block on cable television, generally intended for mature audiences; the name "Brick" is a reference to the aforementioned "Brickleberry", and "Morty" a reference to "Rick and Morty".

Q. Explain the pun in the following joke: "How many fitness coaches does it take to screw in a lightbulb? As many as you can stick around for."
A. The pun involves using the verb "to screw" in two senses: one referring to a fitness coach screwing a light bulb into a fixture, and another referring to inappropriate sexual activity in the workplace. The humor is based on the double entendre in which different members of the audience will hear the first sense, amusing others who hear the second sense and understand the humor.

Q. Explain the pun in the following joke: "How many marriage therapists does it take to change a light bulb? We don't know, most of them aren't light bulbs!"
A. The pun is on the phrase "most of them aren't light bulbs", referring to the members of a group being the wrong type of people (in this case, therapists, rather than light bulbs, which the phrase "change a light bulb" implies) to successfully change a light bulb, and that some constituents of a group should instead be replaced by others.

...Q. Explain the pun in the following joke: "Santa Claus' helpers are known as subordinate Clauses."
| The word 'subordinate' is defined as 'secondary or lower in rank, power or authority'. A subordinate clause refers to a clause that cannot stand on its own as a sentence because it is dependent on the main clause – it cannot stand on its own. 'Subordinate' clauses take on many different shapes, but they all have that in common – they all cannot stand on their own.
A subordinate clause is a clause that does not express a complete thought - it must be attached to a main clause to form a complete sentence.
A subordinate clause begins with a subordinating conjunction (e.g. after, because, as) or a relative pronoun (who, whose, which, that, where, etc.)
Q. Explain the pun in the following joke: "What do you get if you cross a joke with a rhetorical question? An utterly rhetorical joke."
A. The pun is on the word 'utterly', which literally means to end, finish, or eradicate any or all traces of something, and is now an intensifier.

|The pun is on the similarity of "subordinate clauses" and "subordinate Clauses" when the latter phrase has a capital C; a subordinate clause is a clause within a sentence which establishes the logical connection between the independent clause and the main point of the sentence. In this case, it establishes the connection between Santa Claus' helpers and Santa Claus.

Q. Explain the pun in the following joke: "Why does Nickelodeon refuse to air "A Song of Ice and Fire"? Because they think George R. R. Martin socializes with too many Ice and Fire fans."
A. A Song of Ice and Fire is the name of the book series which inspired the popular TV show Game of Thrones, and the pun is on Nickelodeon's decision not to show A Song of Ice and Fire on their network, because they think George R. R. Martin socializes with too many fans of it on social media.

Q. Explain the pun in the following joke: "What kind of books do the Borg like to write? Borg-Warner."
A. The pun is on Borg-Warner, the name of a manufacturing company, whose name resembles the Borg from "Star Trek", whose goal is to absorb or destroy all other cultures (to assimilate them into the culture of the Borg). Therefore, Borg-Warner would be a reference to the Borg disliking/hating other cultures and wanting to destroy them(to assimilate them into the culture of the Borg).

Q. Explain the pun in the following joke: "A vulture boards an airplane carrying two dead raccoons. The flight attendant says, 'I'm sorry, sir, only one carrion bag per passenger.'"
A. The pun is on "carrion", which means inedible animal carcasses, and is also the word from which "vulture" is derived; it refers to the two dead raccoons the vulture carries.

...Q. Explain the pun in the following joke: "Is your refrigerator running? Yes? Then you had better go catch it!"
A. |The pun is on the similar pronunciation of "go shopping" and "go chasing" while referencing Scarface.

Q. Explain the pun in the following joke: "What is the difference between a Folsom Prison concert and a youth soccer game? At a Folsom Prison concert, the inmates are all restrained in chairs with ball and chain on their legs."
A. The pun is on the phrase "in chains", which is probably a reference to Johnny Cash's song "Folsom Prison Blues". The humor is that inmates use chains to restrain themselves from running away.

Q. Explain the pun in the following joke: "When faced with an ethical dilemma, who did Kirk turn to? Spock! Of course for advice!"
A. The pun is on the phrase "to turn to":

Q. Explain the pun in the following joke: "Q: What did the two peanuts say after they got on the elevator? A: "This elevator tastes terrible!"
A. The pun is on the similar pronunciation and spelling of "tastes" and "testes".

Q. Explain the pun in the following joke: "Why couldn't the pirate find his treasure? Because he wasn't a chartered surveyor."
A. The pun is on the two meanings of "chartered surveyor":

Q. Explain the pun in the following joke: "What did Dolly Parton say when she saw the book "Erotic Stories for Pun Lovers"? "Just the titles makes me climax!"
A. The pun is on both meanings of "climax":

|A. The pun is on the phrase "catch cold", which is similar enough to "catch (the) fridge" that it confuses the listener for a moment while the rest of the joke is explained, but has a very different meaning.

It fails abysmally. In further completions, it tries out a few memorized real jokes as well as bizarre GPT-3-jokes, but likewise, the explanations make zero sense.

GPT-3 just doesn't understand puns, so it can't interpret them either, honestly or not.

Comment by gwern on What should we do about network-effect monopolies? · 2020-07-08T16:34:41.764Z · score: 6 (3 votes) · LW · GW

I agree. While WMF only receives the minutest fraction of the value it creates (or I should say, Wikipedia editors because they're the ones who do the actual work), that's not the limiting factor. Like education in America - throwing even more money at the people & systems you believe have failed is not the answer.

Even back in 2009 or so when I was warning the WMF about the editor retention crisis, an existential crisis, the WMF did not actually lack money, and it ramped up its fundraising greatly afterwards. What it lacked was any sense of priorities: it spent its time on prestige projects like sending DVDs to Africa instead of keeping the wiki community itself healthy and investing in things like a WYSIWYG editor. It's possible that if you gave the WMF enough billions of dollars, it would, by sheer chance, fund the things it needs to fund; but given that it showed it couldn't spend the money it did get effectively, I am not optimistic about the counterfactual here.

Comment by gwern on [Reference request] Can Love be Explained? · 2020-07-07T17:06:25.057Z · score: 11 (4 votes) · LW · GW

The short answer appears to be that once you get past factors which are visible to the naked eye and which everyone agrees on (it's better to be thin than fat, rich than poor, etc.), romantic attraction appears to be almost entirely idiosyncratic and unpredictable, even among identical twins: https://slatestarcodex.com/2016/10/21/the-heart-has-its-reasons-that-reason-knows-not-of/ http://cogprints.org/773/3/155.pdf It seems possible that this is a scientific question which is essentially unanswerable: the answers will be mostly "crud factor", at best explained by effectively undiscoverably complex interactions like epistasis, and evolved, like personality, for that reason (see Penke & Jokela). One might even wonder whether it is an example of deliberate developmental instability which evolved as a kind of bet-hedging.

Comment by gwern on Let There be Sound: A Fristonian Meditation on Creativity · 2020-07-04T17:46:25.813Z · score: 5 (2 votes) · LW · GW

What do you think of Schmidhuber's computational theory of creativity and music?

Comment by gwern on The silence is deafening – Devon Zuegel · 2020-07-04T17:26:17.108Z · score: 17 (8 votes) · LW · GW

I'm sure you could think of a dozen solutions to fill this out into a well-defined system if you spent 5 minutes thinking about it.

Zuegel's point is that you want some people to be able to express implicit or tacit disapproval in a less legible way than leaving a public criticism. To continue the dinner party analogy: you don't go to a dinner party with 10 people chosen at random from billions of people; they are your friends, relatives, coworkers, people you look up to, famous people, etc. A look of disapproval or a conspicuous silence from them is very different from context collapse causing a bunch of Twitter bluechecks to swarm your replies to crush dissent. So the question is whom to choose.

You could, for example, just disable these implicit downvotes for anyone you do not 'follow', or anyone you have not 'liked' frequently. You could have explicit opt-in where you whitelist specific accounts to enable feedback. You could borrow from earlier schemes for soft-voting or weighting of votes, like Advogato: votes are weighted by the social graph, and the more disconnected someone is from you, the less their anonymous downvote counts (falling off rapidly with distance).
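
To make that last idea concrete, here is a minimal sketch (Python, with made-up data structures rather than any real platform's API) of distance-weighted implicit downvotes, where a vote's weight falls off rapidly with social-graph distance:

    from collections import deque

    def graph_distance(follows, source, target, max_hops=4):
        """Hops from source to target over the 'follows' graph (BFS), or None if unreachable."""
        seen, frontier = {source}, deque([(source, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if node == target:
                return depth
            if depth == max_hops:
                continue
            for nxt in follows.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
        return None

    def downvote_weight(follows, author, voter):
        """Weight an implicit downvote by how close the voter is to the author."""
        d = graph_distance(follows, author, voter)
        if d is None:
            return 0.0           # total strangers' silent disapproval counts for nothing
        return 1.0 / (2 ** d)    # falls off rapidly with distance

    # Example: the author follows B, and B follows C; C's downvote counts half as much as B's.
    follows = {"author": {"B"}, "B": {"C"}}
    print(downvote_weight(follows, "author", "B"))  # 0.5
    print(downvote_weight(follows, "author", "C"))  # 0.25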

Comment by gwern on The Book of HPMOR Fanfics · 2020-07-03T18:17:46.915Z · score: 12 (6 votes) · LW · GW

It has 91 stories (96 in the table of content, but 5 are missing), 12,244 pages, and 3'384'120 words "written by many exceptional muggles".

This sounds discriminatory against AI-written fanfics.

Comment by gwern on Rudi C's Shortform · 2020-06-30T16:02:08.191Z · score: 3 (2 votes) · LW · GW

https://www.reddit.com/r/DecisionTheory/search?q=flair%3ATextbook&restrict_sr=on&include_over_18=on&sort=relevance&t=all might be a starting point.

Comment by gwern on Prediction = Compression [Transcript] · 2020-06-26T17:31:10.007Z · score: 5 (3 votes) · LW · GW

What if I take my model's predictions of future observations, and hash them in order of posterior probability until I run out of time or get a hit?

Comment by gwern on GPT-3 Fiction Samples · 2020-06-26T15:18:58.880Z · score: 5 (3 votes) · LW · GW

It's possible that GPT-3 is roughly at the point where the maximally naive simple text LM begins to hit the constant wall, but I don't regard this as important; as I emphasize at every turn, there are many distinct ways in which to improve it greatly using purely known methods, never mind future research approaches. The question is not whether there is any way GPT-4 might fail, but whether there is any way in which it might succeed.

Comment by gwern on GPT-3 Fiction Samples · 2020-06-26T15:16:34.026Z · score: 2 (1 votes) · LW · GW

I’m really curious to see some of the raw output (not curated)

You can read the random sample dump to get an idea of that, or Max Woolf's repo (both of which I link around the beginning). I'm not doing that for any of my prompts because right now the Playground is just way too much of a pain and errors out too regularly to make it feasible to generate, say, 100 1024-token completions for a specific prompt. I would need to get set up with the Python library for the API, and I've been busy exploring prompts & writing them up rather than programming.

On a similar note, I know there have been experiments using either a differently-trained GPT or other text-prediction models to try to score and collate GPT-3 output. I wonder if a. The best-of functionality could be used for something like this with some tweaks

Yes, best-of rankers like Meena's are basically just rankers which happen to use the same model to estimate & score completions by the total likelihood of the final sample. It works because the final sample may have a different, better total likelihood than the partial completions would indicate, and if you greedily maximized, you would immediately fall into repetition traps, while quasi-random (but still local) samples of the tree appear to avoid those very-high-likelihood traps in favor of sensible but still high-likelihood completions.
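
A minimal sketch of that kind of best-of ranking (Python; sample_completion and total_logprob are hypothetical stand-ins for whatever sampling and scoring functions a given model API provides):

    def best_of(prompt, n, sample_completion, total_logprob):
        """Draw n quasi-random samples, then keep the one the model itself scores
        as most likely overall; a finished sample can have a better total
        likelihood than its partial prefixes suggested, which is why this beats
        greedy maximization without falling into repetition traps."""
        candidates = [sample_completion(prompt) for _ in range(n)]
        return max(candidates, key=lambda c: total_logprob(prompt, c))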

Preference learning would be nice, but at least for GPT-2 it didn't work too well for me. I don't know if you could finetune a sanity-checking GPT-3 by doing something like flipping texts to generate logical vs illogical completions.

Comment by gwern on Can I archive content from lesswrong.com on the wayback machine (internet archive, archive.org) ? · 2020-06-24T21:14:22.372Z · score: 3 (2 votes) · LW · GW

Given the complexity of LW2, by design, you'll probably have better luck using the https://www.greaterwrong.com/ mirror.

Comment by gwern on List of public predictions of what GPT-X can or can't do? · 2020-06-22T00:40:21.945Z · score: 4 (2 votes) · LW · GW

An issue there is that you would be eating further into your context window by expanding it out: each of those words is going to take 1 or more BPEs, while I'm reasonably sure that the letter-by-letter approach is at least guaranteed to be 1 letter = 1 BPE. You also make it more likely that the decoding of the answer will screw up: the more BPEs it takes to express an answer, the more likely the top-k or top-p sampling will stochastically screw up an otherwise perfectly obvious correct answer. (You can see the stochasticity at play in the completions: e.g. "shame" vs "shames".)
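
A quick way to check the token-count tradeoff yourself (a sketch assuming the HuggingFace transformers GPT-2 tokenizer, which reportedly uses the same BPE vocabulary as GPT-3; the exact splits will vary by word):

    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")

    # Whole words usually compress into one or two multi-character BPEs, while
    # space-separated letters cost roughly one BPE per letter: easier for the
    # model to manipulate character-by-character, but it eats the context
    # window much faster.
    for text in ["shame", "s h a m e", "unscramble", "u n s c r a m b l e"]:
        tokens = tok.tokenize(text)
        print(f"{text!r}: {len(tokens)} BPE tokens -> {tokens}")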

Comment by gwern on Image GPT · 2020-06-21T22:10:16.073Z · score: 2 (1 votes) · LW · GW

Video is just a very large image (n times bigger). So as a quick heuristic, you can say that whatever you can do with images, you can do with video, just n times more expensive... Since iGPT is pretty expensive, I don't expect iGPT for video any more than I expect it for 512px images. With efficient attention mechanisms and hierarchy, it seems a lot more plausible. There are already RNNs for 64px video out to 25 frames, for example. I'm not sure directly modeling video is all that useful for self-driving cars. Working at the pixel level is useful pretraining, but it's not necessarily where you want to be for planning. (Would MuZero play Go better if we forced it to emit, based on its latent space being used for planning, a 1024px RGB image of a photorealistic Go board at every step in a rollout? Most attempts to do planning while forcing reconstruction of hypothetical states don't show good results.)
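
As a toy illustration of the 'n times more expensive' heuristic (treating every RGB subpixel as one token and assuming naive quadratic self-attention; iGPT's actual preprocessing differs, so this is only back-of-the-envelope):

    # Rough token counts if you model raw RGB subpixels directly:
    examples = {
        "32px image":            32 * 32 * 3,
        "512px image":           512 * 512 * 3,
        "64px video, 25 frames": 25 * 64 * 64 * 3,
    }
    for name, n in examples.items():
        # Naive self-attention scales as n^2, which is what makes dense
        # pixel-level video modeling so much more expensive than images.
        print(f"{name}: {n:,} tokens, ~{n*n:,} attention entries per layer")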

Comment by gwern on List of public predictions of what GPT-X can or can't do? · 2020-06-21T22:00:51.323Z · score: 5 (3 votes) · LW · GW

So it was pointed out to me that the GPT-3 paper doesn't report it solving full anagrams, just partial anagrams: first/last letter kept the same.

None of the models can reverse the letters in a word.... Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding operates on significant fractions of a word (on average ∼0.7 words per token), so from the LM's perspective succeeding at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also, CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word), requiring the model to perform some search to find the correct unscrambling. Thus, the skills involved appear to require non-trivial pattern-matching and computation.

This is something that seems to be a problem with a number of other things I've tried, like puns & rhyming. GPT-3 can still brute-force them, but I think it would do a lot better if it could just work on characters instead...

Anyway, repeating the above prompt, I get

  • =hassa;foi=foil;nriiutnc=instructions;
  • =hassa;tsoidr=s
  • =hisses;ejolm=j
  • =hisses;taht=that;
  • =hissed;nf'eos=

While not perfect, they at least do show some attempt to unscramble the letters (BPE(s)) of 'hhssae' to get an English word, and it does construct its own anagrams (particularly, taht=that).

Given the BPE issue, I wondered whether forcing it to encode each anagram the same way would improve performance. It's hard to learn the right permutation when, on the raw input basis, the input/output aren't even permutations of each other (because they map to completely different BPE encodings, depending on what shortcuts the BPE encoding provides for the real word). BPE is space-separated and includes fallback BPE integers for individual letters/Unicode points, so adding spaces should ensure that every 'h' in the scrambles maps to the same 'h' BPE in the unscramble, etc. So the prompt now looks like this:

...w i e r o d s w = w i d o w e r s ; o j t s = j o t s ; e g s e n = g e n e s ; u s l e l p t o = p o l l u t e s ; s l f o  a f = o f f a l  s ; y a s t l s t a c i i t l = s t a t i s t i c a l l y ; h t h c y m y l i l r a = r h y t h m i c a l l y ; o u a i d = a u d i o ;  d o s c = d o c  s ; x e v = v e x ; g d i w e i n l = w i e l d i n g ; p l u r e d y e t = r e p u t e d l y ; e s  e d w t = t w e e d  s ; g n i g r e s i n = r e s i g n i n g ; e d a i s l r = d e r a i l s ; h h s s a e =

This gives me completions like

  • s a e s h ; s l s a i = s a i l s ; n e u s e r = u r n e r s ; s e r a s n a = s a n a r i a s ; r s d a a i i s = d i s a s t e r s ; i r s t o i r u l e = r u l e r s ; w i s m a n r = r e s m a n ; s t a t i t s i n i t a = s i t a t i s t i c a l l y ; s l s t o r e = s t o r e s ; a i t s l i s = s i l i c a ; s o t s e d = s e d s ; e s i s t l r = s i l i t e r a t e ; o s i t s e a = s a e s h ; o c a l l = c a l l s ; l n i l o o t
  • h h s s a e = h u s s a r ; s t o i a l e n d o t = e n d o t o n i c s ; c l r e s e = r e s e n t s ; s s a e = s a a e ; r s s a i a s = a i s s a i s ; s w s s a i a s = a i s s a i s ; r l o a e i e o = o i l e a f ; s s e s a i i n s a = a i n s a i s s e s ; e s e i s i i e = i s i e s
  • = s a i s s a i n ; s i t i a l = s i t u a l ; s e i s i a = s i a s ; i s s i e t = s i s t e r ; e
  • = ia s h e s ; e r r a t = r a t t e r s ;
  • = ia a s s ;
  • = s a h s ;
  • = s h a r e s ;
  • = s h a m e s ;
  • = s h a m e ;
  • = s h i e l d s ;

To be fair, I went back and checked with GPT-2-1.5b to see whether the spaced version would suddenly work much better. With spaces, GPT-2-1.5b does pick up on the formatting pattern much better, at least, but it still completely fails to start solving anagrams or generating any valid ones of its own - instead, it just copies the input repeatedly:

e s  e d w t = t w e e d  s ; g n i g r e s i n = r e s i g n i n g ; e d a i s l r = d e r a i l s ; h h s s a e = t i c l e s ; d a t a l y ; e s  e d w t = t w e e d  s ; g n i g r e s i n = r e s i g n i n g ; e d a i s l r = d e r a i l s ; h h s s a e = v s . T w e r e t h e l y ; e s  e d w t = t w e e d  s ; g n i g r e s i n = r e s i g n i n g ; e d a i s l r = d e r a i l s ; h h s s a e = r e s i g n i n g = l l y ; e s  e d w t = t w e e d  s ; g n i g r e s i n = r e s i g n i n g ; e d a i s l r = d e r a i l s ; h h s s a e = s u b j e c t = d e s l y ; e s  e d w t = t w e e d  s ; g n i g r e s i n = r e s i g n i n g ; e d a i s l r = d e r a i l s ; h h s s a e = n e r w i d g i n g a l y ; e s  e d w t = t w e e d  s ; g n i g r e s i n = r e s i g n i n g ; e d a i s l r = d e r a i l s ; h h s s a e = n o t o n l i g h t e l y ; e s  e d w t = t w e e d  s ; g n i g r e s i n = r e s i g n i n g ; e d a i s l r = d e r a i l s ; h h s s a e = b e n r a c t o r e l l y ; e s  e d w t = t w e e d  s ; g n i g r e s i n = r e s i g n i n g ; e d a i s l r = d e r a i l s ; h h s s a e = g r u g t = h e s s = l y ; e s  e d w t = t w e e d  s ; g n i g r e s i n = r e s i g n i n g ; e d a i s l r = d e r a i l s ; h h s s a e = h o w e v e r [...etc...]

So my conclusion is: I still think GPT-3 does far better on anagrams than GPT-2, and you could say that GPT-3 is 'solving' anagrams to some degree (in a way that GPT-2 does not at all); and that the BPE encoding contributes to the difficulty both have in understanding the task.
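
For anyone who wants to replicate the space-separated format described above, a rough reconstruction (not my actual script) of how such prompts can be generated from a word list:

    import random

    def scramble_pair(word):
        """One 'scrambled = original' example, space-separating every letter so
        the same letters map to the same single-character BPEs on both sides
        of the '='."""
        letters = list(word)
        random.shuffle(letters)
        return f"{' '.join(letters)} = {' '.join(word)}"

    def build_prompt(words, n_examples, target):
        """Pack solved examples into the context, then leave the target
        scramble unsolved for the model to complete."""
        examples = [scramble_pair(w) for w in random.sample(words, n_examples)]
        scrambled = list(target)
        random.shuffle(scrambled)
        return " ; ".join(examples) + f" ; {' '.join(scrambled)} ="

    with open("/usr/share/dict/words") as f:
        words = [w.strip() for w in f if w.strip().isalpha()]

    print(build_prompt(words, 20, "hashes"))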

Comment by gwern on Image GPT · 2020-06-21T19:14:36.828Z · score: 4 (2 votes) · LW · GW

Being able to accomplish something is important even if it was predicted to be possible. No one is surprised that generative models do embody a lot of useful knowledge (that's much of the point), but it can be hard to tap into it.

The difference between GPT & iGPT for transfer learning is that GPT can be queried directly via its modality by putting in text: "Translate this into French", "what genre of text is this?", "tldr", etc. On the other hand, if you were querying iGPT by handing it half an image and expecting it to complete it in a useful way, there is absolutely nothing surprising about that being useful, obviously; but I have a hard time thinking of how you could implement classification by image completion! You normally have to get the knowledge out a different way, through an embedding which can be fed into a linear classification layer; if you can't do that, it's unclear what exactly you do. It was unclear how you use Sparse Transformers, PixelRNN, GANs, etc to do any of that. Now it's clearer.

As an analogous example, consider textual style transfer. You can't do it (pre-GPT-3, anyway). Do char-RNNs and Transformers understand the difference between authors and styles and content? Are they capable of textual style transfer? I would be shocked if they weren't: after all, they can uncannily mimic authors and write plausibly about all sorts of content. But nevertheless, they lack anything like a CNN's Gram matrix which you can easily optimize to do style transfer with. So, no one can do it. Someone finally figuring out how to do it would be big news even if the end output is not surprising.

Comment by gwern on Image GPT · 2020-06-21T19:07:11.454Z · score: 6 (3 votes) · LW · GW

True, but it's not like I wrote a very serious, mature prompt either; real job ads are much more boring and jargony. (I could try harder but my poetry explorations are more interesting to me.) I suspect that with that prompt, it creates a humorous storytelling frame and thus I am tapping into the fiction-writing skills as well: the genre of fake job application letter does exist, and sounds a lot like the second letter (eg Joey Comeau's Overqualified).

Comment by gwern on Training our humans on the wrong dataset · 2020-06-21T18:58:10.735Z · score: 14 (10 votes) · LW · GW

One way or another, we'd try to use the most relevant dataset first.

Otherwise known as "underfitting"...

Maybe you do this, but I, and many people in ML, do our best to avoid ever doing that. Transfer learning powers the best and highest-performing models. Even in pure supervised learning, you train on the largest dataset possible, and then finetune. And that works much better than training on just the target task. You cannot throw a stick in ML today without observing this basic paradigm.

I know, let's take a dataset of 2d images of cars and their 3d rendering and train the model on that first.

There are GAN papers, among others, which do pretty much this for inferring models & depth maps.

But that's just because the hard part, the training, is already done.

No. You don't do it 'just' to save computation. You do it because it learns superior representations and generalizes better on less data. That finetuning is a lot cheaper is merely convenient.

Given that your motivating analogy to machine learning is comprehensively wrong, perhaps you should rethink this essay.

Comment by gwern on Image GPT · 2020-06-21T14:12:40.756Z · score: 7 (4 votes) · LW · GW

For kicks, I wrote a job application prompt for GPT-3. Here's the result: https://pastebin.com/BQDnqqjd (I suspect P&G will be hiring the first candidate - but you miss 100% of the shots you don't take!).

Comment by gwern on Image GPT · 2020-06-21T14:05:48.258Z · score: 4 (2 votes) · LW · GW

I very definitely noticed the Sparse Transformer, but what you're missing is that the Sparse Transformer paper showed good compression performance but was small-scale & primarily about describing the Sparse Transformer/showing it works, and there's nothing in it about few-shot/transfer learning. There is no guarantee that it is learning particularly useful representations just because it predicts pixel-by-pixel well; they may be distributed throughout the GPT, somewhat like the problem of finding the equivalent of Gram matrices in text models (unlike the semi-supervised CNNs, where you can expect the embedding or pre-embedding to distill all the knowledge into one place, by design), and you can see in iGPT that getting the representation out is nontrivial - you can easily pick a bad layer to use as the embedding.
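
For concreteness, a sketch of the linear-probe evaluation used to 'get the representation out' (generic PyTorch; it assumes you have already pooled activations from some chosen layer of the frozen model into a feature matrix):

    import torch
    import torch.nn as nn

    def linear_probe(features, labels, n_classes, epochs=100, lr=1e-3):
        """Fit only a linear classifier on frozen features from one layer;
        sweeping which layer you probe is how you discover that some layers
        make far better embeddings than others."""
        clf = nn.Linear(features.shape[1], n_classes)
        opt = torch.optim.Adam(clf.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(clf(features), labels)
            loss.backward()
            opt.step()
        return clf

    # features: [N, d] float tensor of pooled activations from layer L of the
    # frozen generative model; labels: [N] long tensor of class indices.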

Comment by gwern on Image GPT · 2020-06-21T14:01:50.994Z · score: 2 (1 votes) · LW · GW

AI Dungeon definitely uses GPT-3. Look at their video+blurb on the Beta page, and note the updates page mentions "Double the Memory!: AI Dungeon has double the memory! If you didn't hear, we recently upgraded our AI. With that upgrade the AI can now remember twice as much!" (there is no GPT-2 with a context window of 2048). I've also discussed this with Walton. I don't know why people find it so hard to believe that maybe a tiny startup doesn't update every last piece of documentation instantaneously.

Comment by gwern on God and Moses have a chat · 2020-06-18T20:44:09.130Z · score: 4 (2 votes) · LW · GW

A certain chapter of MoR comes to mind. Or as Kripkenstein might ask, how do you know you didn't hallucinate the memory of reading a letter from the editor of the journal confirming acceptance of your paper?

Comment by gwern on Image GPT · 2020-06-18T17:21:38.369Z · score: 20 (11 votes) · LW · GW

Not necessarily. There is no gene which hardcodes a convolutional kernel into the brain which we can look at and say, 'ah yes, the brain is implementing a convolution, and nothing else'. Attention mechanisms for images learn convolution-like patterns (just more flexibly, and not pre-hardwired): to the extent that convolutions are powerful because they learn things like spatial locality (which is obviously true & useful), we would expect any more general learning algorithm to also learn similar patterns and look convolution-like.

Comment by gwern on God and Moses have a chat · 2020-06-17T21:56:14.087Z · score: 6 (3 votes) · LW · GW

Yvain's "The Last Temptation of Christ":

Jesus looked at the Devil. He looked at the syringe. He looked at All The Kingdoms Of The World. He looked back at the Devil. His brow furrowed in thought. He looked at the syringe again.

Then his eyes shone as the Holy Spirit flowed through him. His indecision vanished. "Your lies have no power over me, demon," he told his tormentor.

"Please calm down," said Satan, only now he spoke with the voice of a middle-aged woman. "We're just trying to help you, Mr. Anderson. Please just hold still and let me give you your medication."

"Get thee behind me, Satan!" shouted the Christ, and he pushed the Devil off the mountain. Satan screamed as he plummeted, screamed with a woman's voice, until he vanished from sight in the depths below.

Comment by gwern on Status-Regulating Emotions · 2020-06-14T16:48:27.157Z · score: 2 (1 votes) · LW · GW

There have been several instances where I felt anger at someone for being successful immediately before reading their stuff, but then performing a perfect turnaround and deciding they're high status and deserve even more success. I distinctly remember this happening with Scott Alexander and SSC.

I've been amused to notice this happening with me too, particularly as Said Achmiz steadily upgrades gwern.net's appearance & features and I integrate them into my writing. While most people have nothing but praise for the design and things like the popup annotations, there seem to be a handful of people for whom the better it gets, the angrier they get.

An example from a week or two ago: my "Ordinary Life Improvements" essay was linked on Reddit. It's not a complex or intimidating article: there's no math or self-experiments or statistics; it is just a list of uncontroversial ordinary things that anyone my age or older knows from personal experience if they take a moment to think about it (and which, if they are doubtful, they can easily find citations for; I provide a bunch anyway). You do not need to be any kind of credentialed expert to make a list of items like "you can now buy preserved guacamole", and it requires zero special expertise to verify most of the items as I did not select exotic items, so the usual good reasons for credentialism don't apply. It is just a fun thought-provoking read which is also rather nicely formatted & pretty.

But nevertheless, one respondent's reaction was to be extremely angry! They seem to have not read it at all, since they didn't criticize any of the specific points; instead they apparently went straight to the sidebar and about-me page (for character assassination?), and got angry when they couldn't find anything about what college degrees I have (apparently they assume I have none, since I didn't specify), and got even angrier when they read the section about what sort of software/computers I used. (They weren't angry about me using an AMD CPU instead of an Intel one or anything, just that the section was there at all.)

So, that's a thing.

Comment by gwern on List of public predictions of what GPT-X can or can't do? · 2020-06-14T16:27:19.687Z · score: 24 (10 votes) · LW · GW

You should avoid using TTT. I've seen a number of examples where someone used TTT to show GPT-2 'can't do something' and it worked fine when I tried locally. I'm not sure if TTT has reverted to a small model to save money or is using bad sampling or what; it also seems to really gimp the context.

Fortunately for you, testing out scrambling on my local GPT-2-1.5b with a context window fully packed with ~213 scramble examples (I generated them at random from my /usr/share/dict/words) shows zero sign of unscrambling ability. In this example, it should produce the easy unscramble of 'hashes', but it fails to do so and also fails to produce further scramble examples or indeed show any sign of understanding scrambles:

Prompt: "olsheleh'l=hellhole's;syutf=fusty;uuntabryelt=unutterably;yMnIctre=McIntyre;incvees=evinces;ezastilwCu=Clausewitz;lsptasah=asphalts;bnsg'iluila=bilingual's;mhoroarG=Gomorrah;uhtianbiato=habituation;aoigi'csnl=logician's;isliaynilitbov'=inviolability's;emrnrPegi=Preminger;hub=hub;sneov=ovens;oaioB'esnt=Boeotian's;htoetsasu=southeast;lgbraolu=globular;luGaetmsaan=Guatemalans;rdseecno=encoders;kehaner=hearken;ifeifr=iffier;eaFwks's=Fawkes's;siscote=cosiest;pSnairad=Spaniard;dasre=dares;yigsosp=gossipy;ardep=raped;ciolsuetid=solicitude;uudtcrsnutre=unstructured;ae'brsh=rehab's;thn'asE=Ethan's;tenicnilfg=inflecting;eciantn=ancient;c'slaredan=calendar's;a'Erlestc=Electra's;eesplrdutt=spluttered;oneDn=Donne;gte'hrtaohftus=afterthought's;hringscu=crushing;'wlosrehesssnts=worthlessness's;lolieemddbwes=disembowelled;sreJyes=Jerseys;iefezrns=frenzies;snr'ased=sander's;oegerusstm=gruesomest;gligyg=giggly;rhneocv=chevron;qruiouest=turquoise;'tMcshlile=Mitchell's;iuorgntunn=outrunning;lknba=blank;erars=rears;utrmble=tumbler;otadeurg=outraged;le'syoMd=Melody's;hsep'rpnio=hornpipe's;swhymoa=haymows;cz'luhtsS=Schultz's;lvsnraeed=lavenders;sdietvesar=advertises;samena=seaman;eemrros=remorse;hiaSfr=Sharif;ectunssonical=consultancies;aetspls=pastels;rsrkmuckae=muckrakers;tligluses=guiltless;s'siiennilsbiyt=insensibility's;ha=ah;sersisdta=disasters;uyiodols=odiously;Swa'ilihs=Swahili's;ruvAaedy=Ayurveda;itpsicek=pickiest;ntnsaece'=canteen's;loopyr=poorly;slusurot=lustrous;ldhraay=halyard;saldr'eo=ordeal's;np'Usjho=Upjohn's;osaiiitnnngtr=transitioning;eril=lire;ndaceos=deacons;setmlnmehl'ebis=embellishment's;fodcmortsi=discomfort;raflagaTr=Trafalgar;ostc'kigns=stocking's;fg'ans=fang's;cnaioofa'sid=aficionado's;asanicnbl=cannibals;sterkw=twerks;itnsercafs=craftiness;siiSs'ent=Sistine's;gnos'b=bong's;rstuoins'in=intrusion's;uantesnf=unfasten;adntilreatnmetpre=interdepartmental;qeybous's=obsequy's;nrsiorpse=prisoners;nblcaek=blacken;btlisuah=halibuts;s'yaj=jay's;gthsihrrbit=birthrights;uzpgiznl=puzzling;dbrnuinw=windburn;no'iceiavstirf=verification's;rsuolniyu=ruinously;kiektsccbsla'=stickleback's;nsopunsioono=nonpoisonous;osubreetoml=troublesome;hubsl=blush;wsordorssc=crosswords;dowhnwos=showdown;ddwwairn=windward;knvgnoico=convoking;gM=Mg;rrsiepe=reprise;ebonerr'yssby=boysenberry's;enmdialpt=implanted;tnauuiftloc=fluctuation;snstilneeai=inessential;euimp'snescvlsos=compulsiveness's;prtisa=rapist;ckeidk=kicked;itsefhis=fishiest;bpeyssalmh'=blasphemy's;isilme=simile;ditmi=timid;cgnreocruir=reoccurring;eemazc=eczema;rastosncimit=romanticists;irsdgle'=girdle's;fumsalhe=shameful;'ikrsE=Erik's;ooapltni=optional;tnynietrcua=uncertainty;oiomtrsze=motorizes;reicitra=criteria;ibalrsmane=lamebrains;reePndt'iss=President's;tutsoehlonb=buttonholes;mnreiat=raiment;rureib=rubier;s'ipgtnra=parting's;rsshpoehlopi=philosophers;emrilW=Wilmer;ckeroo=cooker;darbeetswe's=sweetbread's;siesdoif=ossified;srst'oF=Frost's;dseolvo'rh=holdover's;nrmsumbeao=membranous;e'rgdsdre=dredger's;siaiuglireetrr=irregularities;firra=friar;ieydcrtlu=credulity;eCra'smhsb=Chambers's;seoirgitnan=resignation;sngul=slung;hurartUq=Urquhart;canseevg=scavenge;cscabakkp=backpacks;'arrmasaM=Marmara's;glileyta=legality;rqneaantiu=quarantine;sseelhhslif=shellfishes;rseebrivd=riverbeds;laaeftyrimivf=affirmatively;lpoos=loops;iorclsisot=solicitors;sityrlse=sisterly;ghue=huge;asnagla=lasagna;ehdeaofr=forehead;goMo=Moog;itrncasoreimin=recriminations;aasnlem'mo=melanoma's;etpepirza=appetizer;arsc'er=racer's;trmsou'=tumor's;krwacetba=backwater;nyvibrliaa=invaria
bly;dutbacs=abducts;oclukn=unlock;iednal=nailed;estinrac=scantier;ilat=alit;mntialstiou=mutilations;amsnAle=Ameslan;inL=Lin;eissridfe=firesides;eplstee=steeple;srssiet=sisters;ndxoesasb=sandboxes;irtwssea=waitress;olyotmnunsoo=monotonously;lhytapeet=telepathy;eSkawrla'sc=Slackware's;wierodsw=widowers;ojts=jots;egsen=genes;uslelpto=pollutes;slfo'af=offal's;yastlstaciitl=statistically;hthcymylilra=rhythmically;ouaid=audio;'dosc=doc's;xev=vex;gdiweinl=wielding;pluredyet=reputedly;es'edwt=tweed's;gnigresin=resigning;edaislr=derails;hhssae="
======================================== SAMPLE 1 ========================================
[...]hhssae=as;ic';ou'. Like us=heir= em=hs;stupid=ac.sle]heir':hir;he;id=as'sy;' som= they scrambled=ir;ñ'mst' al'fris'us''iss;l looked to make; evly;ia=res'fr--ray;uac;ch; our; unlike= learned=isa; so'cause of other name;irs' forself; he;hir; shun;hir=asi'atically odd='. he-clava red=se; heat;we' heir'hir, he;taken=', Iv=his;ks=heir; he hared=ber; that'sir= to feel as']dis; he=I think 'ber; off' trembling; that's they, coward'ir; evils'ir=familiar': they are that's'ir;str:.'me'em; he-hem;L's thing'we'ig; no, so' because of their;ils; afraid vs; over; m' we have all; da'ag;he, may be; we understand;ra' he'it; was;he; he'cause of the sort of he' what's, they' the striking' He' be; it makes. be; it's' he is' d:h; it was, he'cause; an'cause= ves' was to' sc; he feels' he' imm' he' understand' thing= have; he' different', obviously. be' havey= he' become; he' took; it is a.' is cause mistake is very' great. it's as I was- because he' they or how; they are all we have; because after; sc; he; man' it was; of he' were not being going; he' he is, he' cause; was happen; it was because we can; his cause of brav' that of; he' to; since' we fear of. heat, that and his, because of, we; we are like I can' try; his is caused,' be because he' cause; it was, he's' he' to be, it because. it is a cause, we are, we see. we are; as long and mistake' and he' grain' because we are contr;
================================================================================

I ran this a dozen times due to sampling variation, and it's all pretty much the same, just gibberish completions, as GPT-2-1.5b clearly has no idea what to do with this apparently gibberish input.

Comment by gwern on Assessing Kurzweil predictions about 2019: the results · 2020-06-13T01:07:39.844Z · score: 5 (3 votes) · LW · GW

for years I think we've been in a state where people could have taken off-the-shelf method A and done something interesting with it on a huge music dataset

Absolutely. I got decent enough results just tinkering with GPT-2, and OpenAI's Jukebox could have been done at smaller scale years ago, and OA could presumably do a lot better right now if they had a few million to spare (Jukebox has only ~7b parameters, while GPT-3 has 175b, and Jukebox is pretty close to human-level so just another 10x seems like it'd make it an extremely useful tool commercially).

Comment by gwern on GAN Discriminators Don't Generalize? · 2020-06-09T21:54:56.958Z · score: 2 (1 votes) · LW · GW

My naive thought was that you'd have to use differentiable augmenters to fit in generator augmentation.

I believe the data augmentations in question are all differentiable, so you can backprop from the augmented images to G. (Which is not to say they are easy: the reason that Zhao et al 2020 came out before we got SimCLR working on our own BigGAN is that lucidrains & Shawn Presser got SimCLR working - we think - except it only works on GPUs, which we don't have enough of to train BigGAN on, and on TPU CPUs it memory-leaks. Very frustrating, especially now that Zhao shows that SimCLR would have worked for us.)
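
A minimal PyTorch sketch (not our actual BigGAN/SimCLR code) of what 'backprop from the augmented images to G' means in practice: both real and generated images pass through differentiable augmentations before D, so the generator's gradient flows through the augmentation:

    import torch
    import torch.nn.functional as F

    def diff_augment(x):
        """Differentiable augmentations: random brightness shift plus random
        translation via pad-and-crop. Both are differentiable in x, so
        gradients can flow through them back to G."""
        b, c, h, w = x.shape
        x = x + (torch.rand(b, 1, 1, 1, device=x.device) - 0.5)   # brightness
        shift = h // 8
        dy, dx = torch.randint(-shift, shift + 1, (2,)).tolist()  # translation
        x = F.pad(x, [shift, shift, shift, shift])
        return x[:, :, shift + dy:shift + dy + h, shift + dx:shift + dx + w]

    def gan_step(G, D, real, z, opt_g, opt_d):
        """One training step in which D only ever sees augmented images
        (roughly the Zhao et al 2020 recipe, with a non-saturating logistic loss)."""
        fake = G(z)
        # Discriminator step:
        d_loss = F.softplus(-D(diff_augment(real))).mean() + \
                 F.softplus(D(diff_augment(fake.detach()))).mean()
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # Generator step: the gradient path is D <- augment <- fake <- G.
        g_loss = F.softplus(-D(diff_augment(fake))).mean()
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()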

I'm averse to using Twitter, but I will consider being motivated enough to sign-up and ask.

I assume he has email; he also hangs out on our Discord and answers questions from time to time.

I think it's decently likely I'm confused here.

It's definitely a confusing topic. Most GAN researchers seem to sort of shrug and... something something the Nash equilibrium minimizes the Jensen–Shannon divergence something something converges with decreasing learning rate in the limit, well, it works in practice, OK? Nothing like likelihood or VAE or flow-based models, that's for sure. (On the other hand, nobody's ever trained those on something like JFT-300M, and the compute requirements for something like OpenAI Jukebox are hilarious - what is it, 17 hours on a V100 to generate a minute of audio?)

Comment by gwern on GAN Discriminators Don't Generalize? · 2020-06-09T02:23:59.573Z · score: 21 (8 votes) · LW · GW

These are good questions, and some of the points that suggest we don't really understand what GANs do or why they work. They are something I've previously highlighted in my writeups: https://www.gwern.net/Faces#discriminator-ranking * & https://github.com/tensorfork/tensorfork/issues/28 respectively.

The D memorization is particularly puzzling when you look at improvements to GANs, most recently, BigGAN got (fixed) data augmentation & SimCLR losses: one can understand why spatial distortions & SimCLR might help D under the naive theory that D learns realism and structure of real images to penalize errors by G, but then how do we explain chance guessing on ImageNet validation...?

Further, how do we explain the JFT-300M stability either, given that it seems unlikely that D is 'memorizing datapoints' when the batch sizes would suggest that the JFT-300M runs in question may be running only 4 or 5 epochs at most? (mooch generally runs minibatches of at most n=2048, so even 500k iterations is only ~3.4 epochs.)

Note that the discriminator has far fewer parameters than there are bytes to memorize, so it necessarily is performing some sort of (lossy) compression to do well on the training set.

Eh. "compression" is not a helpful concept here because every single generative model trained in any way is "compressing". (Someone once put up a website for using GPT-2 as a text compressor, because any model that emits likelihoods conditional on a history can be plugged into an arithmetic encoder and is immediately a lossless compressor/decompressor.)
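
A toy illustration of that point (exact arithmetic coding over a tiny alphabet; next_token_probs is a stand-in for any model that returns conditional probabilities given the history):

    from fractions import Fraction

    def arithmetic_encode(tokens, next_token_probs):
        """Narrow the interval [low, high) by each token's conditional
        probability under the model; any number inside the final interval
        identifies the sequence, and a better predictive model yields a
        narrower interval, i.e. fewer bits."""
        low, high = Fraction(0), Fraction(1)
        history = []
        for tok in tokens:
            probs = next_token_probs(history)   # dict: token -> probability
            span, cum = high - low, Fraction(0)
            for t in sorted(probs):             # fixed symbol order for decodability
                if t == tok:
                    high = low + span * (cum + Fraction(probs[t]))
                    low = low + span * cum
                    break
                cum += Fraction(probs[t])
            history.append(tok)
        return low, high

    # A deliberately dumb "model" that always predicts uniformly over {'a', 'b'}:
    uniform = lambda history: {"a": 0.5, "b": 0.5}
    print(arithmetic_encode(list("abba"), uniform))   # interval of width 1/16, i.e. 4 bits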

Based on some other papers I don't have handy now, I've hand-waved that perhaps what a GAN's D does is it learns fuzzy patterns in image-space 'around' each real datapoint, and G spirals around each point, trying to approach it and collapse down to emitting the exact datapoint, but is repelled by D; as training progresses, D repels G from increasingly smaller regions around each datapoint. Because G spends its time traversing the image manifold and neural networks are biased towards simplicity, G inadvertently learns a generalizable generative model, even though it 'wants' to do nothing but memorize & spit out the original data (as the most certain Nash equilibrium way to defeat the D - obviously, D cannot possibly discriminate beyond 50-50 if given two identical copies of a real image). This is similar to the view of decision forests and neural networks as adaptive nearest-neighbor interpolators.

They don’t mention whether this also increases discriminator generalization or decreases training set accuracy, which I’d be interested to know.

mooch is pretty good about answering questions. You can ask him on Twitter. (I would bet the answer is probably that the equivalent test was not done on the JFT-300M models. His writeup is very thorough and I would expect him to have mentioned it if that had been done; in general, my impression is that the JFT-300M runs were done with very little time to spare and not nearly as thoroughly, since he spent all his time trying tweaks on BigGAN to get it to work at all.)

* One caveat I haven't had time to update my writeup with: I found that D ranking worked in a weird way which I interpreted as consistent with D memorization; however, I was recently informed that I had implemented it wrong and it works much better when fixed; but on the gripping hand, they find that the D ranking still doesn't really match up with 'realism' so maybe my error didn't matter too much.

Comment by gwern on May Gwern.net newsletter (w/GPT-3 commentary) · 2020-06-06T01:40:22.455Z · score: 9 (5 votes) · LW · GW

How's your Haskell and shell...? No, I wouldn't particularly recommend it as 'my first blog'. It does a lot of complicated things which are overkill in many ways and not implemented in a very user-friendly manner. If you still want to get started, there's a fork from about a year ago which is at least easier to browse/understand: https://github.com/shawwn/wiki

Comment by gwern on May Gwern.net newsletter (w/GPT-3 commentary) · 2020-06-05T13:46:15.418Z · score: 4 (3 votes) · LW · GW

Yes, if you can figure it out.

Comment by gwern on [AN #102]: Meta learning by GPT-3, and a list of full proposals for AI alignment · 2020-06-04T23:57:04.913Z · score: 12 (3 votes) · LW · GW

Or is it learning how to imitate doing addition? (And the training set had people making those mistakes so it copies them.)

The arithmetic section says that they checked the corpus and found that almost none of the arithmetic problems they tested appear in it:

Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000 subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers could have been memorized.

If it's not memorizing/learning from explicit examples (but mostly from numbers used in normal ways and analytic writing), there can hardly be many explicit examples of simple arithmetic which are wrong, either, so 'imitating errors' seems unlikely.
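
The memorization check itself is simple to reproduce in spirit; a sketch (hypothetical corpus handling and query format, not the paper's actual code):

    import random

    def contamination_rate(problems, corpus_text):
        """Fraction of held-out arithmetic problems appearing verbatim in the
        training text; the GPT-3 paper reports ~0.8% for addition and ~0.1%
        for subtraction."""
        hits = sum(1 for p in problems if p in corpus_text)
        return hits / len(problems)

    problems = [f"{random.randint(0, 99)} + {random.randint(0, 99)} ="
                for _ in range(2000)]
    # corpus_text would be the (sharded) training corpus loaded as text.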

Comment by gwern on Visual Babble and Prune · 2020-06-04T23:32:22.911Z · score: 12 (5 votes) · LW · GW

Arnheim mirrors: https://www.gwern.net/docs/iq/1965-walkup.pdf https://www.gwern.net/docs/psychology/1969-arnheim-visualthinking.pdf https://www.gwern.net/docs/psychology/1980-arnheim.pdf

It's worth noting that an emphasis on spatial reasoning & visualization ability was one of the reasons behind SMPY choosing to use SAT-M for screening, as one of the theses is that, after general intelligence, visuospatial reasoning (as opposed to the more academically-prized glibness & verbal ability) may be the next most important requirement for major STEM achievement.

Comment by gwern on May Gwern.net newsletter (w/GPT-3 commentary) · 2020-06-04T18:35:58.391Z · score: 3 (2 votes) · LW · GW

There's a lukeprog-style 'anonymous feedback' link at the bottom of each page. Unfortunately, that's a known bug without a very easy fix (apparently invert filters in CSS, despite sounding easy, are unexpectedly slow, making it easy to display the original white before being inverted to dark, and Said Achmiz hasn't had the spare time/energy to fix it).

The persistency of constant edge-cases and bugs in the dark mode has been frustrating enough that I've considered removing it as not worth the hassle; there often appears to be no way to win, starting with how you turn dark mode on at all: if you use the recommended CSS approach, then users who forget they enabled dark mode in their OS will complain about the site being in dark mode; if you use special JS theme-switchers to let users opt into it and store their setting in a cookie (our current approach), they will instead complain about the flash of white styling before the site finishes loading!

Comment by gwern on GPT-3: a disappointing paper · 2020-05-30T23:52:38.491Z · score: -17 (23 votes) · LW · GW

It sounds like you think I’m nitpicking relatively minor points while ignoring the main significance of the paper. What do you think that main significance is?

The paper has an abstract and clearly-written discussions, which you presumably read. I know that you know perfectly well what the implications of the scaling curves and meta-learning are for AI risk and OA's AGI research programme. That your response is to feign Socratic ignorance and sealion me here, disingenuously asking, 'gosh, I just don't know, gwern, what does this paper show other than a mix of SOTA and non-SOTA performance, I am but a humble ML practitioner plying my usual craft of training and finetuning', shows what extreme bad faith you are arguing in, and it is, sir, bullshit and I will have none of it.

If you think that this does not show what it shows about DL scaling and meta-learning, have the guts to say so, don't meander around complaining about which of dozens of benchmarks you thought a nearly 100-page paper should've talked more about and then retreating to feigned ignorance when challenged.

Comment by gwern on OpenAI announces GPT-3 · 2020-05-30T02:26:38.122Z · score: 23 (9 votes) · LW · GW

With #3, I think you fell into the trap of being overly specific and overly committed to a particular organizational strategy. It would be very reasonable to assume that OA would be working on multimodal models, because you need that for efficiency & generalization & the ability to do things like control a robot arm from text instructions, and indeed, I quote TR about how they are working hard on large multimodal self-supervised Transformers... but you assumed that would have to be "GPT-3", instead of a parallel project while GPT-3 winds up being a scaled-up GPT-2. It would have made more sense to split the predictions and be agnostic about whether OA would do 2 big models or attempt 1 multimodal model, since the multimodal work might not mature in time (as seems to be the case), and to instead predict end outcomes like "human-level text article generation" or "models with >100b parameters", since there are many possible routes to relatively few outcomes of interest.

Comment by gwern on OpenAI announces GPT-3 · 2020-05-29T21:34:09.132Z · score: 5 (3 votes) · LW · GW

Doesn't seem too hard. Here's a DM example tweeted about today: https://arxiv.org/abs/2005.09382 (videos).

Comment by gwern on GPT-3: a disappointing paper · 2020-05-29T19:53:40.573Z · score: 12 (16 votes) · LW · GW

Yeah, this post seems mostly focused on complaints like: they didn't include enough variants of k (despite half the graphs in the paper being about zero/one/few-shot, which are all different k, and despite his having missed graphs in the first place from skimming), and they didn't dig enough into the handful of tasks Nostalgebraist is interested in, because he finds all the usual benchmarks oh so tedious. (Truly devastating criticisms. "What have the Romans ever given us?"...) When you boil it all down, Nostalgebraist is basically Reviewer #3.

Comment by gwern on OpenAI announces GPT-3 · 2020-05-29T18:28:04.400Z · score: 23 (10 votes) · LW · GW

The obvious thing to do here is to plug it into a DRL agent. Something like learning from instructions or from game manuals: Nethack was recently packaged up, so imagine finetuning GPT-3 on the Nethack wiki and then providing text embeddings from GPT-3 to MuZero or Agent57 etc. https://www.technologyreview.com/2020/02/17/844721/ai-openai-moonshot-elon-musk-sam-altman-greg-brockman-messy-secretive-reality/

One of the biggest secrets is the project OpenAI is working on next. Sources described it to me as the culmination of its previous four years of research: an AI system trained on images, text, and other data using massive computational resources. A small team has been assigned to the initial effort, with an expectation that other teams, along with their work, will eventually fold in. On the day it was announced at an all-company meeting, interns weren’t allowed to attend. People familiar with the plan offer an explanation: the leadership thinks this is the most promising way to reach AGI.

As Schmidhuber put it: "one model to rule them all". Cross-modal learning ought to be much more efficient and give even more human-like reasoning (e.g. https://arxiv.org/abs/1912.02315). GPT-3 is a text-only self-supervised world-model; being unimodal (so no visual transfer from SimCLR or other recent highly-successful image self-supervision) and not benefiting from any RL loops, it has a lot of weaknesses, but it's a start.

Between the continued scaling, how scaling/pretraining produces ever more human-like systems in terms of performance/adversarial-examples, cross-modal learning, transfer learning working in RL, self-supervised learning suddenly crushing it, the potential of brain imitation learning, the next decade is very exciting indeed (contra predictions that DL will surely top out any time - real soon now, just you wait and see). One can easily imagine a multi-headed architecture where a multimodal GPT-3-like module, trained by self-supervised learning on large text and image and video datasets (like VideoBERT), feeds into a trunk with modules for ALE, DMLab, Dactyl robot arm etc, doing per-task MuZero-style policy-learning+planning, collecting new experience which is fed back into the self-supervised model, enabling it to do zero-shot tasks in games or robotics or text generation from video or text inputs, learning extremely sample-efficiently (and the more so the more tasks it trains on)...
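
To make that architecture sketch concrete, here is a toy PyTorch-style sketch of my own; all module names, sizes, and wiring are hypothetical illustrations, not any actual OA/DM system:

```python
import torch
import torch.nn as nn

class MultimodalAgent(nn.Module):
    """Toy sketch: frozen self-supervised text & vision modules feeding a shared
    trunk with per-task policy/value heads. Sizes and names are made up."""
    def __init__(self, text_encoder, image_encoder, d_model=512, n_actions=18):
        super().__init__()
        self.text_encoder = text_encoder    # e.g. a pretrained GPT-like model, frozen
        self.image_encoder = image_encoder  # e.g. a pretrained SimCLR-like model, frozen
        self.trunk = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
        )
        self.policy_head = nn.Linear(d_model, n_actions)  # one per-task head (ALE here)
        self.value_head = nn.Linear(d_model, 1)

    def forward(self, instruction_tokens, frame):
        with torch.no_grad():  # the pretrained world-model modules stay frozen
            text = self.text_encoder(instruction_tokens)   # -> (batch, d_model)
            vision = self.image_encoder(frame)             # -> (batch, d_model)
        h = self.trunk(torch.cat([text, vision], dim=-1))
        return self.policy_head(h), self.value_head(h)
```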

We are increasingly limited mostly by researchers' ability to actually write and tweak and integrate these darn things.

Comment by gwern on Implications of GPT-2 · 2020-05-29T02:25:51.112Z · score: 2 (1 votes) · LW · GW

GPT-3 can do arithmetic with zero arithmetic training: https://arxiv.org/pdf/2005.14165.pdf#page=21

First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction, GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2 digit addition, 98.9% at 2 digit subtraction, 80.2% at 3 digit addition, and 94.2% at 3-digit subtraction. Performance decreases as the number of digits increases, but GPT-3 still achieves 25-26% accuracy on four digit operations and 9-10% accuracy on 5-digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves 29.2% accuracy at 2 digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves 21.3% accuracy at single digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness beyond just single operations. As Figure 3.10 makes clear, small models do poorly on all of these tasks – even the 13 billion parameter model (the second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all other operations less than 10% of the time.
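
For readers unfamiliar with the few-shot setup, this is roughly what such a prompt looks like; an illustration of my own, not the paper's exact formatting:

```python
# A few-shot arithmetic prompt: K worked examples followed by the query;
# the model is simply asked to continue the text. (Illustrative only.)
examples = [(48, 31), (17, 66), (95, 12)]
query = (23, 59)

prompt = "\n".join(f"Q: What is {a} plus {b}? A: {a + b}" for a, b in examples)
prompt += f"\nQ: What is {query[0]} plus {query[1]}? A:"
print(prompt)
# The model's continuation (" 82") is then compared against the true answer.
```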

Comment by gwern on Why aren’t we testing general intelligence distribution? · 2020-05-26T18:25:01.616Z · score: 12 (4 votes) · LW · GW

Even “The Bell Curve” used an intelligence test that didn’t return a bell curve at all. They said they had to turn it into a bell curve to prevent 'skew'

Isn't that just a ceiling effect? They have roughly 100 items, and the AFQT is designed to screen out the dumb (to avoid recruiting them into the military), not measure the normal or the brilliant. So the right tail gets chopped off and heaped (everyone answering 90-100 questions right; remember that there's always random error in answering individual questions), even though the rest looks just like a normal distribution.
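
A quick simulation shows the effect (illustrative parameters of my own, not the actual AFQT item statistics):

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 100_000, 100

# Latent ability is normal; each person answers each item correctly with a
# probability depending on ability. The test is easy (built to screen out the
# low end), so the right tail piles up at the maximum score.
ability = rng.normal(0, 1, n_people)
p_correct = 1 / (1 + np.exp(-(ability + 1.5)))   # easy items: high base rate
scores = rng.binomial(n_items, p_correct)

print((scores >= 95).mean())                       # a heap of near-perfect scores
print(np.percentile(scores, [5, 25, 50, 75, 95]))  # the lower half still looks ~normal
```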

Comment by gwern on Post-crash market efficiency 1696-2020 · 2020-05-25T20:56:12.386Z · score: 2 (1 votes) · LW · GW

Is the FTSE data here based on commodities? I was surprised it went so far back, and it seems to derive from the London Stock Exchange, but WP says

At that coffee house, a broker named John Castaing started listing the prices of a few commodities, such as salt, coal, and paper, and exchange rates in 1698. Originally, this was not a daily list and was only published a few days of the week.[6]

If he only started listing commodity prices in 1698, then what is being indexed years earlier, in 1692, for the crash?

Comment by gwern on Is there a user's manual to using the internet more efficiently? · 2020-05-24T17:22:18.414Z · score: 4 (3 votes) · LW · GW

One part I'd highlight is the case-studies section: https://www.gwern.net/Search#case-studies

Comment by gwern on Source code size vs learned model size in ML and in humans? · 2020-05-21T17:50:33.768Z · score: 17 (4 votes) · LW · GW

I don't think any of the AG-related papers specify the on-disk size of the model; they may specify the total # of parameters somewhere, but if so, I don't recall it offhand. It should be possible to estimate it from the described model architecture by multiplying out all of the convolutions by strides/channels/etc, but that's pretty tricky and easy to get wrong.

When R. J. Lipton asked the same question on his blog, I once loosely estimated from the described architecture that the AZ model is probably somewhere around ~300MB. So, large, but not unusually so.
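
The back-of-the-envelope calculation is just multiplying out layer sizes; a sketch with plausible-but-unofficial AlphaZero-style hyperparameters (3x3 convs, 256 channels, 40 residual blocks, ignoring the input conv, batch-norms, and policy/value heads):

```python
# Rough parameter count for an AlphaZero-style ResNet body.
# Hyperparameters here are illustrative, not the official ones.
channels, blocks, k = 256, 40, 3

conv_params = k * k * channels * channels   # one 3x3 conv layer: ~590K
block_params = 2 * conv_params              # two convs per residual block
total = blocks * block_params               # body only, no heads

print(f"{total:,} parameters")                  # ~47.2M
print(f"~{total * 4 / 1e6:.0f} MB at float32")  # ~190 MB before heads etc.
```

With more blocks, the heads, or a checkpoint that also stores optimizer state, a few hundred MB is easy to reach, which is consistent with the rough ~300MB guess; the exact figure depends heavily on those details.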

However, as I point out, if you are interested in interpreting that in an information-theoretic sense, you have to ask whether model compression/distillation/sparsification is relevant. The question of why NNs are so overparameterized, aside from being extremely important to AI risk and the hardware overhang question, is a pretty interesting one. There is an enormous literature (some of which I link here) showing an extreme range of size decreases/speed increases, with 10x being common and 100x not impossible, depending on details like how much accuracy you want to give up. (For AZ, you could probably get 10x with no visible impact on Elo, but if you were willing to search another ply or two at runtime, perhaps you could get another order of magnitude? It's a tradeoff: the bigger the model, the higher the value-function accuracy & the less search it needs to achieve a target Elo strength.)

But is that fair? After all, you can't learn that small neural network in the first place except by first passing through the very large one (as far as anyone knows). Similarly, with DNA, you have enormous ranges of genome sizes for no apparent good reason even among closely related species, and viruses demonstrate that you can get absurd compression out of DNA by overlapping genes or reading them backwards (among other insane tricks); but such minified genomes may be quite fragile, while junk DNA and chromosomal or whole-genome duplications often lead to big genetic changes and adaptations and speciations, so all that fat may be serving evolvability or robustness purposes. Like NNs, maybe you can only get that hyper-specialized efficient genome after passing through a much larger overparameterized genome. (Viruses, then, may get away with such tiny genomes by optimizing for relatively narrow tasks, applying extraordinary replication & mutation rates, and outsourcing as much as they can to regular cells or other viruses or other copies of themselves, like 'multipartite viruses'. And even then, some viruses have huge genomes.) https://slatestarcodex.com/2020/05/12/studies-on-slack/ and https://www.gwern.net/Backstop and https://www.gwern.net/Hydrocephalus might be relevant reading here.

Comment by gwern on The Decline Effect and the Scientific Method [link] · 2020-05-16T14:15:57.953Z · score: 2 (1 votes) · LW · GW

It is not Feynman, but we seem to have a confirmed anecdote of a similar problem in rat experiments. From "Shortcut Learning in Deep Neural Networks", Geirhos et al 2020:

2.1 Shortcut learning in Comparative Psychology: unintended cue learning

Rats learned to navigate a complex maze apparently based on subtle colour differences—very surprising given that the rat retina has only rudimentary machinery to support at best somewhat crude colour vision. Intensive investigation into this curious finding revealed that the rats had tricked the researchers: They did not use their visual system at all in the experiment and instead simply discriminated the colours by the odour of the colour paint used on the walls of the maze. Once smell was controlled for, the remarkable colour discrimination ability disappeared... [Nicholas Rawlins, personal communication with F.A.W. some time in the early 1990s, confirmed via email on 12.11.2019]

I asked on Twitter if Rawlins had seen this first hand or if it was secondhand, and the second author stated:

yes, the anecdote happened as described in Nicholas Rawlins' laboratory at Oxford, confirmed in personal communication with Felix Wichmann in Nov '19

Comment by gwern on Zoom Technologies, Inc. vs. the Efficient Markets Hypothesis · 2020-05-11T23:43:57.983Z · score: 11 (9 votes) · LW · GW

I think it is very relevant to note that you were up 50x at one point and then down to 10x on net after further decisions went sour, because

  1. those further decisions going sour show your decision-making was not that consistently good
  2. such high variance looks much more obviously like 'gambling' or 'taking on an enormous amount of risk' than 'it's fun and easy to seek out alpha and beat the market'

Comment by gwern on Zoom Technologies, Inc. vs. the Efficient Markets Hypothesis · 2020-05-11T18:25:16.840Z · score: 37 (19 votes) · LW · GW

Matt Levine covers these mispricing examples all the time in his newsletter, and more importantly, often follows them up. (Followup is important; I notice that Eliezer has not exactly gone around trumpeting Wei Dai's followup comment where he mentions losing almost all of his coronavirus profits as evidence that maybe the EMH is right after all; nor have any of the people who stated so confidently in late March that "obviously the stock market is going to crater much further, so much for the EMH!" followed this up and noted that if you had taken their advice to short stocks, you would have lost your shirt.)

The followup usually seems to be that you could only have made a tiny amount of money, once in a great while, typically measured in the low millions, and only if you owned a ton of the worthless stock in advance and timed your sale right; and that anyone buying in on the 'greater fool' theory is often left holding the bag, as they either buy the top by the time they hear of the spike, or the penny stock in question gets delisted entirely from the exchange due to the shenanigans (in which case you're in serious trouble). Not exactly a devastating counterexample.

Comment by gwern on Failures in technology forecasting? A reply to Ord and Yudkowsky · 2020-05-08T18:56:28.814Z · score: 30 (18 votes) · LW · GW

But it turns out that the technology didn’t just happen to be just about to be discovered in any case. Instead, there was a direct connection between the prediction and its undoing. In my view, that makes the prediction less “surprisingly” incorrect.[2]

? If you are trying to make the point that technology is unpredictable, an example of a 'direct connection' and backfiring is a great example, because it shows how fundamentally unpredictable things are: he could hardly have expected that his dismissal would spur an epochal discovery, and that seems extremely surprising; this supports Ord & Yudkowsky, it doesn't contradict them. And if you're trying to claim that forecasts systematically backfire, that's even more alarming than O/Y's claims, because it means that expert forecasts will not just make a nontrivial number of errors (enough to be an x-risk concern) but will be systematically inversely correlated with the risks, and the biggest risks will come from the things experts most confidently forecast not to be risks...

But the footnote also suggests to me that this may not have been a failure of forecasting at all, or only a minor one. Hearing that Fermi thought that something that ended up happening was only a “remote possibility” seems to suggest he was wildly off the mark. But if he actually thought the chance was 10%, perhaps he was “right” in some sense - e.g., perhaps he was well-calibrated - and this just happened to be one of the 1 in 10 times that a 10% likely outcome occurs.

So to summarize that case study criticism: everything you factchecked was accurate and you have no evidence of any kind that the Fermi story does not mean what O/Y interpret it as.

Furthermore, even if that was a genuine prediction Wright made at the time, it seems it was a prediction made briefly, once, during many years of working on a topic, and which wasn’t communicated publicly. Thus, even if it was a genuine prediction, it may have little bearing on the trustworthiness in general of publicly made forecasts about technological developments.

So to summarize that case study criticism: everything you factchecked was accurate and you have no evidence of any kind that the Wright story does not mean what O/Y interpret it as.

Let’s imagine that all of my above points turn out to be unfounded or unimportant

Of the 4 case studies you criticize: your criticism of the first actually supports O/Y, you agree the second is accurate, and for the third and fourth you offer only speculation and no actual criticism.

Comment by gwern on How uniform is the neocortex? · 2020-05-08T02:43:02.867Z · score: 10 (4 votes) · LW · GW

And MuZero, which beats AlphaZero and which does not use symbolic search over a simulator of board states but internal search over hidden state and value estimates?

Neural networks, on the other hand, are famously bad at symbolic reasoning tasks, which may ultimately have some basis in the fact that probability does not extend logic.

Considering all the progress on graph and relational networks and inference and theorem-proving and whatnot, this statement is giving a lot of hostages to fortune.

Comment by gwern on Stanford Encyclopedia of Philosophy on AI ethics and superintelligence · 2020-05-04T15:20:47.772Z · score: 5 (6 votes) · LW · GW

Most neural networks are trained for a particular task. They are typically useless for other tasks.

Er, transfer learning?

If you wanted to argue that neural networks show that intelligence is unidimensional, you'd want to go one level up and argue that the same architecture and training procedure works great across a wide variety of problems, even if the resulting neural nets don't seem to be comparable in intelligence terms.

Aside from text, image, audio, point clouds, graphs etc., what have the Romans^Wconvolutions and Transformers done for us lately? Or consider PPO, Impala, or MuZero in DRL.

This is one of those cases where a 2 hour machine learning tutorial beats weeks of philosophizing.

Literally the first lesson in the fast.ai ML tutorial is reusing ImageNet NNs to solve other classification tasks.
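
For instance, the core move is just swapping the classification head on a pretrained network; a minimal PyTorch/torchvision sketch of the same idea (fast.ai wraps this differently; API details here are from memory):

```python
import torch.nn as nn
from torchvision import models

# Reuse an ImageNet-pretrained ResNet for a new classification task:
# freeze the feature extractor, replace only the final fully-connected layer.
model = models.resnet50(pretrained=True)        # download ImageNet weights
for p in model.parameters():
    p.requires_grad = False                     # freeze the backbone
model.fc = nn.Linear(model.fc.in_features, 10)  # e.g. 10 new classes
# ...then train just model.fc on the new dataset (optionally unfreezing more layers later).
```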

Comment by gwern on crabman's Shortform · 2020-04-29T21:57:19.443Z · score: 2 (1 votes) · LW · GW

You're going to need to do more than that if you want full reproducibility, because GPUs aren't even deterministic in the first place, and that is big enough to affect DRL/DL results.
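
For reference, this is roughly the standard seeding boilerplate in PyTorch, and even it does not guarantee bit-identical GPU runs, because some CUDA/cuDNN kernels are non-deterministic unless explicitly forced (at a speed cost); the specific flags are from memory and may vary by version:

```python
import os
import random
import numpy as np
import torch

def seed_everything(seed: int = 0):
    """Seed the common RNGs. Even this does not guarantee bit-identical runs on
    GPU: some CUDA kernels (e.g. certain cuDNN convolutions, atomics-based
    reductions) remain non-deterministic unless forced, which costs speed."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # pick deterministic cuDNN algorithms
    torch.backends.cudnn.benchmark = False      # disable nondeterministic autotuning
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed by some deterministic cuBLAS ops
```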