Posts

AlphaFold 2 paper released: "Highly accurate protein structure prediction with AlphaFold", Jumper et al 2021 2021-07-15T19:27:20.584Z
May 2021 Gwern.net newsletter 2021-06-11T14:13:18.485Z
"Decision Transformer" (Tool AIs are secret Agent AIs) 2021-06-09T01:06:57.937Z
April 2021 Gwern.net newsletter 2021-06-03T15:13:29.138Z
gwern's Shortform 2021-04-24T21:39:14.128Z
March 2021 gwern.net newsletter 2021-04-06T14:06:20.198Z
February 2021 gwern.net newsletter 2021-03-13T14:57:54.645Z
January 2021 gwern.net newsletter 2021-02-04T20:12:39.555Z
December 2020 gwern.net links 2021-01-10T17:21:40.756Z
November 2020 gwern.net newsletter 2020-12-03T22:47:16.917Z
October 2020 gwern.net newsletter 2020-11-01T21:38:46.795Z
/r/MLScaling: new subreddit for NN scaling research/discussion 2020-10-30T20:50:25.973Z
"Scaling Laws for Autoregressive Generative Modeling", Henighan et al 2020 {OA} 2020-10-29T01:45:30.666Z
September 2020 gwern.net newsletter 2020-10-26T13:38:51.107Z
August 2020 gwern.net newsletter 2020-09-01T21:04:58.299Z
July 2020 gwern.net newsletter 2020-08-20T16:39:27.202Z
June 2020 gwern.net newsletter 2020-07-02T14:19:08.696Z
GPT-3 Fiction Samples 2020-06-25T16:12:05.422Z
May Gwern.net newsletter (w/GPT-3 commentary) 2020-06-02T15:40:37.155Z
OpenAI announces GPT-3 2020-05-29T01:49:04.855Z
"AI and Efficiency", OA (44✕ improvement in CNNs since 2012) 2020-05-05T16:32:20.335Z
April 2020 gwern.net newsletter 2020-05-01T20:47:44.867Z
March 2020 gwern.net newsletter 2020-04-03T02:16:02.871Z
February 2020 gwern.net newsletter 2020-03-04T19:05:16.079Z
January 2020 gwern.net newsletter 2020-01-31T18:04:21.945Z
Subscripting Typographic Convention For Citations/Dates/Sources/Evidentials: A Proposal 2020-01-08T22:20:20.290Z
Dec 2019 gwern.net newsletter 2020-01-04T20:48:48.788Z
Nov 2019 gwern.net newsletter 2019-12-02T21:16:04.846Z
October 2019 gwern.net newsletter 2019-11-14T20:26:34.236Z
September 2019 gwern.net newsletter 2019-10-04T16:44:43.147Z
"AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence", Clune 2019 2019-09-10T21:33:08.837Z
August 2019 gwern.net newsletter (popups.js demo) 2019-09-01T17:52:01.011Z
"Designing agent incentives to avoid reward tampering", DeepMind 2019-08-14T16:57:29.228Z
July 2019 gwern.net newsletter 2019-08-01T16:19:59.893Z
How Should We Critique Research? A Decision Perspective 2019-07-14T22:51:59.285Z
June 2019 gwern.net newsletter 2019-07-01T14:35:49.507Z
On Seeing Through 'On Seeing Through: A Unified Theory': A Unified Theory 2019-06-15T18:57:25.436Z
On Having Enough Socks 2019-06-13T15:15:21.946Z
May gwern.net newsletter 2019-06-01T17:25:11.740Z
"One Man's Modus Ponens Is Another Man's Modus Tollens" 2019-05-17T22:03:59.458Z
April 2019 gwern.net newsletter 2019-05-01T14:43:18.952Z
Recent updates to gwern.net (2017–2019) 2019-04-28T20:18:27.083Z
"Everything is Correlated": An Anthology of the Psychology Debate 2019-04-27T13:48:05.240Z
March 2019 gwern.net newsletter 2019-04-02T14:17:38.032Z
February gwern.net newsletter 2019-03-02T22:42:09.490Z
'This Waifu Does Not Exist': 100,000 StyleGAN & GPT-2 samples 2019-03-01T04:29:16.529Z
January 2019 gwern.net newsletter 2019-02-04T15:53:42.553Z
"Forecasting Transformative AI: An Expert Survey", Gruetzemacher et al 2019 2019-01-27T02:34:57.214Z
"AlphaStar: Mastering the Real-Time Strategy Game StarCraft II", DeepMind [won 10 of 11 games against human pros] 2019-01-24T20:49:01.350Z
Visualizing the power of multiple step selection processes in JS: Galton's bean machine 2019-01-12T17:58:34.584Z

Comments

Comment by gwern on [Book Review] "The Alignment Problem" by Brian Christian · 2021-09-26T00:27:26.653Z · LW · GW

So, how many third parties reported about the classification and how trustworthy were they? How many studies were conducted on the classification of black people as gorillas? What should we make of an ecosystem which tells us on a literally daily to weekly basis (google the term) about the gorillas, but never, ever tells you about the seals (I only learned about that one because I was reading the Google expert's post for other reasons)? What should we infer about the epistemics and justifications of the various experts and reporting here?

I'm writing this rather nitpicky comment because this is the top comment replying with rather strong wording about sourcing and studies and double standards for reporting...

Comment by gwern on Pathways: Google's AGI · 2021-09-25T16:48:25.928Z · LW · GW

It might be more useful to discuss Google's dense GPT-like LaMDA-137b instead, because there's so little information about Pathways or MUM. Google papers refuse to name it, for unclear reasons, but they've been doing interesting OA-like research with it: eg "Program Synthesis with Large Language Models", "Finetuned Language Models Are Zero-Shot Learners", or text style transfer.

Comment by gwern on Redwood Research’s current project · 2021-09-23T21:43:07.203Z · LW · GW

Controlling the violence latent would let you systematically sample for it: you could hold the violence latent constant, and generate an evenly spaced grid of points around it to get a wide diversity of violent but stylistically/semantically unique samples. Kinds of text which would be exponentially hard to find by brute force sampling can be found this way easily. It also lets you do various kinds of guided search or diversity sampling, and do data augmentation (encode known-violent samples into their latent, hold the violence latent constant, generate a bunch of samples 'near' it). Even if the violence latent is pretty low quality, it's still probably a lot better as an initialization for sampling than trying to brute force random samples and running into very rapidly diminishing returns as you try to dig your way into the tails.

And if you can't do any of that because there is no equivalent of a violent latent or its equivalent is clearly too narrow & incomplete, that is pretty important, I would think. Violence is such a salient category, so frequent in fiction and nonfiction (news), that a generative model which has not learned it as a concept is, IMO, probably too stupid to be all that useful as a 'model organism' of alignment. (I would not expect a classifier based on a failed generative model to be all that useful either.) If a model cannot or does not understand what 'violence' is, how can you hope to get a model which knows not to generate violence, can recognize violence, can ask for labels on violence, or do anything useful about violence?
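
The grid-sampling idea above can be sketched in a few lines of numpy (a minimal sketch: the 16-dim latent and the choice of dimension 3 as the 'violence' coordinate are made up for illustration, not from any real model):

```python
import numpy as np

def grid_around_latent(z, violence_dim, grid_width=1.0, steps=5, seed=None):
    """Sweep evenly spaced offsets along random directions around latent z,
    while pinning the designated 'violence' coordinate in place."""
    rng = np.random.default_rng(seed)
    offsets = np.linspace(-grid_width, grid_width, steps)
    samples = []
    for off in offsets:
        direction = rng.standard_normal(z.shape)
        direction[violence_dim] = 0.0     # never move along the violence axis
        samples.append(z + off * direction)
    return np.stack(samples)

z = np.zeros(16)
z[3] = 2.5                                # pretend dim 3 is the violence latent
batch = grid_around_latent(z, violence_dim=3, steps=7, seed=0)
print(batch.shape)                        # (7, 16)
print(np.allclose(batch[:, 3], 2.5))      # True: violence held constant
```

Each row is then decoded into text; the pinned coordinate is what makes every sample 'violent' while the swept dimensions supply the stylistic/semantic diversity.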

Comment by gwern on MikkW's Shortform · 2021-09-22T01:08:34.875Z · LW · GW

At some point in time, I took to calling them "dexter" and "winstar", from the Latin »dexter« and Middle English »winstre«, meaning "right" and "left", respectively

Are you aware that "deasil" and "widdershins" mean those from those roots already?

Comment by gwern on Redwood Research’s current project · 2021-09-22T01:06:40.216Z · LW · GW

Similarly, you might think that a promising approach is to look for snippets which cause the generator to generate violent completions with particularly high probability, reasoning that if the classifier says that the first 99 completions were bad but that the 100th was good, there’s perhaps an unusually high chance that it’s wrong about that 100th completion. And again, you can take this into account at eval time, by increasing the conservatism of your classifier based on how many completions it has rejected already...Try cleverer approaches to look for model mistakes, TBD. We’ve done a couple of things here but nothing has panned out super well yet.

Have you tried any of the guided generation approaches like GeDi to make the model generate only violent completions and then calling in the human oracles on all of those guided completions which the classifier misses? Or looking for a 'violence' latent?

Comment by gwern on How much should you be willing to pay for an AGI? · 2021-09-20T18:03:34.273Z · LW · GW

GPT-3 is slightly too expensive for many of the use-cases that I am interested in. This problem is made even worse by the fact that one of the basic techniques I normally use in procedural generation is "generate 100 of something and then pick the best one".

It's worth noting here that in a sense, GPT-3 isn't expensive enough if you are trading so much compute to get the necessary quality. You might well be better off with a GPT-4 which cost 10x as much. This is because the best sample out of 100 is only a bit better than the best out of 50, or the best out of 10, or the average sample, but generating 100 samples costs 100x more. If GPT-4 cost up to 100x more to run, then it might still be a win.

Particularly if you include the cost of screening 100 samples and how many workflows that eliminates... Many absolute technical metrics have hard-to-understand nonlinear translations to end-user utility. Below a certain apparently arbitrary % as defined by accuracy or word error rate or perplexity or whatever, a tool may be effectively useless; and then as soon as it crests it, suddenly it becomes useful for ordinary people. (Speech transcription & machine translation are two examples where I've noticed this.) It could be worth paying much more if it gets you to a level of reliability or quality where you can use it by default, or without supervision, or for entirely new tasks.
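
The diminishing returns of best-of-n are easy to see in a quick Monte Carlo sketch (toy assumption: sample quality is i.i.d. standard normal, which understates how heavy real quality tails can be but shows the shape of the curve):

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_best(n, trials=20_000):
    """Average quality of the best of n samples, quality ~ N(0, 1)."""
    return rng.standard_normal((trials, n)).max(axis=1).mean()

for n in (1, 10, 50, 100):
    print(n, round(expected_best(n), 2))
# The gain from 50 -> 100 samples is a small fraction of the gain
# from 1 -> 10, despite costing 50 extra generations.
```

Because the expected maximum of n draws grows only like sqrt(2 ln n), paying 100x compute for best-of-100 buys far less quality than spending that compute on a single sample from a better model.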

Comment by gwern on [Book Review] "The Alignment Problem" by Brian Christian · 2021-09-20T17:32:23.238Z · LW · GW

I'm curious what animal I would get classified as if people who look like me were removed from Google Photos training dataset. (I hope it's a meerkat.)

If anyone was wondering, no journalists bothered reporting this, but that system classified white people as 'dogs' and 'seals'.

Comment by gwern on Why sigmoids are so hard to predict · 2021-09-18T16:04:32.510Z · LW · GW

Paper version: https://arxiv.org/abs/2109.08065

Comment by gwern on Jitters No Evidence of Stupidity in RL · 2021-09-17T21:58:35.435Z · LW · GW

I agree that much of jittering reflects merely a minor absence of reward-shaping to penalize energy expenditures or wear-and-tear on equipment (the latter especially is why in robotics they do tend to add in tiny penalties for actions/changes to encourage smoothness). And when it learns tactics which depend on ultra-rapid fluctuations, well, that's usually 'a feature not a bug', assuming the environment is faithful to the intended application.

But I still tend to be a little troubled when I see jittering in an agent because it seems like it can reflect pathologies of estimation of values or actions, and to interfere with learning by adding in extraneous variation.

When an agent flips back and forth between actions which are irrelevant, that suggests that the values of the actions are fluctuating rapidly, even though the state of the environment has probably changed only a little; if the agent were learning well, with robust accurate estimation and few weird outliers or overfit estimates, you'd expect more consistency: "in state X, and X+1, and X+2, the best move is to go left"; it would be weird if a single pixel at the edge of the screen being red rather than green convinces the agent to go left - wait, now it's one RGB shade brighter, go right - wait, it's back, go left - wait, it's green, go up! - you expect more temporal consistency. (When I read about adversarial attacks on DRL agents, particularly the soccer example, it's hard not to feel like there's some connection to jittering there. There's an analogy there to "non-robust features" in image classification, as well as the original adversarial image attacks: we have a strong intuition that jittering a few pixels should not have any effect.)

In general, it seems like better agents do act more like humans. The hide&seek OA agents or the related DM game agents don't seem to jitter like the original ALE DQN does; AlphaZero, for example, was noted by both Go & chess pros to play in a much more human-like way than weaker computer Go/chess systems (despite the latter also being superhuman), and I've collated many examples of more human-like better-performing systems under the "blessings of scale" rubric. So it seems to me that when an agent is learning clearly inhuman policies like jittering, that is a strong hint that however good it is, it could still be better.

It also seems like it'd interfere with learning: aside from the effect on exploration (jittering looks like epsilon random exploration, about the worst kind), the more disparate actions, the harder it is to estimate the net effect of the key actions or the environmental baseline. If you have only a few actions inside an episode, credit assignment ought to be easier. This might contribute to the previous problem through what you might call "superstitious agents": by twitching rapidly in a particular pattern, maybe it caused the final victory? How do you know it didn't? (It only has a very sparse set of episodes interacting with the environment to try to learn these difficult high-dimensional policies trying to solve potentially arbitrary environments, and those episodes are only partially under control & highly stochastic etc.)
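
One crude way to operationalize the jitter being discussed is the flip rate of the action sequence (a toy metric on made-up trajectories, not data from any real agent):

```python
def flip_rate(actions):
    """Fraction of consecutive timesteps where the chosen action changes."""
    flips = sum(a != b for a, b in zip(actions, actions[1:]))
    return flips / max(1, len(actions) - 1)

jittery = ["L", "R", "L", "R", "L", "R", "L", "R"]
smooth  = ["L", "L", "L", "L", "R", "R", "R", "R"]
print(flip_rate(jittery))   # 1.0
print(flip_rate(smooth))    # ~0.14 (1 flip in 7 transitions)
```

A high flip rate while the observation stream barely changes is exactly the temporal-inconsistency signature described above, and is cheap enough to log as a diagnostic during training.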

Comment by gwern on Leaky Delegation: You are not a Commodity · 2021-09-15T17:47:59.771Z · LW · GW

When my housemates and I walked into the restaurant, we were greeted by an exquisite ambience and jazz music. Our hearts sank. "Given its ratings," one explained, "the better the decor, the worse the food."

I found a man whose business is helping descendants of German Jews reclaim their citizenship. The references he gave all spoke glowingly of him. Yet it only took me a few minutes after calling them to decline. "They all said you were invaluable because they didn't speak German. But I do speak German."

Berkson's paradox along the Pareto frontier.

Comment by gwern on DARPA Digital Tutor: Four Months to Total Technical Expertise? · 2021-09-14T01:18:55.096Z · LW · GW

I ran into this review of Accelerated Expertise, a book (on LG) about an Air Force/DoD thing that sounds very similar, and may give the overall paradigm.

Comment by gwern on How good are our mouse models (psychology, biology, medicine, etc.), ignoring translation into humans, just in terms of understanding mice? (Same question for drosophila.) · 2021-09-13T18:29:31.148Z · LW · GW

https://www.gwern.net/Replication#animal-studies might be a useful bibliography. It is focused on translation into humans, but a lot of the failure to translate appears to be due to the original animal studies being quite bad in terms of both internal & external validity (ie often just wrong, and when right, doesn't even translate to other strains, much less species).

Comment by gwern on How factories were made safe · 2021-09-12T20:29:35.320Z · LW · GW

It's a great video. Even has its own WP article!

Comment by gwern on The Duplicator: Instant Cloning Would Make the World Economy Explode · 2021-09-09T14:07:27.516Z · LW · GW

First of all, today's data centers are like 100K processors, while one em has 100B neurons and way more synapses, so adding processors will make sense for quite awhile.

Today's data centers are completely incapable of running whole brains. We're discussing extremely hypothetical hardware here, so what today's data centers do is at best a loose analogy. The closest we have today is GPUs and neuromorphic hardware designed to implement neurons at the hardware level. GPUs already are a big pain to run efficiently in clusters because lack of parallelization means that communication between nodes is a major bottleneck, and communication within GPUs between layers is also a bottleneck. And neuromorphic hardware (or something like Cerebras) shows that you can create a lot of neurons at the hardware level; it's not an area I follow in any particular detail, but for example, Intel's Loihi chip implements 1,024 individual "spiking neural units" per core, 128 cores per chip, and they combine them in racks of 64 chips maxing out at 768 chips for a total of 100 million hardware neurons - so we are already far beyond any '100k processors' in terms of total compute elements. I suppose we could wind up having relatively few but very powerful serial compute elements for the first em, but given how strong the pressures have been to go as parallel as possible as soon as possible, I don't see much reason to expect a 'serial overhang'.
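
The quoted Loihi figures multiply out as claimed (taking the per-core/per-chip/per-system numbers at face value):

```python
neurons_per_core = 1024
cores_per_chip = 128
chips = 768                      # largest configuration mentioned

total = neurons_per_core * cores_per_chip * chips
print(f"{total:,}")              # 100,663,296 ~ 100 million hardware neurons
```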

Comment by gwern on The Duplicator: Instant Cloning Would Make the World Economy Explode · 2021-09-08T23:11:33.513Z · LW · GW

A brain has serial bottlenecks in the form of all the communication between neurons, in the same way you can't simply shard GPT-3-175b onto 175 billion processors to make it run 175 billion times faster. Each compute element is going to be stuck waiting on communication with the adjacent neurons. At some point, you have 1 compute node per neuron or so (this is roughly the sort of hardware you'd expect ems to run on, brain-sized neuromorphic hardware, efficiently implementing something like spiking neurons), and almost all the time is spent idle waiting for inputs/outputs. At that point, you have saturated your available parallelism and Amdahl's law rules. Then there's no easy way to apply more parallelism: if you have some big chunks of brains which don't need to communicate much and so can be parallelized for performance gains... Then you just have multiple brains.

To overclocking - it seems you're saying parallelization depends on it somehow? I didn't really understand this part.

Increasing clock speed has superlinear costs.

Comment by gwern on The Duplicator: Instant Cloning Would Make the World Economy Explode · 2021-09-08T14:43:14.173Z · LW · GW

This seems to ignore all of the inefficiencies in parallelization.

Processors run more inefficiently the faster you run them (this is the entire reason for 'underclocking'), so running 1 em of hardware 1000x faster will cost you >>1000x. (IIRC, Hanson has a lot of discussion of this point in Age of Em about how the cost of speed will result in tiers: some ems would run at the fastest possible frequencies but only for a tiny handful of tasks which justify the cost, somewhat analogous to HFT vs most computing tasks today - they may need 1 millisecond less latency and will pay for a continent-wide system of microwave towers and commission custom FPGAs/ASICs, but you sure don't!)

There's also Amdahl's law: anything you do in parallel with n processors can be done serially with n× the time with zero penalty, but vice-versa is not at all true - many tasks just can't be parallelized, or have only a few parts which can be parallelized, and the parallelization usually incurs at least some overhead (and this is in addition to the penalty you pay for running processors faster).
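
The asymmetry is easy to make concrete: under Amdahl's law, with even 95% of the work parallelizable, speedup saturates near 20x no matter how many processors you throw at it (a sketch with an arbitrary illustrative fraction):

```python
def amdahl_speedup(p, n):
    """Amdahl's law: p = parallelizable fraction of the work, n = processors."""
    return 1.0 / ((1.0 - p) + p / n)

for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.95, n), 1))
# Speedup approaches but never reaches 1 / (1 - 0.95) = 20x.
```

Serial speed, by contrast, buys time on everything, parallelizable or not, which is why paying superlinear costs for faster clocks can still be worth it for the tasks that justify it.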

If there are fixed costs, it would make more sense to do something like run 1 em on a premium processor, and then fork it as soon as possible to a bunch of slow efficient processors to amortize the fixed cost; you wouldn't fork out for a super-exotic crazy (Cray?) 1000x faster processor to do it all in one place.

Comment by gwern on Can you get AGI from a Transformer? · 2021-09-07T22:08:29.100Z · LW · GW

Yeah, I didn't want to just nitpick over "is this tree search a MCTS or not", which is why I added in #2-4, which address the steelman - even if you think MuZero is using MCTS, I think that doesn't matter because one doesn't need any tree search at all, so a fortiori that question doesn't matter.

(I also think the MuZero paper is generally confusing and poorly-written, and that's where a lot of confusion is coming from. I am not the only person to read it through several times and come away confused about multiple things, and people trying to independently reimplement MuZero tell me that it seems to leave out a lot of details. There's been multiple interesting followup papers, so perhaps reading them all together would clarify things.)


Yes, so on your spectrum of #1-6, I would put myself at closer to 3 than 2. I would say that while we have the global compute capacity now to scale up what are the moral equivalents of contemporary models to what the scaling laws would predict is human-equivalence (assuming, as seems likely but far from certain, that they more or less hold - we haven't seen any scaling law truly break yet), at the hundreds of trillions to quadrillion parameter regime of Transformers or MLPs, this is only about the compute for a single training run. The hardware exists and the world is wealthy enough to afford it if it wanted to (although it doesn't).

But we actually need the compute for the equivalent of many runs. The reason hardware progress drives algorithmic software progress is because we are absolutely terrible at designing NNs, and are little more than monkeys banging at giant black boxes with trial-and-error, confabulating or retrospectively cherrypicking theories to explain the observed results. Thus we need enough compute to blow on enough runs that a grad student can go 'what if I added a shortcut connection? Oh' or 'these MLP things never work beyond 3 or 4 layers, everyone knows that... but what if I added any kind of normalization, the way we normalize every other kind of NN? Oh' and figure out the right detail which makes it Just Work.

So, we will need a lot of algorithmic efficiency beyond the bare minimum of '1 training run, once', to afford all the slightly-broken training runs.

(Unless we get 'lucky' and the prototyping small runs are so accurate and the code so solid that you can prototype at a tiny scale and do 1 run; I tend to disbelieve this because there's so many issues that always come up as you move several magnitudes, both at just the code level and training.)

On the other hand, it is something that humans deliberately added to the code.

/shrug. If you don't like the TreeQN example, I have others! Just keep making the NN deeper (and/or more recurrent, same thing really, when unrolled...), and it'll keep approximating the value function better at fairly modest additional cost compared to 'real' tree search. (After all, the human brain can't have any symbolic discrete tree in it either, it just passes everything forward for the initial glance and then recurs for System 2 thinking through the game tree.)

I see symbolic vs neural as a bias-variance continuum, per the Bitter Lesson: symbolic learns quickly for little compute, but then it tops out, and eventually, the scissors cross, and the more neural you go, the better it gets. So the question ultimately becomes one of budgets. What's your budget? How much constant-factor performance optimization and ultimate ceiling do you need, and how much hand-engineering of that specialized complicated symbolic architecture are you willing to buy? If you have little compute and don't mind attaining less than superhuman performance and buying a lot of complicated domain-specific code, you will move far down the symbolic end; if you have lots of compute and want the best possible generic code...

and less apt to believe that it's feasible for something like AutoML-Zero to search through the whole space of things that you can do with this toolkit, and less apt to describe the space of things you can build with this toolkit as "algorithms similar to DNNs".

But that's where the scaling laws become concerning. Can AutoML-Zero successfully search for "code to implement MCTS with pUCT exploration heuristic and domain-specific tuned hyperparameters with heavy playouts using a shallow MLP for value approximation"? Probably no. That's complex, specialized, and fragile (a half-working version doesn't work at all). Can AutoML-Zero learn "add 10 moar layers to $DEFAULT_NN lol"? ...Probably yes.

Comment by gwern on [deleted post] 2021-09-06T22:19:30.524Z

Yeah, it's a bit of a blind-men-and-the-elephant thing. Like the Turing test thing was all of those, because he said something along the lines of "we don't want to aim for passing the Turing test (because that's pointless/useless and OA can only do a few things at a time) but we could if we put a few years into it and a hypothetical GPT-5* alone could probably do it". All 3 claims ("we could solve Turing test", "a GPT-5 would probably solve Turing", "we don't plan to solve Turing") are true and logically connected, but different people will be interested in different parts.

* undefined but presumably like a GPT-4 or GPT-3 in being another 2 OOM or so beyond the previous GPT

Comment by gwern on Can you get AGI from a Transformer? · 2021-09-06T18:53:24.806Z · LW · GW

No. I am very familiar with the paper, and MuZero does not use MCTS, nor does it support the claims of OP.

First, that's not MCTS. It is not using random rollouts to the terminal states (literally half the name, 'Monte Carlo Tree Search'). This is abuse of terminology (or more charitably, genericizing the term for easier communication): "MCTS" means something specific, it doesn't simply refer to any kind of tree-ish planning procedure using some sort of heuristic-y thing-y to avoid expanding out the entire tree. The use of a learned latent 'state' space makes this even less MCTS.*

Second, using MCTS for the planning is not necessary. As they note, any kind of planning algorithm, not just MCTS would work ("For example, a naive search could simply select the k step action sequence that maximizes the value function. More generally, we may apply any MDP planning algorithm to the internal rewards and state space induced by the dynamics function.")

Third, NNs absolutely can plan in a 'pure' fashion: TreeQN (which they cite) constructs its own tree which it does its own planning/exploration over in a differentiable fashion. What more do you want? I feel that we should at least acknowledge that TreeQN exists, wasn't insuperably hard to create, and, inasmuch as it runs on current hardware at all, doesn't seem to entail 'a factor of a million slowdown'. (VIN/VPN/Predictron might count as examples here too? There's a lot of model-based RL work which make the NN learn part of the planning process, like Imagination-based Planner or MCTSnets.)

Fourth, planning is not necessary at all for the NN to compute results just as strong as tree search would: just like regular AlphaZero, the policy network on its own, with no rollouts or trees involved of any sort, is very strong, and they show that it increases greatly in strength over training. We also have the scaling law work of Andy Jones, verifying the intuition that anything tree search does can be efficiently distilled into a non-tree-search model trained for longer. (I would also point out the steeply diminishing returns to both depth & number of iterations: AlphaZero or Master, IIRC, used only a few TPUs because the tree-search was a simple one which only descended a few plies; you can also see in the papers like the MuZero appendix referenced that most of the play strength comes from just a few iterations, and they don't even evaluate at more than 800, IIRC. It seems like what tree search does qualitatively is correct the occasional blind spot where the NN thinks forward a few moves for its best move and goes 'oh shit! That's actually a bad idea!'. It's not doing anything super-impressive or subtle. It's just a modest local policy iteration update, if you will. But the NN is what does almost all of the work.) This alone is completely fatal to OP's claims that tree search is an example of useful algorithms neural nets cannot do and that adding orders of magnitude more compute would not make a difference (it totally would - the exact scaling exponent for Go/ALE is unknown but I'd bet that anything you can do with MuZero+tree-search can be done with a larger MuZero's policy alone given another order or three of compute).

So, MuZero does not use MCTS; the symbolic tree planning algorithm(s) it uses are not that important; to the extent that explicit tree planning is useful it can be done in a pure neural fashion; and relatively modest (as these things go) increases in compute can obviate the need for even pure neural tree search.

This refutes Byrne's use of tree search as an example of "Background Claim 1: There are types of information processing that cannot be cast in the form of Deep Neural Net (DNN)-type calculations (= matrix multiplications, ReLUs, etc.), except with an exorbitant performance penalty." Tree search is not an example because it already has been cast into DNN form without exorbitant performance penalty.

* for more on what AlphaZero MCTS "really" is, https://arxiv.org/abs/2007.12509 & https://arxiv.org/abs/1804.04577 come to mind.

Comment by gwern on [deleted post] 2021-09-06T15:50:26.097Z

It's what I would have guessed at the estimated revenue numbers, but it's good to know. It also means that they're probably going to seek more VC (from MS?) for continuing upgrading & scaling, since we're seeing increasing competition. (Google alone has posted at least 2 papers so far using LaMDA, which is at a comparable scale, for coding as well, and their mysterious 'Pathways' model is apparently multimodal and even larger.)

Comment by gwern on [deleted post] 2021-09-06T14:45:30.382Z

One observation I found interesting was Altman said that the OA API+Codex is profitable, but not profitable enough for the 'next generation' of models.

Comment by gwern on [deleted post] 2021-09-06T01:51:38.728Z

The context was that he was saying, to paraphrase, "that people would adapt to the changes from pervasive cheap energy & intelligence on tap [which he forecasts as coming in the next decades], however scary and weird we might find it, because the modern context is already weird and very different from human history; an example of this sort of human ability to cope with change is that the US government announced the other day that UFOs are real, and everyone just shrugged and carried on as usual." I didn't take him as endorsing the claim "yeah, space aliens are totally real, here, and buzzing pilots for kicks", necessarily.

Comment by gwern on [deleted post] 2021-09-06T01:47:59.529Z

I do remember hearing that if systems became capable of self-improvement (sooner than expected?), that could be a big update

The way I heard that bit was he said he expected it to go smoothly; then someone asked him what it would take to change his mind and would be a 'fire alarm', and he said self-improvement with some sudden jumps in abilities is where he'd start to seriously worry about a hard takeoff.

Comment by gwern on Why the technological singularity by AGI may never happen · 2021-09-03T16:08:06.385Z · LW · GW

Don't forget https://www.gwern.net/Complexity-vs-AI which deals with hippke's argument more generally. We could also point out that scaling is not itself fixed as constant factors & exponents both improve over time, see the various experience curves & algorithmic progress datasets. (To paraphrase Eliezer, the IQ necessary to destroy the world drops by 10% after every doubling of cumulative research effort.)

Comment by gwern on Guide to Warwick, New York · 2021-09-01T22:19:27.229Z · LW · GW

I think it is simply automatic crossposting of Zvi's feed, is it not? It was presumably submitted because many posts from that source are of LW interest, not because this particular post was claimed to be super-relevant.

In this case, you could certainly extract a LW-relevant lesson: "small towns are pretty nice & cheap places to live if you are unambitious and don't want to meet people, but are intellectually costly". This is important to note about the tradeoffs, and this was precisely what was at issue with MIRI's recent evaluation of whether to move out of the Bay Area to places a lot like Warwick, NY, and their choice not to: as crummy and getting worse as the Bay Area may be, it is still ultra-dense with the people that MIRI needs both professionally and as amenities for its people personally.

Comment by gwern on How is reinforcement learning possible in non-sentient agents? · 2021-08-31T17:17:27.913Z · LW · GW

One toy model worth considering is MENACE. It is clearly a model-free RL algorithm (a kind of tabular Q-learner) which successfully solves Tic-Tac-Toe, without even requiring a computer, but breaks most of one's anthropomorphization or mentalization attempts.
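Stripped to its core, the MENACE update rule can be sketched in a few lines (a hypothetical simplification: one position with three candidate moves rather than the full Tic-Tac-Toe state space, and fixed win/loss outcomes standing in for played-out games):

```python
import random

# MENACE-style matchbox learner for one toy position with 3 candidate moves.
# Move 0 always wins; moves 1 and 2 always lose (a stand-in for real game outcomes).
random.seed(0)
beads = {0: 8, 1: 8, 2: 8}  # initial bead counts per move

def pick_move():
    # Draw a move with probability proportional to its bead count.
    moves = [m for m, n in beads.items() for _ in range(n)]
    return random.choice(moves)

for _ in range(200):
    m = pick_move()
    if m == 0:          # win: reinforce by adding beads
        beads[m] += 3
    else:               # loss: punish by removing a bead (keep at least 1)
        beads[m] = max(1, beads[m] - 1)

# After training, the winning move dominates the bead distribution.
print(beads[0] > beads[1] + beads[2])  # → True
```

Nothing here 'understands' the game: bead counts shift and behavior improves, which is exactly why MENACE breaks attempts to anthropomorphize RL.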

Comment by gwern on Multi-dimensional rewards for AGI interpretability and control · 2021-08-30T01:55:52.322Z · LW · GW

What happens? Your behavior would change in response, but I claim it would change very gradually.

For model-free learning. For model-based learning, your behavior changes instantly, as you are now able to project forward into the future, examine the utilities, discover that scenarios with social approval now have zero utility, and all actions are then directed towards whatever is still rewarding. (In psychological experiments, the speed of adaptation is in fact considered one way to distinguish between a mouse using model-based RL and when it is using model-free: when a reward at the end of the maze changes, does it need to hit the end of the maze several times before any decisions start changing, or is it able to rapidly switch plans, implying a model of the environment separate from the learning about rewards?)
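A toy contrast of the two regimes (a hypothetical 3-state setup; the names and values are purely illustrative):

```python
# Toy contrast: a model-based planner reacts to a reward change instantly,
# while a tabular model-free value table stays stale until the agent
# re-experiences the goal state and re-learns.

rewards = {"left_goal": 1.0, "right_goal": 0.0}

# Model-free: cached values learned under the old rewards.
V = {"left_goal": 1.0, "right_goal": 0.0}

def model_free_choice():
    return max(V, key=V.get)

def model_based_choice():
    # Plans by looking up the *current* reward model, not cached values.
    return max(rewards, key=rewards.get)

print(model_free_choice(), model_based_choice())  # both pick "left_goal"

# The reward at the end of the maze changes:
rewards["left_goal"], rewards["right_goal"] = 0.0, 1.0

print(model_free_choice())   # still "left_goal": stale until re-learned
print(model_based_choice())  # "right_goal": switches immediately
```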

Comment by gwern on Randal Koene on brain understanding before whole brain emulation · 2021-08-25T23:40:41.938Z · LW · GW

I've been trying to brand this paradigm as "brain imitation learning", but it hasn't caught on. The research continues: we're seeing exponential increases in neuron-recording capabilities, and DL models are doing ever better at cracking the human brain's neural code*, but this in-between approach is still mostly ignored.

* so IMO the only reason to be less interested in it than a few years ago is if you think pure DL scaling/progress has gone so fast that it's outpacing even that, which is reasonable but given the imponderables here and the potential for sudden plateaus in scaling or pure DL progress, I think people should still be keeping more of an eye on brain imitation learning than they do.

Comment by gwern on Are we in an AI overhang? · 2021-08-25T16:59:58.582Z · LW · GW

Might be worth getting around to it:

Comment by gwern on A Better Web is Coming · 2021-08-22T17:06:29.387Z · LW · GW

Yes, the community equilibrium is entirely different. On WP editors have little compunction about editing categories; here, I know vaguely that tags can be added (although I didn't know that you could refactor them or remove them), but I wouldn't do so because there's no particular norm to do so. Who would I go to about editing matto's post's tags to break down world-optimization into something more specific?

Tags could be useful, but they aren't now, and so they stay being not useful, and it's unrealistic to expect anyone to single-handedly fix that when there's like 10 posts a day and approaching 12 years of backlog.

A GPT-3 proof-of-concept will certainly be interesting. If it works, it could bootstrap useful tags on larger corpuses like LW. (It might be expensive, but it's only money, and a lot cheaper than the expert LWer time it'd take; and of course, if GPT-3 works well, then perhaps a rival model like GPT-J or T5 or Jurassic would be worth finetuning to cut costs.)

Comment by gwern on A Better Web is Coming · 2021-08-22T02:22:25.477Z · LW · GW

I find the context tags on LessWrong useful at times.

I've found them useless in every iteration. They are extremely inconsistently applied, and those authors who do bother to make an effort often leave them at uselessly large levels of granularity like math or statistics or AI. (Gee, thanks.)

A decent tag or category system needs to be reasonably comprehensive - if not, why even bother, just go straight to Google search - and regularly refined to shrink member count. If there are 1000 members of a category, then it is long past time to break that down into a few sub-categories. When I look at websites whose tags or categories are useful, like Wikipedia or Danbooru or classic folksonomies like del.icio.us (RIP), the tagging itself is a major focus of community efforts and it doesn't require the cooperation of the author to update things.

Any WP editor can refine a category into subcategories or add a category to any article, and there are tools to assist by brute force to clean it all up. It's a huge time-sink of human effort, like everything on WP, but it works, dammit! You can meaningfully browse WP categories and have a reasonable expectation of comprehensiveness, and they do a good job of gradually encoding the structure of all the crosscutting domains. I use them fairly often.


I use tags on gwern.net for pages, and I try to systematically add new tags to all relevant pages and refactor them down into reasonably sharp tags. I think they wind up being reasonably useful, but there's also not enough pages on gwern.net for tags to shine. (When you can simply list all the good pages on the index in a few screens by topic, you've covered the Pareto value of tags.)

What I have been considering is extending tags to external links/documents. I have something like 20k external links + hosted documents, and the sheer volume means that tags are potentially highly useful for them. (A link like "Open-Ended Learning Leads to Generally Capable Agents" would benefit a lot from a set of tags like 'blessings-of-scale multi-agent DeepMind deep-reinforcement-learning' which offer an entrance point to the scores of prior art links to contextualize it.) The problem is how to be systematic? My thinking is that this is a case where I can employ the OA GPT-3 API's "classification" endpoint to do the work for me: I don't scale well, but it does. I can initialize the link tags from my existing directory hierarchy, finetune a GPT-3 model to infer "tag" from "annotation" (GPT-3 is smart enough that it'll understand this very well), use that to rank possible tags for all links, accept/reject by hand, and bootstrap. Then adding new tags can be done by re-classifying all links. A lot of details to get right, but if it works, it'll be almost as good as if I'd been building up a tag folksonomy on my links from the getgo.
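As a hypothetical stand-in for that GPT-3 classification workflow (the real version would call the OA API on a model finetuned over annotations), the core rank-then-accept loop can be sketched with a crude word-overlap scorer:

```python
# Hypothetical, pure-Python stand-in for the tag-bootstrapping loop: rank
# candidate tags for a new link annotation by word overlap with annotations
# already assigned to each tag, then accept/reject the top suggestions by hand.
# (A GPT-3 classifier would replace rank_tags; all names here are illustrative.)

tagged = {
    "reinforcement-learning": ["agents trained with rewards in environments"],
    "scaling": ["larger models trained on more data keep improving"],
}

def rank_tags(annotation):
    words = set(annotation.lower().split())
    scores = {tag: max(len(words & set(a.lower().split())) for a in anns)
              for tag, anns in tagged.items()}
    return sorted(scores, key=scores.get, reverse=True)

# A new link annotation to tag:
suggestions = rank_tags("open-ended environments produce generally capable "
                        "agents trained with rewards")
print(suggestions[0])  # → "reinforcement-learning"
```

The point of the design is the same at any scale: initialize tags from existing structure, let the classifier propose, keep a human in the loop only for accept/reject, and re-classify everything whenever a new tag is added.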

Comment by gwern on MikkW's Shortform · 2021-08-21T16:38:44.239Z · LW · GW

It's less surprising if you're familiar with the history of MCTS. MCTS is a generic MDP or decision-tree solver: you can use it for pretty much any kind of non-adversarial discrete fully-observed planning process where you have a model; you can extend it to non-fully-observed POMDP and continuous observations fairly easily, and that was done back in the 2000s. (Adversarial is also easy - minimax it - but adversarial+POMDP mostly breaks MCTS which is why you don't see it but other methods solving poker.) Path planning is a classic tree search problem which comes up all the time in robotics and other planning domains like planning movement paths in simulations/games, and so if you go back and look, you'll find plenty of pre-AlphaGo applications of MCTS to path planning.
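A flat Monte-Carlo planner - the simple ancestor of UCT-style MCTS - illustrates how the same rollout machinery applies directly to path planning (a toy 3×3 grid; all constants are illustrative):

```python
import random

# Flat Monte-Carlo planning on a tiny grid: score each candidate first move
# by the average return of random rollouts, then commit to the best move.
# Full MCTS adds a search tree and UCT selection on top of this same idea.
random.seed(0)
SIZE, GOAL = 3, (2, 2)
MOVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]

def step(pos, move):
    x, y = pos[0] + move[0], pos[1] + move[1]
    return (x, y) if 0 <= x < SIZE and 0 <= y < SIZE else pos  # walls: stay put

def rollout(pos, horizon=12):
    # Random playout; reaching the goal sooner scores higher.
    for t in range(horizon):
        if pos == GOAL:
            return horizon - t
        pos = step(pos, random.choice(MOVES))
    return 0

def plan(pos, n=2000):
    scores = {m: sum(rollout(step(pos, m)) for _ in range(n)) / n for m in MOVES}
    return max(scores, key=scores.get)

best = plan((0, 0))
print(best)  # with enough rollouts, a move toward the goal: (0, 1) or (1, 0)
```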

Comment by gwern on Are we in an AI overhang? · 2021-08-20T19:29:15.113Z · LW · GW

They still running into stationary objects? The hardware is cool, sure, but unclear how much good it's doing them...

Comment by gwern on AI-Based Code Generation Using GPT-J-6B · 2021-08-18T01:53:07.896Z · LW · GW

To elaborate on this a little more: maintenance is the kind of nasty field where '99% accurate' may still not be nearly good enough if you want to unlock big productivity gains of the sort you get by replacing humans entirely, rather than merely saving a few minutes here or there looking up API docs etc. Amdahl's law is not mocked: if a human has to manually review and step in, then it cannot deliver more than modest factor gains, any more than learning to type really fast will deliver life-changing productivity gains. Maintenance is almost by definition about the long tail of subtle bugs, system interactions, faulty assumptions, and business-driven requirement changes.* If you're a SWE at Google, you don't spend very much time writing little self-contained greenfield scripts of 100-500 lines. You'll spend a lot more time doing, say, code reviews of new pulls, which involve no writing of greenfield code at all. Something like Codex can help knock out the occasional script or help in learning a new system or be a very useful substrate for static analysis tools (like Coverity on steroids), but I can confidently predict that Codex is not going to make programmers even 10x more productive. Utility doesn't increase smoothly with accuracy: it plateaus and jumps. You don't want to use a voice transcription system which makes 10% errors, but at 5% it might suddenly become useful.
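Amdahl's law makes that cap concrete: if a human must still handle some serial fraction of the work, no speedup of the automated fraction can exceed the reciprocal of that serial fraction.

```python
# Amdahl's law applied to the claim above: if a human must still manually
# review some fraction of the work, the total speedup is capped no matter how
# fast the automated part becomes.

def amdahl_speedup(automated_fraction, automation_speedup):
    serial = 1 - automated_fraction
    return 1 / (serial + automated_fraction / automation_speedup)

# Even with an infinitely fast code generator, handling 50% of the work by
# hand caps the gain at 2x; handling 10% by hand caps it at 10x.
print(round(amdahl_speedup(0.5, 1e9), 2))  # → 2.0
print(round(amdahl_speedup(0.9, 1e9), 2))  # → 10.0
```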

But ironically, in many ways, developing DL code is far simpler. Sometimes, solving a much harder problem is much easier. DL is much more self-contained and amenable to self-modification. The complexity of the learned tasks resides in the weights, not the seed algorithm which learns the NN; the seed algorithm may be extremely simple and short, a few hundred lines at most, including all the boilerplate and wrappers and ceremony. You can write backprop and CNNs in a few hundred lines for a self-contained CPU implementation. Available DL libraries let you create & train an arch like GPT in a few dozen lines (Karpathy does minGPT in <300 lines of bloated code). Rip Van Winkle is an interesting exercise in estimating complexity, in a Kolmogorov sort of way, of a formerly-SOTA CNN ResNet at 1,032 bits. Evolutionary search programs like AutoML-Zero can recapitulate backprop and other core algorithms in a few lines. We also see this in the breakthroughs themselves: why do MLPs suddenly work? Because you add like 1 line to re-normalize or gate intermediate activations. Why did resnets suddenly make 'deep' (>10) layer NNs work? Because you add like 1-3 lines to define a shortcut connection. Why did NNs suddenly start working around 2009? Because you added 1 line for the right initialization, and 1 line for a ReLU instead of sigmoid nonlinearity. Why did X work - we could go on all day. (Why is one person a genius and another person ordinary? Differences at a few thousand alleles which could be encoded in less than a kilobyte. Everything is fragile.) The space of all possible programs of a few hundred self-contained lines to bootstrap a general meta-learning agent is vast... but it's also exactly the sort of task where a self-supervised agent can acquire most of the necessary bits from the environment, solving basic problems like how to create valid ASTs (the sort of knowledge that isn't in AutoML-Zero-esque systems, and mostly accounts for their boil-the-ocean inefficiency), and then use the tiny bit of supervision from evolutionary RL losses to close the gap by selecting only plausible modifications to test, running a feasible number of iterations, and modifying the last handful of key lines.

Thus, an asymmetry in code-generating AIs. A code-generating AI could be almost completely useless for 'easy' maintenance tasks like fixing bugs in production code because it comes with so much overhead and unreliability that it isn't worth the hassle, but also still offer enormous exponential gains in ranking candidates for the 'hard' problem of rewriting a core DL algorithm.

* If you've paid attention to the popups on Gwern.net, you've probably noticed that they've changed a number of times; the Wikipedia popups, specifically, have now gone through 8 completely different implementations. The 8th iteration, ironically, is very similar to the 1st iteration: it requests from the Wikipedia APIs an article summary and displays it; that's all. I & Obormot have spent a breathtaking amount of time on this, not because the actual coding itself takes up substantial time (none of it is remotely impressive algorithmically), but because the hard part is understanding what even should be done in the first place and what tradeoff between static, dynamic, inlined vs external, popup vs popin etc works best, implementing and testing in the real world to see how it felt in practice and what users thought, how it scaled as I fixed bugs & found edge-cases... By the 8th iteration, what we'd learned was that static or inlined couldn't work at scale or provide recursion in any feasible way and were deadends, and the main motivation for those - displaying hyperlinked excerpts - was moot because we were using the wrong WP API in the first iteration, and there was a 'mobile' API which, I discovered after hours of docs reading, provided useful rather than butchered excerpts and worked fine all along. "Time is a circle."

Comment by gwern on Who was the person who escaped the Nazis a day before they cracked down? · 2021-08-17T20:27:01.709Z · LW · GW

https://jasoncrawford.org/precognition

Comment by gwern on Simultaneous Redundant Research · 2021-08-17T17:09:06.933Z · LW · GW

Obviously the first thing to do is to divide the research into subproblems and research them in parallel. But what if the number of research teams still exceeds the number of real subproblems identified?

This is easy but not necessarily optimal. Sometimes you want to overkill a hypothesis before falling back. Imagine a scenario where the top hypothesis has 50% prior probability, you run an experiment which is powered to have a 10% error rate in definitively accepting/rejecting it, and you could run a second experiment reducing that error to 5%; do you really want to instead spend that experiment testing a dark horse hypothesis with a prior probability of 1%? Probably better to drive that top hypothesis down to <1% first, letting the marginal value of the other hypotheses grow, before investing in buying lottery tickets.
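The arithmetic can be made explicit with Bayes' rule (a sketch, using the error rates above):

```python
# Posterior that the top hypothesis is still true after experiments that
# reject it, where each experiment has some error rate of giving the wrong
# verdict.

def posterior_after_reject(prior, error_rate):
    # P(H | reject) via Bayes: a true H is wrongly rejected with p = error_rate,
    # and a false H is correctly rejected with p = 1 - error_rate.
    num = error_rate * prior
    return num / (num + (1 - error_rate) * (1 - prior))

p = 0.5                               # 50% prior on the top hypothesis
p = posterior_after_reject(p, 0.10)   # first experiment (10% error) rejects it
print(round(p, 3))                    # → 0.1
p = posterior_after_reject(p, 0.05)   # second, tighter experiment also rejects
print(round(p, 4))                    # → 0.0058: now well below 1%
```

So two rejections, not one, are what it takes to drive a 50% hypothesis below the 1% dark horse - which is why replicating the top hypothesis can beat immediately chasing long shots.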

This is a pretty classic sort of multi-stage decision problem in research and so relevant stuff comes up everywhere depending on how you look at it: it's related to experiment design, particularly factorial design; to external vs internal validity, especially in meta-analysis where you balance between-study measurement of heterogeneity/systematic error with overcoming within-study random sampling error; to group testing; and to parallelized blackbox optimization (especially in hyperparameter optimization, where you can more easily run many models in parallel than one model really fast) where you have to distribute multiple arms sampling across the loss landscape and need to avoid over-concentrating in narrow regions of settings.

Comment by gwern on Rafael Harth's Shortform · 2021-08-16T16:35:39.676Z · LW · GW

Aside from lambdas, Python has 'inner functions' where you just def inside a def. Java has anonymous inner classes and private functions, and Java 8 adds lambdas; I had to google this one, but apparently Java even has "local classes" which sounds like an exact match for what you want?

Comment by gwern on Rafael Harth's Shortform · 2021-08-15T22:03:53.631Z · LW · GW

What languages are you using that don't support that? Every language I use on a semi-monthly basis (Haskell, R, Python, Bash, Javascript, PHP, Elisp...) that I can think of supports defining a function inside a function (under various names like let/where local definitions, 'inner functions', what-have-you), and typically support even anonymous function definitions (lambdas).
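For concreteness, the two forms in Python - a `def` inside a `def`, and an anonymous lambda:

```python
# A function defined inside another function ('inner function'), closing over
# local state, plus an anonymous lambda definition.

def make_counter(start):
    count = [start]
    def increment(step=1):      # inner function: a def inside a def
        count[0] += step
        return count[0]
    return increment

counter = make_counter(10)
double = lambda x: 2 * x        # anonymous function definition
print(counter(), counter(5), double(4))  # → 11 16 8
```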

Comment by gwern on New GPT-3 competitor · 2021-08-15T21:05:42.252Z · LW · GW

It's both: context length, and the bias-variance tradeoff means that modeling raw data is intrinsically harder. Realistically, byte-level is about as low-level as is reasonable to tokenize at this point, and you can get good results like ByT5.

You could definitely imagine that more complicated architectures with more flexible computation patterns than standard Transformers would be more able to handle bit-level encodings, like a Perceiver which selectively attends to bits and pieces of a very large binary input, saving computation by only iteratively focusing on the specific bits which it needs, but such an arch is going to be that much harder to train, and likely require more data to overcome the overhead & increased flexibility.

Comment by gwern on New GPT-3 competitor · 2021-08-15T18:16:15.399Z · LW · GW

One way to think of it: the least-lossy and biased tokenization, the one most faithful to all input representations, allowing the most accurate modeling possible, which allows the best possible splitting, would have exactly 2 tokens - '0', and '1'.

All tokenizations beyond that are implicitly pre-processing the data before the NN sees it, and are making a choice on the bias-variance tradeoff to inject some bias (hiding the raw data) to reduce the variance (by condensing texts into shorter token sequences and doing some 'thinking' in advance).
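A quick illustration of the variance side of the tradeoff: the same sentence as bits, bytes, and (crudely) words - each step up the hierarchy shortens the sequence the NN must model, at the cost of baking in more preprocessing decisions:

```python
# The same text at three 'tokenization' granularities. Lower-level
# representations are unbiased but yield far longer sequences.

text = "The quick brown fox jumps over the lazy dog."
bits = len(text.encode("utf8")) * 8   # '0'/'1' vocabulary: 8 tokens per byte
bytes_ = len(text.encode("utf8"))     # byte-level (e.g. ByT5-style)
words = len(text.split())             # crude stand-in for a learned subword vocab

print(bits, bytes_, words)  # → 352 44 9
```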

Comment by gwern on Slack Has Positive Externalities For Groups · 2021-08-14T00:37:52.641Z · LW · GW

Queuing theory: https://blog.acolyer.org/2015/04/29/applying-the-universal-scalability-law-to-organisations/

Comment by gwern on How would the Scaling Hypothesis change things? · 2021-08-13T22:51:11.694Z · LW · GW

which is that you could actually get decent data-efficiency out of current architectures if they were just really really big?

You mean in some way other than the improvements on zero/few-shotting/meta-learning we already see from stuff like Dactyl or GPT-3 where bigger=better?

Comment by gwern on [AN #160]: Building AIs that learn and think like people · 2021-08-13T19:16:30.591Z · LW · GW

(Formatting is again completely screwed up, to the point where you can't even scroll to the comment section.)

Comment by gwern on New GPT-3 competitor · 2021-08-13T19:13:03.795Z · LW · GW

I agree.

Comment by gwern on Combining the best of Georgian and Harberger taxes · 2021-08-12T19:06:38.931Z · LW · GW

This reminds me of the difficulty in doing any kind of Harbergerian transfer of domain names in cryptocurrency-related DNS or identity systems: the value to a malicious attacker of a domain name can vastly exceed any amount that the legitimate owner, who creates that value, can afford to pay. An attacker could afford to spend millions of dollars to seize 'Coinbase.com' for an hour (so as to MITM accounts+passwords, and thereby steal billions in account balances), while Coinbase itself cannot afford to spend millions per hour securing a domain name. The system operates as designed in transferring the domain name to the user who can extract the most value from the monopoly over the name, but that's not the same thing as creating social value...

Comment by gwern on What 2026 looks like (Daniel's Median Future) · 2021-08-12T18:12:23.637Z · LW · GW

Are you surprised? That is precisely what you should expect from the transfer scaling law papers: transfer works as an informative prior saving you a fixed amount of data in the target domain, but informative vs uninformative priors wash out in the limit of enough data - similar to how good prompts are worth a few hundred/thousand finetuning datapoints. If you have limited data in the target domain, transfer can be a huge win; but if you have huge amounts of data, it may be unimportant in terms of final converged performance (albeit potentially important for other reasons like saving compute!).
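A sketch of the wash-out effect, modeling transfer as a fixed 'effective data' bonus on a power-law loss curve (the exponent and bonus size are assumed, purely illustrative):

```python
# Model transfer as a fixed effective-data bonus D_t on a power-law loss,
# L(N) = (N + D_t) ** -alpha: the bonus dominates at small N and washes out
# at large N. Constants are illustrative, not fitted.

alpha, D_t = 0.1, 10_000   # assumed exponent and effective-data transfer bonus

def loss(n, transfer=0):
    return (n + transfer) ** -alpha

for n in [1_000, 10_000_000]:
    gain = loss(n) / loss(n, D_t)   # >1 means transfer lowers the loss
    print(n, round(gain, 3))        # → 1000 1.271, then 10000000 1.0
```

With little target data, transfer cuts the loss noticeably; with terabytes of scraped code, the same bonus is a rounding error on final converged performance.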

This is an application where you can scrape huge amounts of code from Github and the rest of the Internet (literally terabytes), so it's unsurprising that you can reach the parity point.

Comment by gwern on AI-Based Code Generation Using GPT-J-6B · 2021-08-12T18:11:06.778Z · LW · GW

The approach of just generative self-supervised learning on existing source code corpuses is picking the low-hanging fruit. As impressive as it is to see Codex just knock out a web scraper, coding is very much a colonization wave sort of place: standalone code is fine, but the bulk of the work has always been maintenance and debugging of existing systems, not spitting out little self-contained scripts. Because of this asymmetry, Codex is a meaningful step towards UFAI, but is a smaller step in terms of automating programmers.

Comment by gwern on New GPT-3 competitor · 2021-08-12T18:05:27.309Z · LW · GW

The interesting thing about Jurassic-1 is that it really doesn’t go much beyond GPT-3.

No, the interesting thing is that it's available as a public API. It took 13 months for an OA API competitor to emerge, but now it's here and the OA API has a real competitor, and someone who will be happy to pick up many of the customers OA has driven away with its increasingly heavy-handed, arbitrary, and last-minute restrictions. (The tokenizer and better width vs depth scaling is trivial by comparison.)

The models came before, but not an API/SaaS. GPT-3 was already matched/exceeded by the dense models HyperClova & PanGu-α, and possibly MUM/LaMDA/Pathways/the Wu Daos*, but none of those are meaningfully publicly accessible, and so came and went. Jurassic-1 is available as an API, and is even free right now. That is very different, in much the same way that GPT-J is being so heavily used by everyone locked out of the OA API because it is available for free. "Free [public] is different."

* details are sparse on all these, including the nature of any sparsity

Comment by gwern on Bring up Genius · 2021-08-08T23:42:06.075Z · LW · GW

I don't think that's true either, though. Early specialization requires solving an almost impossible prediction problem (it's difficult enough to know what would be the 'right' field for a teenager or young adult; how are you going to do it for a <5yo? This is the same reason that high-IQ elementary schools can't work); people nevertheless continue to try to do what Polgar says, and yet, we don't see kids trained from toddlerhood dominating the elite reaches of every field. Early training is of pretty dubious value: when we look at early childhood interventions like Head Start, the gains fade out, and there are plenty of places like, I believe, Finland, which start education late and see no problem from this. (I think Scott also discussed this for homeschooling and in his graduation post.) "T-shaped" expertise requires a lot of exploration to gain breadth and figure out where to specialize, and for every Polgar, there's a late bloomer (iirc, Epstein in The Sports Gene - which I liked far more than Bring Up Genius - gives many athletic examples, and made it a major focus of his 2019 Range: Why Generalists Triumph in a Specialized World, which I haven't read yet); and you have newer results like "What Makes a Champion? Early Multidisciplinary Practice, Not Early Specialization, Predicts World-Class Performance", Güllich et al 2021, which find the opposite of this claim:

What explains the acquisition of exceptional human performance? Does a focus on intensive specialized practice facilitate excellence, or is a multidisciplinary practice background better? We investigated this question in sports. Our meta-analysis involved 51 international study reports with 477 effect sizes from 6,096 athletes, including 772 of the world’s top performers. Predictor variables included starting age, age of reaching defined performance milestones, and amounts of coach-led practice and youth-led play (e.g., pickup games) in the athlete’s respective main sport and in other sports. Analyses revealed that (a) adult world-class athletes engaged in more childhood/adolescent multisport practice, started their main sport later, accumulated less main-sport practice, and initially progressed more slowly than did national-class athletes; (b) higher performing youth athletes started playing their main sport earlier, engaged in more main-sport practice but less other-sports practice, and had faster initial progress than did lower performing youth athletes; and (c) youth-led play in any sport had negligible effects on both youth and adult performance. We illustrate parallels from science: Nobel laureates had multidisciplinary study/working experience and slower early progress than did national-level award winners. The findings suggest that variable, multidisciplinary practice experiences are associated with gradual initial discipline-specific progress but greater sustainability of long-term development of excellence.

...On the other hand, Sir Chris Hoy, the most successful racing cyclist of all time, did not start track cycling until age 17 and won his first gold medal at age 26 (Mackay, 2017). College basketball player Donald Thomas started practicing the high jump at age 22 and became world champion in the high jump at age 23 (Denman, 2007). Furthermore, athletes widely regarded as the greatest of all time in their sports, Roger Federer, Michael Jordan, Wayne Gretzky, Michael Phelps, and Sir Chris Hoy, all played a diverse range of sports throughout childhood and adolescence rather than specializing in their main sport at an early age (Epstein, 2019; Landers, 2017; Hawkins, 2014; Mackay, 2017; DeHority, 2020).

...This research focused on sports, but analogous findings have been reported for at least one nonathletic domain: science. Graf 2015 [Die Wissenschaftselite Deutschlands: Sozialprofil und Werdegänge zwischen 1945 und 2013] examined the biographies of the 48 German Nobel laureates in physics, chemistry, economy, and medicine/physiology since 1945. 42 had multidisciplinary study and/or working experiences. Compared with winners of the Leibniz Prize---Germany's highest national science award---Nobel laureates were less likely to have won a scholarship as a student and took statistically-significantly longer to earn full professorships and to achieve their award. Taken together, the observations suggest that early multidisciplinary practice is associated with gradual initial discipline-specific progress but greater sustainability of long-term development of excellence.

(I favor their "multiple-sampling-and-functional-matching hypothesis": when I read biographies, the importance of 'fitting' in a specific field that one can be obsessive about and which matches one's unique profile, seems like a critical and often underrated factor in going from being a highly talented and competent researcher, to a researcher someone would be reading or writing a bio about.)

Comment by gwern on Analysis of World Records in Speedrunning [LINKPOST] · 2021-08-08T23:30:21.962Z · LW · GW

Differences and commonalities to expect between speedrunning and technological improvement in different fields.

Is there any way to estimate how many cumulative games speedrunners have run at a given point? It is intuitive that progress should be related to the amount of effort put into it, and that the more people play a game, the further they can push the limits, which may explain a lot of the apparent heterogeneity, even if all games have a similar experience-curve exponent.

It's also interesting because the form might suggest that each attempt has an equal chance of setting a record (equal-odds rule; "On the distribution of time-to-proof of mathematical conjectures", Hisano & Sornette 2012 for math proof attempts; counting-argument in "Scaling Scaling Laws with Board Games", Jones 2021), which shows how progress comes from brute force thinking.
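The equal-odds counting argument is easy to simulate: if attempt quality is drawn i.i.d., the k-th attempt sets a new record with probability 1/k, so records accumulate only logarithmically in cumulative attempts:

```python
import random

# Simulate n i.i.d. attempts and count how many set a new (lower-is-better)
# record. The k-th attempt is a record with probability 1/k, so the expected
# record count is the harmonic number H(n) ≈ ln(n).
random.seed(0)

def count_records(n):
    best, records = float("inf"), 0
    for _ in range(n):
        t = random.random()      # this attempt's time
        if t < best:
            best, records = t, records + 1
    return records

n = 100_000
harmonic = sum(1 / k for k in range(1, n + 1))   # expected record count, H(n)
print(count_records(n), round(harmonic, 1))      # roughly a dozen records; H(n) ≈ 12.1
```

That only a dozen or so records emerge from a hundred thousand attempts is the sense in which record progressions reflect brute-force cumulative effort rather than steady per-attempt improvement.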