Brain-inspired AGI and the "lifetime anchor"

post by Steven Byrnes (steve2152) · 2021-09-29T13:09:44.141Z · LW · GW · 16 comments

Contents

  1. Assumptions for this post
  2. Thesis and outline
  3. Background: The “Lifetime Anchor” in Ajeya Cotra's draft report
  4. Why Ajeya puts very little weight on the Lifetime Anchor, and why I disagree
  5. Why Ajeya thinks the computer-vs-brain inefficiency factor should be >>1, and why I disagree
    5.1 …And indeed why the computer-vs-brain inefficiency factor should be <<1!
  6. Some other timeline-relevant considerations
    6.1 How long does it take to get from janky grad-student code to polished, scalable, parallelized, hardware-accelerated, turn-key learning algorithms?
    6.2 How long (wall-clock time) does it take to train one of these models?
    6.3 How many full-length training runs do we need?
  7. Conclusion

Last year Ajeya Cotra published a draft report on AI timelines [LW · GW]. (See also: summary and commentary by Holden Karnofsky, podcast interview with Ajeya [LW · GW].)

I commented at the time (1 [LW(p) · GW(p)],2 [LW(p) · GW(p)],3 [LW(p) · GW(p)]) in the form of skepticism about the usefulness of the "Genome Anchor" section of the report. Later I fleshed out those thoughts in my post Against Evolution as an Analogy for how Humans Will Create AGI [LW · GW], see especially the "genome=code" analogy table near the top [LW · GW].

In this post I want to talk about a different section of the report: the "Lifetime Anchor".

1. Assumptions for this post

Here are some assumptions. I don’t exactly believe them—let alone with 100% confidence—but for the purpose of this post let’s say I do. I’m not going to present any evidence for or against them here. Think of it as the Jeff Hawkins perspective [LW · GW] or something.

ASSUMPTION 1: There’s a “secret sauce” of human intelligence, and it looks like a learning algorithm (and associated inference algorithm).

ASSUMPTION 2: It’s a fundamentally different learning algorithm from deep neural networks. I don’t just mean a different neural network architecture, regularizer, etc. I mean really different, like “involving probabilistic program inference algorithms” or whatever.

ASSUMPTION 3: The algorithm is human-legible, but nobody knows how it works yet.

ASSUMPTION 4: We'll eventually figure out this “secret sauce” and get Transformative AI (TAI). [Note added for clarification: To simplify the discussion, I'm assuming that when this is all happening, we don't already have TAI independently via some unrelated R&D path.]

If you think these assumptions are all absolutely 100% wrong, well, I guess you might not find this post very interesting.

To be clear, Ajeya pretty much explicitly rejected these assumptions when writing her report [LW · GW] (cf. discussion of “algorithmic breakthroughs” here [LW · GW]), so there's no surprise that I wind up disagreeing with what she wrote. Maybe I shouldn't even be using the word "disagree" in this post. Oh well; her report is still a good starting point / foil for present purposes.

2. Thesis and outline

I will argue that under those assumptions, once we understand that “secret sauce”, it’s plausible that we will then be <10 years away from optimized, tested, well-understood, widely-used, industrial-scale systems for training these models all the way to TAI.

I’ll also argue that training these models from scratch will plausibly be easily affordable, as in <$10M—i.e., a massive hardware overhang [? · GW].

(By “plausible” I mean >25% probability I guess? Sorry, I’m not at the point where I can offer a probability distribution that isn’t pulled out of my ass.)

Outline of the rest of this post: First I’ll summarize and respond to Ajeya’s discussion of the “Lifetime Anchor” (which is not exactly the scenario I’m talking about here, but close). Then I’ll talk (somewhat speculatively) about time and cost involved in refactoring and optimizing and parallelizing and hardware-accelerating and scaling the new algorithm, and in doing training runs.

3. Background: The “Lifetime Anchor” in Ajeya Cotra's draft report

In Ajeya's draft report [LW · GW], one of the four bases for estimating TAI timelines is the so-called “Lifetime Anchor”.

She put it in the report but puts very little stock in it: she only gives it 5% weight.

What is the “Lifetime Anchor”? Ajeya starts by estimating that simulating a brain from birth to adulthood would involve a median estimate of 1e24 floating-point operations (FLOP). This comes from 1e24 FLOP ≈ 1e15 FLOP/s × 30 years, with 1e15 FLOP/s being roughly the median estimate in Joe Carlsmith’s report, and 30 years being roughly the span from birth to adulthood (which rounds to a nice even 1e9 seconds). Actually, she uses the term “30 subjective years” to convey the idea that if we do a 10×-sped-up simulation of the brain, then the same training would take 3 years of wall-clock time, for example.

A 1e24 FLOP computation would cost about $10M in 2019, she says, and existing ML projects (like training AlphaStar at 1e23 FLOP) are already kinda in that ballpark. So 1e24 FLOP is ridiculously cheap for a transformative world-changing AI. (Memory requirements are also relevant, but I don’t think they change that picture, see footnote.[1])
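
(To make the arithmetic explicit, here’s a minimal sketch in Python. The $/FLOP figure is my own assumption, back-solved from the ~$10M claim above, not a number from the report.)

```python
# Back-of-envelope behind the Lifetime Anchor's central estimate.
SECONDS_PER_YEAR = 3.15e7

brain_flop_per_s = 1e15      # ~median from Joe Carlsmith's brain-compute report
subjective_years = 30        # roughly birth-to-adulthood

lifetime_flop = brain_flop_per_s * subjective_years * SECONDS_PER_YEAR
print(f"lifetime FLOP ≈ {lifetime_flop:.1e}")           # ≈ 1e24

usd_per_flop = 1e-17         # assumed 2019-ish price, chosen to match ~$10M per 1e24 FLOP
print(f"training cost ≈ ${lifetime_flop * usd_per_flop / 1e6:.0f}M")   # ≈ $9-10M
```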

OK, so far she has a probability distribution centered at 1e24 FLOP, proportional to the distribution she derived from Joe Carlsmith’s report. She then multiplies by a, let’s call it, “computer-vs-brain inefficiency factor” that she represents as a distribution centered at 1000. (I’ll get back to that.) Then there’s one more step of ruling out extremely-low-compute scenarios. (She rules them out for reasons that wouldn't apply to the scenario of Section 1 that I'm talking about here.) She combines this with estimates of investment and incremental algorithmic improvements and Moore's law and so on, and she winds up with a probability distribution for what year we'll get TAI. That's her “lifetime anchor”.
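
(Schematically, the construction looks something like the toy Monte Carlo below. To be clear, this is my own sketch of the structure, not Ajeya’s code; the spread parameters are made-up placeholders, and I’m omitting the later steps, like ruling out low-compute scenarios and modeling investment and hardware trends.)

```python
# Toy sketch of the Lifetime Anchor's structure (placeholder spreads, not the report's).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Distribution over "FLOP to simulate a brain from birth to adulthood", median 1e24
lifetime_flop = 10 ** rng.normal(loc=24.0, scale=1.0, size=n)

# "Computer-vs-brain inefficiency factor", median 1000 (i.e. 3 OOM)
inefficiency = 10 ** rng.normal(loc=3.0, scale=1.0, size=n)

tai_training_flop = lifetime_flop * inefficiency
print(f"median TAI training FLOP ≈ {np.median(tai_training_flop):.1e}")   # ≈ 1e27
```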

4. Why Ajeya puts very little weight on the Lifetime Anchor, and why I disagree

Ajeya cites two reasons she doesn’t like the lifetime anchor.

First, it doesn’t seem compatible with the empirical model size and training estimates for current deep neural networks:

I think the most plausible way for this hypothesis to be true would be if a) it turns out we need a smaller model than I previously assumed, e.g. ~1e11 or ~1e12 FLOP / subj sec with a similar number of parameters, and b) that model could be trained on a very short horizon ML problem, e.g. 1 to 10 seconds per data point. Condition a) seems quite unlikely to me because it implies our architectures are much more efficient than brain architectures discovered by natural selection; I don’t think we have strong reason to expect this on priors and it doesn’t seem consistent with evidence from other technological domains. Condition b) seems somewhat unlikely to me because it seems likely by default that transformative ML problems have naturally long horizon lengths because we may need to select for abilities that evolution optimized for, and possible measures to get around that may or may not work.  

Why I disagree: As in Section 1, the premise of this post is that the human brain algorithm is a fundamentally different type of learning algorithm than a deep neural network. Thus I see no reason to expect that they would have the same scaling laws for model size, training data, etc.

Second, the implication is that training TAI is so inexpensive that we could have been doing it years ago. As she writes:

Another major reason for skepticism is that (even with a median ~3 OOM larger than the human lifetime) this hypothesis implies a substantial probability that we could have trained a transformative model using less computation than the amount used in the most compute intensive training run of 2019 (AlphaStar at ~1e23 FLOP), and a large probability that we could have done so by spending only a few OOMs more money (e.g. $30M to $1B). I consider this to be a major point of evidence against it, because there are many well-resourced companies who could have afforded this kind of investment already if it would produce a transformative model, and they have not done so. See below for the update I execute against it.

Why I disagree: Again as in Section 1, the premise of this post is that nobody knows how the algorithm works. People can’t use an algorithm that doesn’t yet exist.

5. Why Ajeya thinks the computer-vs-brain inefficiency factor should be >>1, and why I disagree

Ajeya mentions a few reasons she wants to center her computer-vs-brain-inefficiency-factor distribution at 1000. I won’t respond to all of these, since some would involve a deep-dive into neuroscience that I don’t want to get into here. But I can respond to a couple.

First, deep neural network data requirements:

Many models we are training currently already require orders of magnitude more data than a human sees in one lifetime.

Why I disagree: Again under the assumptions of Section 1, “many models we are training” are very different from human brain learning algorithms. Presumably human brain-like learning algorithms will have similar sample efficiency to actual human brain learning algorithms, for obvious reasons.

Second, she makes a reference-class argument using other comparisons between biological and human artifacts:

Brain FLOP/s seems to me to be somewhat more analogous to “ongoing energy consumption of a biological artifact” while lifetime FLOP seems to be more analogous to “energy required to manufacture a biological artifact”; Paul’s brief investigation comparing human technologies to natural counterparts, which I discussed in Part 1, found that the manufacturing cost of human-created artifacts tend to be more like ~3-5 OOM worse than their natural counterparts, whereas energy consumption tends to be more like ~1-3 OOM worse.

Why I disagree: Ajeya mentions two reference class arguments here: (1) “human-vs-brain FLOP/s ratio” is hypothesized to fit into the reference class of “human-artifact-vs-biological-artifact ongoing energy consumption ratio”; and (2) “human-vs-brain lifetime FLOP” is hypothesized to fit into the reference class of “human-artifact-vs-biological-artifact manufacturing energy”.

Under my assumptions here, the sample efficiency of brains and silicon should be similar—i.e., if you run similar learning algorithms on similar data, you should get similarly-capable trained models at the end. So from this perspective, the two ratios have to agree—i.e., these are two reference classes for the very same quantity. That’s fine; in fact, Ajeya’s median estimate of 3 OOM is nicely centered between the ~1-3 OOM reference class and the ~3-5 OOM reference class.

But I actually want to reject both of those numbers, because I think Joe Carlsmith’s report has already “priced in” human inefficiency by translating from neuron-centric metrics (number of neurons, synapses, etc.) to silicon-centric metrics (FLOPs). (And then we estimated costs based on known $/FLOP of human ML projects.) So when we talk about FLOPs, we’ve already crossed over into human-artifact-world! It would be double-counting to add extra OOMs for human inefficiency.

Here's another way to make this same point: think about energy usage. Joe Carlsmith’s report says we need (median) 1e15 FLOP/s to simulate a brain. Based on existing hardware (maybe 5e9 FLOP/joule? EDIT: …or maybe the hardware is much more efficient than that, making the following power figure much lower; see comment [LW(p) · GW(p)]), that implies (median) 200kW to simulate a brain. (Hey, $20/hour electricity bills, not bad!) Actual brains are maybe 20W, so we’re expecting our brain simulation to be about 4 OOM less energy-efficient than a brain. OK, fine.

…But now suppose I declare that in general, human artifacts are 3 OOM less efficient than biological artifacts. So we should really expect 4+3=7 OOM less energy efficiency, i.e. 200MW! I think you would say: that doesn’t make sense, it’s double-counting! That’s what I would say, anyway! And I’m suggesting that the above draft report excerpt is double-counting in an analogous way.
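
(Spelling out that back-of-envelope, using the same figures as above:)

```python
# The energy arithmetic from the two paragraphs above, spelled out.
brain_sim_flop_per_s = 1e15   # Carlsmith median for simulating a brain
flop_per_joule = 5e9          # rough figure used above (possibly too pessimistic; see EDIT)

sim_power_w = brain_sim_flop_per_s / flop_per_joule
print(f"simulation power ≈ {sim_power_w / 1e3:.0f} kW")          # ≈ 200 kW

brain_power_w = 20
print(f"energy inefficiency vs. brain ≈ {sim_power_w / brain_power_w:.0e}×")  # ≈ 4 OOM

# Tacking on a *further* generic "3 OOM human-artifact penalty" would give ~200 MW;
# but the 4 OOM above already *is* the penalty for working in silicon, so adding
# more would be double-counting.
```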

5.1 …And indeed why the computer-vs-brain inefficiency factor should be <<1!

My best guess for the inefficiency factor is actually <<1! (…At least, that’s my guess after a few years of people using these algorithms and picking the low-hanging fruit of implementing them efficiently.)

Why? Compare the following two possibilities:

  • Writing code that directly implements the operating principles of the brain’s learning algorithm, in whatever form runs most efficiently on silicon chips.
  • Using our silicon chips to run a neuron-by-neuron simulation of a brain that is itself running that learning algorithm (which is what Joe Carlsmith’s FLOP/s estimate quantifies).

Doing the second bullet point gets us an inefficiency factor of 1, by definition. But the second bullet point is bound to be far more inefficient than the first.

By analogy: If I want to multiply two numbers with my laptop, I can do it in nanoseconds directly, or I can do it dramatically slower by using my laptop to run a transistor-by-transistor simulation of a pocket calculator microcontroller chip.

Or here’s a more direct example: There’s a type of neuron circuit called a “central pattern generator”. (Fun topic by the way, see here [AF · GW].) A simple version might involve, for example, 30 neurons wired up in a particular way so as to send a wave of activation around and around in a loop forever. Let’s say (hypothetically) that this kind of simple central pattern generator is playing a role in an AGI-relevant algorithm. The second bullet point above would be like doing a simulation of those 30 neurons and all their interconnections. The first bullet point above would be like writing the one line of source code, “y = sin(ωt+φ)”, and then compiling that source code into assembly language. I think it’s obvious which one would require less compute!
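
(Here’s a cartoon version of that comparison, with made-up parameters, just to show where the compute goes; it’s an illustration of the cost scaling, not a claim about real CPG circuits.)

```python
# Cartoon comparison: "compiled" central pattern generator vs. neuron-level simulation.
import math

def cpg_direct(t, omega=1.0, phi=0.0):
    """The 'y = sin(ωt + φ)' version: constant work per query."""
    return math.sin(omega * t + phi)

def cpg_simulated(t, n_neurons=30, dt=1e-3, omega=1.0, phi=0.0):
    """Stand-in for stepping a 30-neuron ring through time so a wave of
    activation circulates: work scales as n_neurons * (t / dt)."""
    phase = phi
    neuron_updates = 0
    for _ in range(int(t / dt)):
        neuron_updates += n_neurons   # pretend we update every neuron's state this step
        phase += omega * dt
    return math.sin(phase), neuron_updates

y1 = cpg_direct(10.0)
y2, updates = cpg_simulated(10.0)
print(f"direct: {y1:.3f} | simulated: {y2:.3f} after {updates:,} neuron-updates")
```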

(Silicon chips are maybe 7 OOM faster than brains. A faster but less parallel processor can emulate a slower but more parallel processor, but not vice-versa. So there’s a whole world of possible algorithm implementation strategies that brains cannot take advantage of but that we can—directly calculating sin(ωt+φ) is just one example.)

The scenario I’m talking about (see assumptions in Section 1) is the first bullet point above, not the second. So I consider an inefficiency factor <<1 to be a default expectation, again leaving aside the very earliest thrown-together implementations.

6. Some other timeline-relevant considerations

6.1 How long does it take to get from janky grad-student code to polished, scalable, parallelized, hardware-accelerated, turn-key learning algorithms?

On the assumptions of Section 1, a brain-like learning algorithm would be sufficiently different from DNNs that some of the existing DNN-specific infrastructure would need to be re-built (things like PyTorch, TPU chips, pedagogical materials, a trained workforce, etc.).

How much time would that add?

Well, I’ll try to draw an analogy with the history of DNNs (warning: I’m not terribly familiar with that history).

AlexNet was 2012, DeepMind patented deep Q learning in 2014, the first TensorFlow release was 2015, the first PyTorch release was 2016, the first TPU was 2016, and by 2019 we had billion-parameter GPT-2.

So, maybe 7 years?

But that may be an overestimate. I think a lot of the deep neural net infrastructure will carry over to even quite different future ML algorithms. For example, the building up of people and money in ML, the building up of GPU servers and the tools to use them, the normalization of the idea that it’s reasonable to invest millions of dollars to train one model and to fab ML ASICs, the proliferation of expertise related to parallelization and hardware-acceleration, etc.—all these things would transfer directly to future human-brain-like learning algorithms. So maybe they’ll be able to develop in less time than it took DNNs to develop in the 2010s.

So, maybe the median guess should be somewhere in the range of 3-6 years?

6.2 How long (wall-clock time) does it take to train one of these models?

Should we expect engineers to be twiddling their thumbs for years and years while their training runs grind along? If so, that would obviously add to the timeline.

The relevant factor here is limits to parallelization. If there weren’t limits to parallelization, you could make wall-clock time arbitrarily low by buying more processing power. For example, AlphaStar training took 14 days and totaled 1e23 FLOP, so it’s presumably feasible to squeeze a 1e24-FLOP, 30-subjective-year training run into 14×10=140 days—i.e., 80 subjective seconds per wall-clock second. With more money, and another decade or two of technological progress, and a brain-vs-computer inefficiency factor <<1 as above, it would be even faster. But that case study only works if our future brain-like algorithms are at least as parallelizable as AlphaStar was.
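
(The arithmetic, for what it’s worth; the conclusion only holds under the parallelizability caveat just mentioned.)

```python
# Wall-clock arithmetic from the AlphaStar comparison above.
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 3.15e7

alphastar_flop, alphastar_days = 1e23, 14
target_flop, subjective_years = 1e24, 30

wallclock_days = alphastar_days * (target_flop / alphastar_flop)
speedup = (subjective_years * SECONDS_PER_YEAR) / (wallclock_days * SECONDS_PER_DAY)
print(f"{wallclock_days:.0f} days wall-clock ≈ {speedup:.0f} subjective seconds "
      f"per wall-clock second")   # → 140 days, ~80×
```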

Maybe my starting point should be AI Impacts’ “Brain Performance in TEPS” writeup? That comparison implies that existing supercomputers—as of the 2015 writeup—were not quite capable of real-time brain simulation (1 subjective second per wall-clock second), but they were within an order of magnitude. This makes it seem unlikely that we can get orders of magnitude faster than real-time. So, maybe we’ll be running our training algorithms for decades after all??

I’m not so sure. I still think it might well be much faster.

The most important thing is: I’m not a parallelization expert, but I assume that chip-to-chip connections are the bottleneck for the TEPS benchmark, not within-chip connections. (Someone please tell me if I’m wrong!) If I understand correctly, TEPS assumes that data is sent from an arbitrary node in the graph to a randomly-chosen different arbitrary node in the graph. So for a large calculation (more than a few chips), TEPS implicitly assumes that almost all connections are chip-to-chip. However, I think that in a brain simulation, data transmission events would be disproportionately likely to be within-chip.

For example, with adult brain volume of 1e6 mm³, and an AlphaStar-like 400 silicon chips, naively each chip might cover about (13.5mm)³ of brain volume. So any neuron-to-neuron connection much shorter than 13.5mm is likely to translate to within-chip communication, not chip-to-chip. Then the figures at this AI Impacts page imply that almost all unmyelinated fiber transmission would involve within-chip communication, and thus, chip-to-chip communication would mainly consist of:

Recall that the headline figure of “brain performance in TEPS” was 1.8e13–6.4e14. So the above is ~3 OOM less! If I didn’t mess up, I infer that the gap comes from a combination of (1) disproportionate numbers of short connections, which turn into within-chip communications, and (2) the fact that a single long-range myelinated axon may connect to a bunch of neurons near its terminal, which from a chip-to-chip-communications perspective looks like just one connection.
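
(Here’s the per-chip geometry behind the 13.5 mm figure above, with the brain-volume and chip-count assumptions made explicit.)

```python
# Geometry behind the "each chip covers ~(13.5 mm)^3" figure.
brain_volume_mm3 = 1e6    # rough adult brain volume used above
n_chips = 400             # AlphaStar-like chip count assumed above

volume_per_chip_mm3 = brain_volume_mm3 / n_chips
side_mm = volume_per_chip_mm3 ** (1 / 3)
print(f"each chip covers ≈ ({side_mm:.1f} mm)^3 of brain tissue")   # ≈ (13.6 mm)^3
# Connections much shorter than this mostly stay on-chip, so only a small
# fraction of traffic needs to cross the (slow) chip-to-chip links.
```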

Some other considerations that seem to point in the direction of “wall-clock training time probably won’t be years and years”:

Update to add: Here’s another possible objection. Training requires both compute and data. Even if we can muster enough compute, what if data is a bottleneck? In particular, suppose for the sake of argument that the only way to train a model to AGI involves having the model control a real-world robot which spends tens of thousands of hours of serial time manipulating human-sized objects and chatting with humans. (And suppose also that “parallel experiences” [LW(p) · GW(p)] wind up being impossible.) Then that would limit model training speed, even if we had infinitely fast computers. However, I view that possibility as highly unlikely—see my discussion of “embodiment” in this post (Section 1.5) [AF · GW]. My strong expectation is that future programmers will be able to make AGI just fine by feeding it YouTube videos, books, VR environments, and other such easily-sped-up data sources, with comparatively little real-world-manipulation experience thrown in at the very end. (After all, going in the opposite direction, humans can learn very quickly to get around in a VR environment after a lifetime in the real world.)

6.3 How many full-length training runs do we need?

If a “full-length training run” is the 30 subjective years or whatever, then an additional question is: how many such runs will we need to get TAI? I’m inclined to say: as few as 1 or 2, plus lots and lots of smaller-scale studies. For example, I believe there was one and only one full training run of GPT-3—all the hyperparameters were extrapolated from smaller-scale studies, and it worked well enough the first time.

Note also that imperfect training runs don’t necessarily need to be restarted from scratch; the partially-trained model may well be salvageable, I’d assume. And it’s possible to run multiple experiments in parallel, especially when there’s a human in the loop contextualizing the results.

So anyway, combining this and the previous subsection, I think it’s at least plausible for “wall-clock time spent running training” to be a minor contributor to TAI timelines (say, adding <5 years). That’s not guaranteed, just plausible. (As above, "plausible" = ">25% probability I guess").

7. Conclusion

I’ll just repeat what I said in Section 2 above: if you accept the assumptions in section 1, I think we get the following kind of story:

We can’t train a lifetime-anchor model today because we haven’t pinned down the brain-like learning algorithms that would be needed for it. But when we understand the secret sauce, we could plausibly be <10 years away from optimized, tested, well-understood, widely-used, industrial-scale systems for training these models all the way to TAI. And this training could plausibly be easily affordable, as in <$10M—i.e., a MASSIVE hardware overhang.

(Thanks Dan Kokotajlo & Logan Smith for critical comments on drafts.)

  1. ^

    Warning: FLOP is only one of several inputs to an algorithm. Another input worth keeping in mind is memory. In particular, the human neocortex has ≈10^14 synapses. How this number translates into (for example) GB of GPU memory is complicated, and I have some uncertainty, but I think my Section 6.2 scenario (involving an AlphaStar-like 400 chips) does seem to be in the right general ballpark for not only FLOP but also memory storage.
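
    (A rough version of that memory estimate, with my own assumed bytes-per-synapse and per-chip memory figures; both are guesses, not numbers from the report.)

```python
# Rough memory back-of-envelope for this footnote (assumed figures, not from the report).
n_synapses = 1e14            # ~neocortex synapse count
bytes_per_synapse = 1        # assumption: order-1 byte of weight/state per synapse

model_memory_tb = n_synapses * bytes_per_synapse / 1e12
print(f"model state ≈ {model_memory_tb:.0f} TB")                 # ≈ 100 TB

n_chips, gb_per_chip = 400, 80   # e.g. 400 accelerators with ~80 GB each (assumed)
cluster_memory_tb = n_chips * gb_per_chip / 1e3
print(f"cluster memory ≈ {cluster_memory_tb:.0f} TB")            # ≈ 32 TB, same ballpark
```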

  2. ^

    I assumed the axons and dendrites are locally isotropic (equally likely to go any direction); that gives a factor of 2 from averaging cos θ over a hemisphere.

  3. ^

    I asked Nick Turner and he kindly downloaded three random little volumes from this dataset and counted how many things crossed the z=0 plane, as a very rough estimate. By the way, it was mostly axons not dendrites, like 10:1 ratio or something, in case anyone’s interested.

16 comments


comment by lsusr · 2021-09-29T23:57:18.851Z · LW(p) · GW(p)

How long does it take to get from janky grad-student code to polished, scalable, parallelized, hardware-accelerated, turn-key learning algorithms?

It took me, on a team with two others, less than a year to turn a janky paper (explaining, in math, a new machine learning algorithm which kinda sorta did its job) into a functional, scalable, parallelizable, hardware-accelerated[1] learning algorithm. We didn't just build a library. We used the algorithm to solve a previously unsolved real-world problem.

The project didn't take much compute. I ran the whole thing on an old laptop.


  1. We ran it many times per second on tiny power-optimized PCBs that can be worn on one's wrist. ↩︎

comment by gwern · 2021-10-02T15:21:21.310Z · LW(p) · GW(p)

The parallelization discussion seems offbase to me. While it is of course important that any individual instance runs not too absurdly slowly, how much faster than realtime it runs isn't that important, because you would be running many of them in parallel, no? AlphaZero trained in a few wallclock hours not by blazing through games in mere nanoseconds, but by having hundreds or thousands of actors in parallel playing through games at a reasonable speed like 0.05s per turn or something. Or OA5 used minibatches of millions of experiences, and GPT-3 had minibatches of like millions of tokens, IIRC.

If we look at the gradient noise scale, the more complicated the 'task' (ie set of tasks), the larger the batch size you need/can use before you are just wasting compute by overly-precisely estimating the gradient for the next update. Presumably any AGI would be training on a lot of tasks as complicated as Go or English text or DoTA2 or more complicated: generative and discriminatory multimodal training on text, video, and photos, DRL training on a bazillion games and procedurally-generated tasks, and so on, and so the optimal minibatch size would be quite large... Unless the hardware overhang is vastly more extreme than anyone anticipates (in which case the debate would be moot for other reasons), it seems like the most plausible answer for "how much parallel hardware can my seed AGI use?" is going to be "how much ya got?".

This doesn't guarantee a fast wallclock, of course, but it's worth noting that in the limit of (full-batch, not stochastic minibatching) gradient descent, you can generally take large steps and converge in relatively few serial iterations compared to SGD. (Bunch of papers on scaling up CNNs to training on thousands of GPUs simultaneously to converge in minutes to seconds rather than days or weeks on smaller but more efficient clusters; yesterday I saw Geiping et al 2021 whose CNN requires 3,000 serial fullbatch iterations vs SGD's 117,000 serial minibatch iterations, so hypothetically, you could finish in 39x less wallclock if you had ~unlimited compute.)

So even for an incredibly complicated family of tasks, as long as the individual instances can be run at all, the wallclock is potentially quite low because you have model parallelism out the wazoo within and across all of the tasks & modalities & problems, and only need to take relatively few serial updates.

Replies from: steve2152, jacob_cannell
comment by Steven Byrnes (steve2152) · 2021-10-02T20:20:52.197Z · LW(p) · GW(p)

Thanks, that's really helpful. I'm going to re-frame what you're saying in the form of a question:

The parallel-experiences question:

Take a model which is akin to an 8-year-old's brain. (Assume we deeply understand how the learning algorithm works, but not how the trained model works.) Now we make 10 identical copies of that model. For the next hour, we tell one copy to read a book about trucks, and we tell another copy to watch a TV show about baking, and we tell a third copy to build a sandcastle in a VR environment, etc. etc., all in parallel.

At the end of the hour, is it possible to take ALL the things that ALL ten copies learned, and combine them into one model—one model that now has new memories/skills pertaining to trucks AND baking AND sandcastles etc.—and it's no worse than if the model had done those 10 things in series?

What's the answer to this question?

Here are three possibilities:

  • How an ML practitioner would probably answer this question: I think they would say "Yeah duh, we've been doing that in ML since forever." For my part, I do see this as some evidence, but I don't see it as definitive evidence, because the premise of this post (see Section 1) is that the learning algorithms used by ML practitioners today are substantially different from the within-lifetime learning algorithm used in the brain.
  • How a biologist would probably answer this question: I think they would say the exact opposite: "No way!! That's not something brains evolved to do, there's no reason to expect it to be possible and every reason to think it isn't. You're just talking sci-fi nonsense."
    • (Well, they would acknowledge that humans working on a group project could go off and study different topics, and then talk to each other and hence teach each other what they've learned. But that's kind of a different thing than what we're talking about here. In particular, for non-superhuman AIs-in-training, we already have tons of pedagogical materials like human textbooks and lectures. So I don't see teams-of-AIs-who-talk-to-each-other being all that helpful in getting to superhuman faster.)
  • How I would answer this question: Well I hadn't thought about it until now, but I think I'm in between. On the one hand, I do think there are some things that need to be learned serially in the human brain learning algorithm. For example, there's a good reason that people learn multiplication before exponentiation, and exponentiation before nonabelian cohomology, etc. But if the domains are sufficiently different, and if we merge-and-re-split frequently enough, then I'm cautiously optimistic that we could do parallel experiences to some extent, in order to squeeze 30 subjective years of experience into <30 serial subjective years of experience. How much less than 30, I don't know.

Anyway, in the article I used the biologist answer: "the human brain within-lifetime learning algorithm is not compatible with parallel experiences". So that would be the most conservative / worst-case assumption.

I am editing the article to note that this is another reason to suspect that training might be faster than the worst-case. Thanks again for pointing that out.

Replies from: gwern
comment by gwern · 2021-10-02T22:05:38.944Z · LW(p) · GW(p)

The biologist answer there seems to be question-begging. What reason is there to think it isn't? Animals can't split and merge themselves or afford the costs or store datasets for exact replay etc, so they would be unable to do that whether or not it was possible, and so they provide zero evidence about whether their internal algorithms would be able to do it. You might argue that there might be multiple 'families' of algorithms all delivering animal-level intelligence, some of which are parallelizable and some not, and for lack of any incentive animals happened to evolve a non-parallelizable one, but this is pure speculation and can't establish that the non-parallelizable one is superior to the others (much less is the only such family).

From the ML or statistics view, it seems hard for parallelization in learning to not be useful. It's a pretty broad principle that more data is better than less data. Your neurons are always estimating local gradients with whatever local learning rule they have, and these gradients are (extremely) noisy, and can be improved by more datapoints or rollouts to better estimate the update that jointly optimizes all of the tasks; almost by definition, this seems superior to getting less data one point at a time and doing noisy updates neglecting most of the tasks.

If I am a DRL agent and I have n hypotheses about the current environment, why am I harmed by exploring all n in parallel with n copied agents, observing the updates, and updating my central actor with them all? Even if they don't produce direct gradients (let's handwave an architecture where somehow it'd be bad to feed them all in directly, maybe it's very fragile to off-policyness), they are still producing observations I can use to update my environment model for planning, and I can go through them and do learning before I take any more actions. (If you were in front of a death maze and were watching fellow humans run through it and get hit by the swinging blades or acid mists or ironically-named boulders, you'd surely appreciate being able to watch as many runs as possible by your fellow humans rather than yourself running it.)

In particular, for non-superhuman AIs-in-training, we already have tons of pedagogical materials like human textbooks and lectures. So I don't see teams-of-AIs-who-talk-to-each-other being all that helpful in getting to superhuman faster.

If we look at some of these algorithms, it's even less compelling to argue that there's some deep intrinsic reason we want to lock learning to small serial steps - look at expert iteration in AlphaZero, where the improved estimates that the NN is repeatedly retrained on don't even come from the NN itself, but an 'expert' (eg a NN + tree search); what would we gain by ignoring the expert's provably superior board position evaluations (which would beat the NN if they played) and forcing serial learning? At least, given that MuZero/AlphaZero are so good, this serial biological learning process, whatsoever it may be, has failed to produce superior results to parallelized learning, raising questions about what circumstances exactly yield these serial-mandatory benefits...

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2021-10-03T01:37:40.513Z · LW(p) · GW(p)

The biologist answer there seems to be question-begging

Yeah, I didn't bother trying to steelman the imaginary biologist. I don't agree with them anyway, and neither would you.

(I guess I was imagining the biologist belonging to the school of thought (which again I strongly disagree with) that says that intelligence doesn't work by a few legible algorithmic principles, but is rather a complex intricate Rube Goldberg machine, full of interrelated state variables and so on. So we can't just barge in and make some major change in how the step-by-step operations work, without everything crashing down. Again, I don't agree, but I think something like that is a common belief in neuroscience/CogSci/etc.)

it seems hard for parallelization in learning to not be useful … why am I harmed …

I agree with "useful" and "not harmful". But an interesting question is: Is it SO helpful that parallelization can cut the serial (subjective) time from 30 years to 15 years? Or what about 5 years? 2 years? I don't know! Again, I think at least some brain-like learning has to be serial (e.g. you need to learn about multiplication before nonabelian cohomology), but I don't have a good sense for just how much.

comment by jacob_cannell · 2021-12-20T18:29:49.139Z · LW(p) · GW(p)

We've decoded much of the brain, but it's still mysterious what the brain's backprop equivalent learning algorithm is, and how it seems to learn so quickly at batch size 1, sidestepping all these gradient noise considerations.

A human may read/hear/think on the order of a billion-ish words per lifetime, or less? GPT-3 trained on a few OOM more, and still would require many OOM more compute/data to hit human perf. DeepMind's Atari agents need about 10^8 frames to match humans and thus are roughly ~3 OOM less data efficient, ignoring human pretraining (true also for EZ, which just uses simulated frames).

Although if you factor in 10 years of human pretraining that's about 10^8 seconds - so perhaps a big chunk of it is just generic multimodal curriculum pretraining.

comment by Rohin Shah (rohinmshah) · 2021-10-10T11:46:35.971Z · LW(p) · GW(p)

ASSUMPTION 1: There’s a “secret sauce” of human intelligence, and it looks like a learning algorithm (and associated inference algorithm).

ASSUMPTION 2: It’s a fundamentally different learning algorithm from deep neural networks. I don’t just mean a different neural network architecture, regularizer, etc. I mean really different, like “involving probabilistic program inference algorithms” or whatever.

ASSUMPTION 3: The algorithm is human-legible, but nobody knows how it works yet.

ASSUMPTION 4: We'll eventually figure out this “secret sauce” and get Transformative AI (TAI).

These seem easily like the load-bearing part of the argument; I agree the stuff you listed follows from these assumptions but why should these assumptions be true?

I can imagine justifying assumption 2, and maybe also assumption 1, using biology knowledge that I don't have. I don't see how you justify assumptions 3 and 4. Note that assumption 4 also needs to include a claim that we figure out the "secret sauce" sooner than other paths to AGI, despite lots of effort being put into them already.

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2021-10-12T19:44:08.448Z · LW(p) · GW(p)

Note that assumption 4 also needs to include a claim that we figure out the "secret sauce" sooner than other paths to AGI, despite lots of effort being put into them already.

Yup, "time until AGI via one particular path" is always an upper bound to "time until AGI". I added a note, thanks.

These seem easily like the load-bearing part of the argument; I agree the stuff you listed follows from these assumptions but why should these assumptions be true? 

The only thing I'm arguing in this particular post is "IF assumptions THEN conclusion". This post is not making any argument whatsoever that you should put a high credence on the assumptions being true. :-)

comment by jacob_cannell · 2022-10-07T17:22:10.034Z · LW(p) · GW(p)

Many models we are training currently already require orders of magnitude more data than a human sees in one lifetime.

Why I disagree: Again under the assumptions of Section 1, “many models we are training” are very different from human brain learning algorithms. Presumably human brain-like learning algorithms will have similar sample efficiency to actual human brain learning algorithms, for obvious reasons.

I updated heavily on data efficiency recently after compiling the data in this new AI timeline post [LW · GW]. Basically it turns out that successful ANNs and BNNs follow a simple general rule where model capacity is similar to total input data capacity. I was actually surprised at how well this rule holds, across a wide variety of successful NNs. For example the adult human brain has on order 1e15 bit capacity and receives about 1e16 bits of retinal input by age 30, the Chinchilla LLM has 2e12 bit capacity vs 1e13 input bits, etc etc.

comment by jacob_cannell · 2021-12-18T19:21:43.245Z · LW(p) · GW(p)

Here's another way to make this same point: think about energy usage. Joe Carlsmith’s report says we need (median) 1e15 FLOP/s to simulate a brain. Based on existing hardware (maybe 5e9 FLOP/joule?), that implies (median) 200kW to simulate a brain. (Hey, $20/hour electricity bills, not bad!)

I'd argue it's closer to 1e14 TOP/s (1e14 synapses * ~1hz mean synaptic firing rate), but doesn't matter much. TOP instead of TFLOP because floating point is unnecessary. A single A100 provides over 1e15 peak TOP/s (and about half as much peak TFLOP/s) for only 250W. An A100 is kinda expensive, but a 3090 has almost as much peak perf and costs only a few thousand $. Your energy estimate here is off by 3 OOM.

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2021-12-20T16:10:01.670Z · LW(p) · GW(p)

Hmm, just trying to understand where this difference is coming from:

Joe Carlsmith's report and you agree with each other in saying that 1e14/s is a good central guess for the frequency of a spike hitting a synapse. But Joe guesses we need 1-100 FLOP per spike-synapse, which gives a central estimate of 1e15/s, whereas you think we should stay at 1. Hmm, my own opinion is "I don't know, and deferring to a central number in Joe's report seems like a reasonable thing to do in the meantime". But if you put a gun to my head and asked me to pick my best-guess number, I would say "less than 1, at least after we tweak the algorithm implementation to be more silicon-appropriate".

Next, there's a factor of 1000 discrepancy for energy-efficiency: I wrote 5e9 FLOP/joule and you're saying that A100 is 5e12 tensor-op/J. Hmm, I seem to have gotten the 5e9 from a list of supercomputers. I imagine that the difference is a combination of (a little bit) FLOP vs OP, and (mostly) tensor-operations vs operations-on-arbitrary-numbers-pulled-from-RAM, or something. I imagine that the GPU is really optimized at doing tensor operations in parallel, and that allows way more operations for the same energy. I'm not an expert, that's just my first guess. I would agree that the GPU case is closer to what we should expect.

I added a note in the text. Thanks!

Replies from: jacob_cannell
comment by jacob_cannell · 2021-12-20T17:52:05.959Z · LW(p) · GW(p)

Carlsmith's report is pretty solid overall, and this doesn't matter much because his final posterior mean of 1e15/s is still within A100 peak perf, but the 100-FLOPs-per-spike high end is poorly justified (it's based mostly on one outlier expert) and is ultimately padding for various uncertainties:

I’ll use 100 FLOPs per spike through synapse as a higher-end FLOP/s budget for synaptic transmission. This would at least cover Sarpeshkar’s 40 FLOP estimate, and provide some cushion for other things I might be missing

GPUs dominate in basically everything over CPUs: memory bandwidth (OOM greater), general operations-on-arbitrary-numbers-pulled-from-RAM (1 to 2 OOM greater), and matrix multiplication at various bit depths (many OOM greater). CPU based supercomputers are completely irrelevant for AGI considerations.

There are many GPU competitors but they generally have similar perf characteristics, with the exception of some pushing much higher on chip scratch SRAM and higher interconnect.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2021-10-03T04:17:02.696Z · LW(p) · GW(p)

As a neuroscientist-turned-machine-learning-engineer, I have been thinking about this situation in a very similar way to that described in this article. One (perhaps) difference is that I think there are a fair number of possible algorithms/architectures that could successfully generate an agentive general learner sufficient for AGI. I think that a human-brain-similar algorithm might be the first developed, because of fairly good efficiency and having a working model to study (albeit with difficulty). On the other hand, I think it's probable that deep learning, scaled up enough, will stumble across a surprisingly effective algorithm all of a sudden with little warning (aka the lottery ticket hypothesis), risking an accidental hard take-off scenario.
I kinda hope the human brain-like algorithm actually does turn out to be the first breakthrough, since I feel like we'd have a better chance of understanding and controlling it, and noticing/measuring when we'd gotten quite close.

With the blind groping into unknown solution spaces that deep learning represents, we might find more than we'd bargained for with no warning at all. Just a sudden jump from awkward semi-competent statistical machine to powerful deceitful alien-minded agent.

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2021-10-04T15:21:31.132Z · LW(p) · GW(p)

I agree that if there are many paths to AGI, then the time-to-AGI is the duration of the shortest one, and therefore when I talk about one specific scenario, it's only an upper bound on time-to-AGI.

(Unless we can marshal strong evidence that one path to AGI would give a better / safer / whatever future than another path, and then do differential tech development [? · GW] including trying to shift energy and funding away from the paths we don't like. We don't yet have that kind of strong evidence, unfortunately, in my opinion. Until that changes, yeah, I think we're just gonna get whatever kind of AGI is easiest for humans to build.)

I guess I'm relatively skeptical about today's most popular strands of deep ML research leading to AGI, at least compared to the median person on this particular web-forum. See here [LW · GW] for that argument. I think I'm less skeptical than the median neuroscientist though. I think it's just really hard to say that kind of thing with any confidence. And also, even if it turns out that deep neural networks can't do some important-for-intelligence thing X, well somebody's just gonna glue together a deep neural network with some other algorithm that does X. And then we can have some utterly pointless semantic debate about whether it's still fundamentally a deep neural network or not. :-)

comment by Jsevillamol · 2021-12-20T16:26:44.669Z · LW(p) · GW(p)

ASSUMPTION 3: The algorithm is human-legible, but nobody knows how it works yet.

 

Can you clarify what you mean by this assumption? And how is your argument dependent on it?

Is the point that the "secret sauce" algorithm is something that humans can plausibly come up with by thinking hard about it? As opposed, maybe, to an evolution-designed nightmare that humans cannot plausibly design except by brute-forcing it?

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2021-12-20T16:59:35.007Z · LW(p) · GW(p)

Yes, what you said. The opposite of "a human-legible learning algorithm" is "a nightmarishly-complicated Rube-Goldberg-machine learning algorithm".

If the latter is what we need, we could still presumably get AGI, but it would involve some automated search through a big space of many possible nightmarishly-complicated Rube-Goldberg-machine learning algorithms to find one that works.

That would be a different AGI development story, and thus a different blog post. Instead of "humans figure out the learning algorithm" as an exogenous input to the path-to-AGI, which is how I treated it, it would instead be an output of that automated search process. And there would be much more weight on the possibility that the resulting learning algorithm would be wildly different than the human brain's, and hence more uncertainty in its computational requirements.