Source code size vs learned model size in ML and in humans?

post by riceissa · 2020-05-20T08:47:14.563Z · score: 11 (5 votes) · LW · GW · 4 comments

This is a question post.


To build intuition [LW · GW] about content vs architecture in AI (which comes up a lot in discussions about AI takeoff that involve Robin Hanson), I've been wondering about content vs architecture size (where size is measured in number of bits).

Here's how I'm operationalizing content and architecture size for ML systems:

I tried looking at the AlphaGo paper to see if I could find this kind of information, but after trying for about 30 minutes was unable to find what I wanted. I can't tell if this is because I'm not acquainted enough with the ML field to locate this information or if that information just isn't in the paper.

Is this information easily available for various ML systems? What is the fastest way to gather this information?

I'm also wondering about this same content vs architecture size split in humans. For humans, one way I'm thinking of it is as "amount of information encoded in inheritance mechanisms" vs "amount of information encoded in a typical adult human brain". I know that Eliezer Yudkowsky has cited 750 megabytes as the amount of information in human DNA, and also emphasizes that most of this information is junk. This was in 2011, and I don't know if there's a new consensus or how to factor in epigenetic information. There is also content stored in genes, and I'm not sure how to separate out the content and architecture in genes.
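For reference, the 750 megabyte figure is presumably the raw, uncompressed size of the genome: roughly 3 billion base pairs at 2 bits per base. A minimal sketch of that arithmetic, where the round base-pair count is an assumption and `genome_megabytes` is just an illustrative name:

```python
# Four possible bases (A, C, G, T) means log2(4) = 2 bits per base.
BITS_PER_BASE = 2


def genome_megabytes(base_pairs, bits_per_base=BITS_PER_BASE):
    """Raw (uncompressed) storage size of a genome in megabytes (10**6 bytes)."""
    total_bits = base_pairs * bits_per_base
    return total_bits / 8 / 1e6


# Assumed round figure for the human genome: about 3 billion base pairs.
HUMAN_BASE_PAIRS = 3_000_000_000

if __name__ == "__main__":
    print(genome_megabytes(HUMAN_BASE_PAIRS))
```

This counts only raw sequence length; it says nothing about how much of that sequence is junk, compressible, or epigenetic.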

I'm pretty uncertain about whether this is even a good way to think about this topic, so I would also appreciate any feedback on this question itself. For example, if this isn't an interesting question to ask, I would like to know why.

Answers

answer by Davidmanheim · 2020-05-20T10:35:46.186Z · score: 13 (5 votes) · LW(p) · GW(p)

Two points:

First, this information will be hard to compile because of the way the systems work, but it seems like a very useful exercise. I would add that the program complexity should include some measure of the "size" of the hardware architecture as well as the libraries, etc. used.

Second, I think that for humans, the relevant size is not just the brain, but the information embedded in the cultural process used for education. This seems vaguely comparable to training data and/or architecture search for ML models, though the analogy should probably be clarified.

comment by lifelonglearner · 2020-05-20T20:43:51.754Z · score: 4 (2 votes) · LW(p) · GW(p)

I agree that the size of libraries is probably important. For many ML models, things like the under-the-hood optimizer are doing a lot of the "real work", IMO, rather than the source code that uses the libraries, which is usually much terser.
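A rough way to put numbers on this, as a sketch only: sum the on-disk bytes of the source files in a project directory and compare that with the same sum over an installed library directory. The function name `source_size_bytes` and the choice of file suffixes are assumptions for illustration, and byte counts are a crude proxy for information content (they ignore comments, compiled extensions, and compressibility).

```python
import os


def source_size_bytes(root, suffixes=(".py",)):
    """Sum the on-disk size of all files under `root` whose names end in `suffixes`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # str.endswith accepts a tuple of candidate suffixes.
            if name.endswith(suffixes):
                total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

Calling it once on the directory holding a model's training script and once on the library's install directory would give the two numbers to compare; the paths are left to the reader because they depend on the setup.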

4 comments

Comments sorted by top scores.

comment by gwern · 2020-05-21T17:50:33.768Z · score: 17 (4 votes) · LW(p) · GW(p)

I don't think any of the AG-related papers specify the disk size of the model; they may specify total # of parameters somewhere but if so, I don't recall offhand. It should be possible to estimate from the described model architecture by multiplying out all of the convolutions by strides/channels/etc but that's pretty tricky and easy to get wrong.

I once loosely estimated from the architecture, on R.J. Lipton's blog when he asked the same question, that the AZ model is probably somewhere around 300MB. So, large but not unusually so.
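To illustrate what that kind of estimate involves, here is a minimal sketch that counts only the convolution weights in a residual tower. The specific numbers (17 input planes, 256 filters, 3x3 kernels, 20 or 40 blocks, 4 bytes per parameter) are placeholder assumptions for illustration, not figures from this thread, and the sketch ignores batch norm parameters and the policy/value heads, so it should not be expected to reproduce the 300MB figure.

```python
def conv_params(in_channels, out_channels, kernel_size, bias=False):
    """Weights in one 2D convolution layer.

    The count depends on channels and kernel size only; stride and padding
    change the output shape but not the number of weights.
    """
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


def residual_tower_params(input_planes, filters, blocks, kernel_size=3):
    """One stem convolution plus `blocks` residual blocks of two convolutions each."""
    stem = conv_params(input_planes, filters, kernel_size)
    per_block = 2 * conv_params(filters, filters, kernel_size)
    return stem + blocks * per_block


def size_megabytes(num_params, bytes_per_param=4):
    """Disk size if every parameter is stored as a 32-bit float by default."""
    return num_params * bytes_per_param / 1e6


if __name__ == "__main__":
    # Hypothetical tower sizes, chosen only to show how the estimate scales.
    for blocks in (20, 40):
        n = residual_tower_params(input_planes=17, filters=256, blocks=blocks)
        print(blocks, n, size_megabytes(n))
```

The estimate scales almost linearly with the number of blocks and quadratically with the number of filters, which is part of why a small misreading of the architecture can throw the result off by a large factor.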

However, as I point out, if you are interested in interpreting that in an information-theoretic sense, you have to ask whether model compression/distillation/sparsification is relevant. The question of why NNs are so overparameterized, aside from being extremely important to AI risk and the hardware overhang question, is a pretty interesting one. There is an enormous literature (some of which I link here) showing an extreme range of size decreases/speed increases, with 10x being common and 100x not impossible depending on details like how much accuracy you want to give up. (For AZ, you could probably get 10x with no visible impact on Elo, but if you were willing to search another ply or two at runtime, perhaps you could get another order of magnitude? It's a tradeoff: the bigger the model, the higher the value function accuracy and the less search it needs to achieve a target Elo strength.)

But is that fair? After all, you can't learn that small neural network in the first place except by first passing through the very large one (as far as anyone knows). Similarly, with DNA, you have enormous ranges of genome sizes for no good apparent reason, even among closely related species, and viruses demonstrate that you can get absurd compression out of DNA by overlapping genes or reading them backwards (among other insane tricks). But such minified genomes may be quite fragile, and junk DNA and chromosomal or whole-genome duplications often lead to big genetic changes, adaptations, and speciations, so all that fat may be serving evolvability or robustness purposes. Like NNs, maybe you can only get that hyper-specialized efficient genome after passing through a much larger overparameterized genome. (Viruses, then, may get away with such tiny genomes by optimizing for relatively narrow tasks, applying extraordinary replication & mutation rates, and outsourcing as much as they can to regular cells or other viruses or other copies of themselves, like 'multipartite viruses'. And even then, some viruses will have huge genomes.) https://slatestarcodex.com/2020/05/12/studies-on-slack/ and https://www.gwern.net/Backstop and https://www.gwern.net/Hydrocephalus might be relevant reading here.

comment by steve2152 · 2020-05-21T20:32:45.070Z · score: 4 (2 votes) · LW(p) · GW(p)

I'm not sure exactly what you're trying to learn here, or what debate you're trying to resolve. (Do you have a reference?)

If almost all the complexity is in architecture, you can have fast takeoff because it doesn't work well until the pieces are all in place; or you can have slow takeoff in the opposite case. If almost all the complexity is in learned content, you can have fast takeoff because there's 50 million books and 100,000 years of YouTube videos and the AI can deeply understand all of them in 24 hours; or you can have slow takeoff because, for example, maybe the fastest supercomputers can just barely run the algorithm at all, and the algorithm gets slower and slower as it learns more, and eventually grinds to a halt, or something like that.

If an algorithm uses data structures that are specifically suited to doing Task X, and a different set of data structures that are suited to Task Y, would you call that two units of content or two units of architecture?

(I personally do not believe that intelligence requires a Swiss-army-knife of many different algorithms, see here [LW · GW], but this is certainly a topic on which reasonable people disagree.)

comment by riceissa · 2020-05-25T03:06:42.309Z · score: 4 (2 votes) · LW(p) · GW(p)

"I'm not sure exactly what you're trying to learn here, or what debate you're trying to resolve. (Do you have a reference?)"

I'm not entirely sure what I'm trying to learn here (which is part of what I was trying to express with the final paragraph of my question); this just seemed like a natural question to ask as I started thinking more about AI takeoff.

In "I Heart CYC", Robin Hanson writes: "So we need to explicitly code knowledge by hand until we have enough to build systems effective at asking questions, reading, and learning for themselves. Prior AI researchers were too comfortable starting every project over from scratch; they needed to join to create larger integrated knowledge bases."

It sounds like he expects early AGI systems to have lots of hand-coded knowledge, i.e. the minimum number of bits needed to specify a seed AI is large compared to what Eliezer Yudkowsky expects. (I wish people gave numbers for this so it's clear whether there really is a disagreement.) It also sounds like Robin Hanson expects progress in AI capabilities to come from piling on more hand-coded content.

If ML source code is small and isn't growing in size, that seems like evidence against Hanson's view.

If ML source code is much smaller than the human genome, I can do a better job of visualizing the kind of AI development trajectory that Robin Hanson expects, where we stick in a bunch of content and share content among AI systems. If ML source code is already quite large, then it's harder for me to visualize this (in this case, it seems like we don't know what we're doing, and progress will come from better understanding).

If the human genome is small, I think that makes a discontinuity in capabilities more likely. When I try to visualize where progress comes from in this case, it seems like it would come from a small number of insights. We can take some extreme cases: if we knew that the code for a seed AGI could fit in a 500-line Python program (I don't know if anybody expects this), a FOOM seems more likely (there's just less surface area for making lots of small improvements). Whereas if I knew that the smallest program for a seed AGI required gigabytes of source code, I feel like progress would come in smaller pieces.

"If an algorithm uses data structures that are specifically suited to doing Task X, and a different set of data structures that are suited to Task Y, would you call that two units of content or two units of architecture?"

I'm not sure. The content/architecture split doesn't seem clean to me, and I haven't seen anyone give a clear definition. Specialized data structures seem like a good example of something that's in between.

comment by steve2152 · 2020-05-27T14:47:32.804Z · score: 4 (2 votes) · LW(p) · GW(p)

OK, I think that helps.

It sounds like your question should really be more like how many programmer-hours go into putting domain-specific content / capabilities into an AI. (You can disagree.) If it's very high, then it's the Robin-Hanson-world where different companies make AI-for-domain-X, AI-for-domain-Y, etc., and they trade and collaborate. If it's very low, then it's more plausible that someone will have a good idea and Bam, they have an AGI. (Although it might still require huge amounts of compute.)

If so, I don't think the information content of the weights of a trained model is relevant. The weights are learned automatically. Changing the code from num_hidden_layers = 10 to num_hidden_layers = 100 is not 10× the programmer effort. (It may or may not require more compute, and it may or may not require more labeled examples, and it may or may not require more hyperparameter tuning, but those are all different things, and in no case is there any reason to think it's a factor of 10, except maybe some aspects of compute.)
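To make the num_hidden_layers point concrete, here is a minimal sketch that counts the parameters of a plain fully connected network. The layer widths (784 inputs, 512 hidden units, 10 outputs) are arbitrary assumptions for illustration, and `mlp_param_count` is a hypothetical helper, not part of any library.

```python
def mlp_param_count(input_dim, hidden_dim, output_dim, num_hidden_layers):
    """Count weights and biases in a fully connected network."""
    dims = [input_dim] + [hidden_dim] * num_hidden_layers + [output_dim]
    # Each layer has d_in * d_out weights plus d_out biases.
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))


if __name__ == "__main__":
    small = mlp_param_count(784, 512, 10, num_hidden_layers=10)
    large = mlp_param_count(784, 512, 10, num_hidden_layers=100)
    # The learned content grows by millions of parameters...
    print(small, large, large / small)
    # ...while the source code changes by a single character.
    print(len("num_hidden_layers = 100") - len("num_hidden_layers = 10"))
```

Under these assumptions the weight count grows by close to a factor of 10 while the source grows by one character, which is the sense in which the size of the weights says little about programmer effort.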

I don't think the size of the PyTorch codebase is relevant either.

I agree that the size of the human genome is relevant, as long as we all keep in mind that it's a massive upper bound, because perhaps a vanishingly small fraction of that is "domain-specific content / capabilities". Even within the brain, you have to synthesize tons of different proteins, control the concentrations of tons of chemicals, etc. etc.

I think the core of your question is generalizability. If you have AlphaStar but want to control a robot instead, how much extra code do you need to write? Do insights in computer vision help with NLP and vice-versa? That kind of stuff. I think generalizability has been pretty high in AI, although maybe that statement is so vague as to be vacuous. I'm thinking, for example, it's not like we have "BatchNorm for machine translation" and "BatchNorm for image segmentation" etc. It's the same BatchNorm.

On the brain side, I'm a big believer in the theory that the neocortex has one algorithm which simultaneously does planning, action, classification, prediction, etc. (The merging of action and understanding in particular is explained in my post here [LW · GW], see also Planning By Probabilistic Inference.) So that helps with generalizability. And I already mentioned my post on cortical uniformity [LW · GW]. I think a programmer who knows the core neocortical algorithm and wants to then imitate the whole neocortex would mainly need (1) a database of "innate" region-to-region connections, organized by connection type (feedforward, feedback, hormone receptors) and structure (2D array of connections vs 1D, etc.), (2) a database of region-specific hyperparameters, especially when the region should lock itself down to prevent further learning ("sensitive periods"). Assuming that's the right starting point, I don't have a great sense for how many bits of data this is, but I think the information is out there in the developmental neuroscience literature. My wild guess right now would be on the order of a few KB, but with very low confidence. It's something I want to look into more when I get a chance. Note also that the would-be AGI engineer can potentially just figure out those few KB from the neuroscience literature, rather than discovering it in a more laborious way.

Oh, you also probably need code for certain non-neocortex functions [LW · GW] like flagging human speech sounds as important to attend to etc. I suspect that that particular example is about as straightforward as it sounds, but there might be other things that are hard to do, or where it's not clear what needs to be done. Of course, for an aligned AGI, there could potentially be a lot of work required to sculpt the reward function.

Just thinking out loud :)