ML is now automating parts of chip R&D. How big a deal is this?

post by Daniel Kokotajlo (daniel-kokotajlo) · 2021-06-10T09:51:37.475Z · LW · GW · 2 comments

This is a question post.

Article; paper

...the authors’ floorplan solutions have been incorporated into the chip designs for Google’s next-generation artificial-intelligence processors. This means that the solutions are good enough for millions of copies to be printed on expensive, cutting-edge silicon wafers. We can therefore expect the semiconductor industry to redouble its interest in replicating the authors’ work, and to pursue a host of similar applications throughout the chip-design process.

My current guess is that this is not a big deal. Surely these AI-optimizations will result in something like 10% improvement in AI-training-FLOPS-per-dollar, not 100%+, so they won't really change timelines or anything else strategically important. And it won't even be 10% improvement every year from now on, but more like 10% this year, a further 5% next year, a further 2.5% the year after that, etc. as the low-hanging fruit from floorplan optimization is picked. OTOH, this plausibly will reduce the time it takes to design new chips by a lot... but I'd be surprised if that was the main bottleneck anyway. I would have thought ramping up production was the main bottleneck.
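
Back-of-the-envelope, assuming that halving schedule holds exactly, the cumulative gain converges to about

$$1.10 \times 1.05 \times 1.025 \times \cdots \;=\; \prod_{k=0}^{\infty}\left(1 + 0.1 \cdot 2^{-k}\right) \;\approx\; 1.21,$$

i.e. roughly a one-time ~20% improvement in training-FLOPS-per-dollar, spread over several years.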

I know very little about the chip industry though. Anyone care to correct me?

Answers

answer by ljh2 · 2021-06-11T02:42:04.605Z · LW(p) · GW(p)

Just made this account to answer this. Source: I've worked in physical design/VLSI and CPU verification, and pretty regularly deal with RTL.

TL;DR - You're right-- it's not a big deal, but it simultaneously means more and less than you think.

The Problem

Jump to "What It Means" if you already understand the problem.

First, let me talk about the purpose of floorplanning. The authors mention it a little bit, but it's worth repeating.

Placement optimizations of this form appear in a wide range of science and engineering applications, including hardware design, city planning, vaccine testing and distribution, and cerebral cortex layout.

Much like a city, an SoC (system-on-chip) has lots of agents that transfer data to each other. If a mayor has to get to city hall, the library, the post office, the locksmith, the school, the burger joint, etc., how do you best place the buildings to get the shortest path to each of them? Suppose suddenly the librarian wants to first go to school, then the post office, and also grab a burger because they're 20% off. How do you position that requirement along with the mayor's requirement? Do you prioritize the mayor? What if he wants a burger too? What if you don't even know in advance how many stops the mayor will make before returning to city hall? Etc., etc.

As you probably know, placement in general is an NP-complete problem. Tools for this exist, and/or you can do it manually, but much like city planning, it gets very complicated very fast. These tools (if you wanna sound cool, call them PnR tools (place-and-route)) take foooreeever to run (it's quite common to let a tool run for a week) and are critical in the holistic design lifecycle-- more on that later.

Enter this paper. Honestly, they don't do any revolutionary stuff-- CNNs, ReLU, weight adjustment-- or rather, it's revolutionary because it's applied to PnR, for the first time that I've seen at least (which, in hindsight, is pretty obvious: pulling up the GUI for the tool, it's literally just a grid, exactly like a city, with its own centers and everything. Still cool nevertheless).
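
To give a concrete (if toy) picture of the optimization problem being attacked, here's a small sketch-- entirely my own illustration, with made-up block names and nets, not the paper's method or a real PnR flow: a handful of blocks placed on a grid, with total half-perimeter wirelength (HPWL, a standard wirelength proxy) reduced by a simulated-annealing-style loop.

```python
import math
import random

# Toy placement sketch (illustration only -- real PnR tools, and the paper's
# learned approach, are far more sophisticated). A few blocks live on a small
# grid, "nets" connect them, and the cost is half-perimeter wirelength (HPWL).

GRID = 8
blocks = ["cpu", "cache", "dma", "io", "mac0", "mac1"]
nets = [("cpu", "cache"), ("cpu", "dma"), ("dma", "io"),
        ("mac0", "mac1"), ("cache", "mac0")]

def hpwl(pos):
    """Sum over nets of the bounding-box half-perimeter of the connected blocks."""
    total = 0
    for net in nets:
        xs = [pos[b][0] for b in net]
        ys = [pos[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

# Start from a random legal placement (one block per grid cell).
cells = random.sample([(x, y) for x in range(GRID) for y in range(GRID)], len(blocks))
pos = dict(zip(blocks, cells))
cost = hpwl(pos)

# Simulated-annealing-style loop: propose a move, keep it if it helps,
# or occasionally even if it doesn't, to escape local minima.
temp = 2.0
for _ in range(20000):
    b = random.choice(blocks)
    old, new = pos[b], (random.randrange(GRID), random.randrange(GRID))
    if new in pos.values():
        continue                      # keep placements non-overlapping
    pos[b] = new
    new_cost = hpwl(pos)
    if new_cost <= cost or random.random() < math.exp((cost - new_cost) / temp):
        cost = new_cost               # accept the move
    else:
        pos[b] = old                  # reject the move
    temp *= 0.9997                    # cool down

print("final HPWL:", cost)
print("placement:", pos)
```

Real tools juggle millions of cells plus timing, congestion, and power constraints, which is part of why they run for days-- but the shape of the problem is the same.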

Let's talk about results!

I don't know how to do tables in comments, so bear with the formatting-- here are the results for one test they did:

Note: I left out "Congestion" and "wire length" because those are metrics that tbh don't really matter.

| Method | Timing (wns) | Timing (tns) | Total area (µm²) | Total power (W) |
|---|---|---|---|---|
| RePlAce | 374 | 233.7 | 1,693,139 | 3.70 |
| Manual | 136 | 47.6 | 1,680,790 | 3.74 |
| Our method | 84 | 23.3 | 1,681,767 | 3.59 |

Don't worry about what wns and tns (worst and total negative slack) exactly mean (here are a few resources). Just know that they are essentially a measure of how short a "path" is between "buildings". The smaller they are, the better, because it means our mayor can travel less distance to get his burger.
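
If you do want a slightly more concrete picture, here is a tiny toy calculation of my own (not real static-timing-analysis output, and sign conventions vary between tools): each timing path has a slack equal to its required arrival time minus its actual arrival time; wns is the worst slack over all paths, and tns sums the slack of every failing path.

```python
# Toy illustration of wns/tns (not real static-timing-analysis output).
# Each path is (required_time, arrival_time) in arbitrary time units;
# slack = required - arrival, and negative slack means the path is too slow.
paths = [(500, 430), (500, 560), (500, 525), (500, 498)]

slacks = [required - arrival for required, arrival in paths]   # [70, -60, -25, 2]
failing = [s for s in slacks if s < 0]

wns = min(slacks)      # worst (most negative) slack over all paths -> -60
tns = sum(failing)     # total negative slack over all failing paths -> -85

print(f"slacks={slacks}  wns={wns}  tns={tns}")
```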

Area and power are relatively self-explanatory-- essentially, how big is your city plus all the roads you've built, and how much energy does it take to run it all. Again, the smaller the better.

What It Means

These are good results! We've just built roads half as long as the manual method's (23.3 vs. 47.6). But I want to give my opinion on why it's even worse than you think (i.e., I don't think it would provide even a 1% increase in perf, much in the same way that increasing CPU GHz doesn't do that much-- it's inherently limited), but also much better.

For why it's worse-- consider again city planning. Suppose we take this to the extreme and the burger joint, library, post office, etc. are all literally inside the same building as City Hall (i.e. no roads exist). First, the mayor's arteries will certainly get clogged passing by a McDonald's, but ultimately-- how much time does he really save?

I would argue that, while it depends on how convoluted the city was initially, there's a limit to how much you can shrink the roads and rearrange the buildings. While these planning efforts are very much worth striving for, they're not the real bottleneck.

Furthermore, what if this travel time was time simultaneously being well-spent already? For instance-- perhaps he checked his emails walking to the post office. Maybe he called his mother. Maybe he brought his meeting notes to practice a speech. The point is-- this travel time is not really saved: just reallocated.

Note: CPUs do this a lot, e.g. while a memory request is occurring, they just switch to some other task. This is also (to vastly oversimplify) essentially why frequency scaling no longer has the immense payoffs it did 30 years ago.
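
If you want a software analogy for that latency-hiding point (a sketch of my own, nothing to do with the paper): kick off the slow "memory access" and do other useful work while you wait, and the wait largely vanishes from the wall-clock total.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Software analogy of latency hiding (illustration only): overlap a slow
# "memory request" with independent work instead of idling.

def slow_fetch():
    time.sleep(0.2)                    # stand-in for a long-latency memory access
    return 42

def other_work():
    return sum(i * i for i in range(1_000_000))   # stand-in for independent work

start = time.time()
with ThreadPoolExecutor() as pool:
    pending = pool.submit(slow_fetch)  # issue the "memory request"...
    busy = other_work()                # ...and keep doing useful work meanwhile
    data = pending.result()
elapsed = time.time() - start
print(f"overlapped: {elapsed:.2f}s (vs. roughly 0.2s plus the compute time if run serially)")
```

The total here is roughly max(fetch, compute) rather than their sum-- which is why shrinking one leg of an already-overlapped pipeline buys less than you'd hope.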

Now that I've killed your enthusiasm, let me tell you why it's also better than you think with this quote.

We show that our method can generate chip floorplans that are comparable or superior to human experts in under six hours, whereas humans take months to produce acceptable floorplans for modern accelerators

I mentioned earlier that designers heavily rely on PnR tools not only prior to tapeout, but as tools to iterate with (e.g. can I mux this more efficiently? Do I really need this logic in the critical path? Can this "building" be shifted over? etc.). As these tools take longer to run as our designs grow more complex, the result is an ever-longer feedback loop-- again, sometimes a week-- and personally, I really like instant gratification, so it's definitely a bit annoying.

And this is why it's potentially better-- it's a step towards freeing up what I feel is a massive cost center for many semiconductor companies. Not just for better and tighter feedback loops, but because these PnR/physical design/EDA tool teams are massive. Like, hundreds of people sometimes. And these people ultimately have the final signoff for lots of tapeouts, and determine timelines for hardware companies.

Go 5 years in the future, and give them a tool that improves engineer productivity 100x? Honestly, that'd be insane. For me personally, but also for my colleagues. (Honestly, not sure what I'd do with that extra time. I currently just cook stuff while I'm blocked- :) )

So, that's why I think it's both better and worse than you think.

comment by Ben Pace (Benito) · 2021-06-11T05:18:41.514Z · LW(p) · GW(p)

Mod here, I put a table in your comment.

(Tables aren't in comment editors right now, I made it in the post editor and copied it in.)

comment by habryka (habryka4) · 2021-06-11T05:03:52.900Z · LW(p) · GW(p)

This is a great comment! Thank you for writing it! 

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-06-12T16:26:14.609Z · LW(p) · GW(p)

Awesome, thanks! And welcome to LW! I found this very helpful and now have some follow-up questions if you don't mind. :)

1. How does this square with Zac's answer below? On the surface it seems to contradict what you say; after all, it proposes 10x-1000x improvements to AI stuff whereas you say it won't even be 1%! I think I can see a way that your two answers can be interpreted as consistent, however: You identify the main benefit of this tech as reducing the clock time it takes for engineers to come up with a new good chip design. So even if the new design is only 1% better than the design the engineers would have come up with, if it happens a lot faster, that's a big deal. Why is it a big deal? Well, as Zac said, it means the latest AI architectures can be quickly supplemented by custom chips, and in general custom chips provide 10x - 1000x speedups. Would you agree with this synthesis?

2. I'd be interested in your best guess for what the median X's and Y's in this sentence are: "In about X years, we'll be in a regime where the latest AI models are run on specialized hardware that provides a factor-of-Y speedup over today's hardware."

3. ETA: Maybe another big implication of this technology is that it'll lower the barrier to entry for new chipmakers? Like, maybe 5 years from now there'll be off-the-shelf AI techniques that let people design cutting-edge new chips, and so China and Russia and India and everyone will have their own budding chip industry supported by generous government subsidies. Or maybe not -- maybe most of the barriers to entry have to do with manufacturing talent rather than design talent?

Replies from: ljh2
comment by ljh2 · 2021-08-24T03:21:36.912Z · LW(p) · GW(p)

I thought I wrote an answer to this. Turns out I didn't. Also, I am a horrific procrastinator. 

  1. In some sense, I'd agree with this synthesis. 
    I say some sense, because the other bottleneck that lots of chip designs have is verification. Somebody has to test the new crazy shit a designer might create, right? To go back to our city planner analogy-- sure, perhaps you create the most optimal connections between buildings. But what if the designer put the doors on the roof, because it's the fastest way down?
    Yes, designs can be produced faster, and can theoretically be fabbed out faster. But, as with anything that depends on humans, the design itself 1) has a certain amount of complexity that builds technical debt and 2) requires inspection.
    To me, this is like how software engineering has A) the actual development and B) the deployment to production. No matter how fast B) is, which may certainly aid in iteration, A) is still heavily gated by humans.
  2. It's hard to give a concrete answer for that, since there are A) so many different AI models and B) so many different hardware architectures to run those AI models on. AI is a full-stack problem that honestly still has lots of room to grow, so any advance in any component of the stack will produce growth.
    Put a gun to my head though-- x = 3, y = 2
  3. Though not in this specific paper/iteration, this technology definitely has potential to lower time-to-fab-- more specifically, post-silicon fabrication.
    But, you see, I don't think the barrier to entry is post-silicon fabrication. It is creating the design in the first place, and verifying it. This is what ARM does-- they already provide pre-verified designs (reference implementations) for you to rip off of and ship out as-is. Just pay them licensing fees!
    Furthermore, in many ways, a 1-2 year lead time is kinda built in already in our society (think of it-- you usually buy new hardware every couple years, right?). Thus, suppose you completely eliminate post-silicon fabrication times. Where would this extra time go? I highly doubt we would change our society-accepted cadence of hardware rotations. Most definitely, it would go right back into creating new designs-- human brains. Thus, I think the biggest barrier to entry is knowledge and engineering talent.
    Manufacturing talent is, frankly, thanks to TSMC's duopoly in foundries, not much of a barrier. Sure, it's a barrier that China is tackling (see the whole SMIC fiasco) but not one much of the Western world is willing to tackle.
    So, again, that just circles back to design talent.

All in all, I rebuff my original point that this isn't that big of a deal, but is still insanely cool. I'd love to heavily advance this technology, because it's pretty god damn annoying, but it just means I'd have more time to sit on my hands, and there's no guarantee I'd do anything good with that time!

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-08-24T09:41:57.130Z · LW(p) · GW(p)

Thanks! As before, this was helpful & I have some follow-up questions. :) Feel free to not reply if you don't want to.

1. Can verification be automated too, in the next 10 years?

2. Quantitatively, about how much time + money does a good version of this automated chip design save? E.g. "It normally takes 1 year to design a chip and 2 years to actually scale up production; this tech turns that 1 year into 1 month (when you include verification), for an overall time savings of 33%. As for cost, design is a small fraction of the cost (even a research team of hundreds for a year is nothing compared to the cost of a manufacturing line or whatever) so the effect is negligible."

3. y = 2? That's way lower y than I expected, especially considering that you "rebuff my original point that this isn't that big of a deal." A 2x improvement in 3 years is NOT a big deal, right? Isn't that slightly slower than the historical rate of progress from e.g. moore's law etc.? Or are you saying it's going to be a 2x improvement on top of the regular progress from other sources? Oh... maybe you are specifically talking about speed improvements rather than all-things-considered cost to train a model of a given size on a given dataset? It's the latter that I'm interested in, I probably misspoke.

4. What is post-silicon fabrication? When I google it it redirects to "post-silicon validation." If creating the design and verifying it is the barrier to entry, then won't this AI tech help reduce the barrier to entry since it automates the design part? I guess I just don't understand your point 3.

5. "Thus, suppose you completely eliminate post-silicon fabrication times. Where would this extra time go? I highly doubt we would change our society-accepted cadence of hardware rotations. Most definitely, it would go right back into creating new designs-- human brains. " I'm particularly keen to hear what you mean by this.

Replies from: ljh2
comment by ljh2 · 2021-08-24T16:50:03.064Z · LW(p) · GW(p)
  1. Definitely not in the next 10 years. In some sense, that's what formal verification is all about. There's progress, but from my perspective, the growth is very linear.
    The tools that I have seen (e.g. out of the RISC-V Summit, or DVCon) are difficult to adopt, and there's a large inertia you have to overcome since many big Semi companies already have their own custom flows built up over decades.
    I think it'll take a young plucky startup to adopt and push for the usage of these tools-- but even then, you need the talent to learn these tools, and frankly hardware is filled with old people.
  2. I think we have different interpretations of "design". You consider chip design in the aggregate, but I'm subdividing it into multiple areas. There are several aspects of chip design, some of which can be automated, but I'm claiming never to an extent as extreme as e.g. 1 month. This technology in particular really only helps in determining where to place "buildings", not in actually building the "buildings" themselves. While valuable, there's only so much "placing" can do.
    My view is that the time and money spent won't go down, just be reallocated, which may or may not increase quality.
  3. Sorry, I guess I meant the former where I incorporate every source, at least on the hardware side. Were you to isolate just the ML Chip placement gain... again, hard to say. It's just indicative of a release of resources, but who knows if those extra resources can/will be properly directed to something better?
  4. + 5. : Sorry! I guess I meant post-design fabrication, which is really just a term I came up with to mean "shipping it to TSMC once you're done designing". A better term, in hindsight, is just called "tapeout", but I was hesitant to use the term time-to-tapeout since that feels cumulative rather than isolating that one period of time I mean.

    See: https://anysilicon.com/verification-validation-testing-asic-soc-designs-differences/

    What I mean is that this technology addresses the "Physical Design" blob of time in the diagram above. Notice that the whole critical path to "Shipping"/getting the chips out there goes "Verification" --> "Tapeout" --> "Validation"/Testing.

    Suppose the "Physical Design" time gets eliminated. These freed resources will most definitely go into "RTL Design" and not "Verification". That's what I mean by "creating new designs"-- it gives us more time to think of cool stuff, but again, depends if that stuff is good or not.

    Why will extra resources not be devoted to verification? That's a whole can of worms. Industry inertia, overlapping talent skillset, business models, design complexity-- but I guess most of all I'd say inertia. 

    On inertia-- as I said, this cadence takes about 1-2 years. We are so so so very accustomed to this cadence, I can't see it changing barring massive changes in our needs. If you told me you could reduce our verification time from 1 year to 11 months, I'd just spend that extra month iterating on my RTL design instead, or use that extra time to run more simulations, because 11 vs. 12 months doesn't mean much.

    If you told me I could reduce it from 1 year --> 6 months? I'd maaaaybe throw money at you. It has potential to double my income, but that depends.

    Imagine new iPhones came out every 6 months instead of yearly. Isn't that super weird? Well... That depends on how well Apple can market to me that I absolutely need it.

    Perhaps that differs for AI use cases... but even there, I'd argue this yearly cadence is ingrained already.

answer by Zac Hatfield-Dodds (Zac Hatfield Dodds) · 2021-06-10T11:53:09.974Z · LW(p) · GW(p)

Circuit design is the main bottleneck for use of field-programmable gate arrays. If fully-automated designs become good enough, we could see substantial gains from having optimising compilers output a gate layout rather than machine code for an xPU or specific accelerator. We already have some such compilers, and this looks like a meaningful step towards handling non-toy-scale problems with them.

The main change here wouldn't be so much training speed - we already have TPUs etc. to accelerate current workloads, and fabricating a new design as ASICs rather than FPGA layouts takes months-to-years at scale - but rather the latency with which we can try out custom hardware for novel ML paradigms such as transformers. What is to transformers as TPUs are to CNNs? Specifically for novel tasks, this could be a 10x-1000x speedup, and 2x-50x speedup for existing workloads... though I understand they're bottlenecked more on data movement between nodes than compute.

TLDR: a small step in a high-long-term-impact trend.

(Source: while I'm not a hardware specialist, I've worked with the PyMTL team at Cornell on verification and validation of their Python-to-Verilog-to-silicon hardware design tools, followed high-level developments in custom compute hardware for around a decade, and worked on peta-scale supercomputing for a few years.)

comment by jimrandomh · 2021-06-11T05:38:14.362Z · LW(p) · GW(p)

I think this is incorrect. You might imagine that CPU->GPU and GPU->TPU transitions were steps up a tall log-scale tech ladder, in the way that Moore's-law doublings were, with many more steps still possible in theory. But this is not the case, because the metric these transitions were improving on was "fraction of transistors which are dedicated to useful compute" (as opposed to extracting parallelism from a serial instruction stream, or computing unnecessary low-order bits on overly-wide floating point). This metric has a hard upper limit, at 100%, and I don't think there's even one order of magnitude left between current utilization and that limit.

Replies from: zac-hatfield-dodds
comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2021-06-11T08:48:13.931Z · LW(p) · GW(p)

No, I think we mostly agree - I'd expect TPUs to be within say 4x of practically optimal for the things they do. The remaining ~1 OOM I think is possible for non-novel tasks has more to do with specialisation, e.g. model-specific hardware design, and that definitely has an asymptote.

The interesting case is if we can get TPU-equivalent hardware days after designing a new architecture, instead of years after, because (IMO) 1,000x speedups over CPUs are plausible.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-06-10T13:27:08.044Z · LW(p) · GW(p)

Thanks! As I understand it, you are saying (a) In general it's not hard to get 10x - 1000x speedups (as measured by flops per dollar? Or better yet, performance per dollar?) for very specific/narrow AI applications, if you design custom hardware for it, and (b) when AIs automate more of the chip design process, it'll take less time and money to design custom hardware for stuff, so e.g. when Transformer 2.0 comes out, less than a year later there'll be specialized hardware for it that makes it even better. Is this a fair summary?

If so, I'd be interested to hear why you said 10x - 1000x, as opposed to 2x or 1.1x. Has specialized hardware given 100x improvements in performance-per-dollar in the past? For neural nets in particular?

Replies from: zac-hatfield-dodds
comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2021-06-10T14:16:25.024Z · LW(p) · GW(p)

Yes, that's a fair summary - though in "not hard ... if you design custom hardware" the second clause is doing a lot of work.

As to the magnitude of improvement, really good linear algebra libraries are ~1.5x faster than 'just' good ones, GPUs are a 5x-10x improvement on CPUs for deep learning, and TPUs 15x-30x over Google's previous CPU/GPU combination (this 2018 post is a good resource). So we've already seen 100x-400x improvement on ML workloads by moving naive CPU code to good but not hyper-specialised ASICs.

Truly application-specific hardware is a very wide reference class, but I think it's reasonable to expect equivalent speedups for future applications. If we're starting with something well-suited to existing accelerators like GPUs or TPUs, there's less room for improvement; on the other hand TPUs are designed to support a variety of network architectures and fully customised non-reprogrammable silicon can be 100x faster or more... it's just terribly impractical due to the costs and latency of design and production with current technology.

For example, with custom hardware you can do bubblesort in O(n) time, by adding a compare-and-swap unit between the memory for each element. Or with a 2D grid of these, you can pipeline your operations and sort lists in O(1) amortized time and O(n) latency! Matching the logical structure of your chip to the dataflow of your program is beyond the scope of this article (which is "just" physical structure), but also almost absurdly powerful.
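
To make the first point concrete, here's a toy software simulation (my own illustration, not part of the original comment) of what that hardware does: odd-even transposition sort, which is what "bubblesort with a compare-and-swap unit between every adjacent pair" amounts to. Each "clock cycle" performs a whole row of comparisons at once, so n elements are sorted in n cycles.

```python
# Toy software simulation of hardware-style "parallel bubblesort"
# (odd-even transposition sort); illustration only.
# In hardware, every compare-and-swap in a phase fires simultaneously,
# so n elements sort in n clock cycles -- O(n) time -- versus the
# O(n^2) sequential steps of software bubblesort.

def sort_in_n_cycles(values):
    regs = list(values)
    n = len(regs)
    for cycle in range(n):                      # n "clock cycles"
        start = cycle % 2                       # alternate even/odd pairs
        for i in range(start, n - 1, 2):        # in hardware: all of these at once
            if regs[i] > regs[i + 1]:
                regs[i], regs[i + 1] = regs[i + 1], regs[i]
    return regs

print(sort_in_n_cycles([7, 3, 9, 1, 4, 8, 2, 6]))   # [1, 2, 3, 4, 6, 7, 8, 9]
```

The point being: once the hardware's structure matches the algorithm's dataflow, the asymptotics themselves can change, not just the constant factors.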

answer by ryan_b · 2021-06-13T21:05:09.783Z · LW(p) · GW(p)

In the short term: moderately big deal. The chip industry is currently in rather a lot of flux; Intel was supplanted as leader in transistor size by TSMC; Apple is running with their own chip designs; China's monopoly on rare earth mineral processing has come under scrutiny again. This has provoked a boom in new development as a consequence. Even a small improvement in the design and manufacture of these facilities weighs a lot; because the chip industry is so important and so centralized, moderately big deal is essentially the floor for any actual development within it.

In the long term: big deal. This is not an opinion shared by anyone else as far as I can tell, but it feels very clear to me that "people use ML for this application" is the threshold at which the hardware overhang becomes almost immediately accessible to AGI. At that point adopting AGI is literally just an upgrade operation, as opposed to having to go through the entire process of converting a workflow into something an AI of any type can work on. To be more concrete: I expect any kind of AI-driven takeover to control all of the currently-uses-ML industries before taking over any that do not, and I expect that within currently-uses-ML industries the order will be determined largely by how saturated they are with tools of that kind.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-06-14T06:52:25.875Z · LW(p) · GW(p)

Thanks! So would you agree with my suggestion in the comment above that this will lower barriers to entry and allow many new chipmakers into the market?

I'm not yet convinced re AGI takeover. I feel like this sort of chip optimization is the sort of thing more suited to narrow AI than to AGI. Maybe in the long run additional optimizations could be gleaned by drawing on general world-knowledge that includes e.g. understanding of fractals and biological systems and city planning and so forth, but I feel like that would be only marginally better than what a narrow AI trained on simulated chips would produce.

Replies from: ryan_b
comment by ryan_b · 2021-06-14T16:45:24.794Z · LW(p) · GW(p)

I would agree; anything that cuts months and potentially hundreds of people makes it easier for new entrants. Further, the trend appears strongly in the direction of outsourcing, as even Intel will now build others' designs. I see no reason why this could not be done on a contracting basis as well. The primary obstacle is the low appetite for the private sector to make large investments in physical things. Intel and TSMC's new investments are largely defense-motivated.

I agree that this particular sort of chip optimization is suited more for narrow AI than AGI; my claim is rather that anything which employs narrow AI is more vulnerable to AGI takeover. It seems likely to me that AGI would have an interest in production of processing power, so it seems like automating the steps is lowering the threshold.

I also consider that this kind of development is exactly what the CAIS model predicts. If CAIS is a system of narrow AIs, including coordinator/management AIs, why won't a misaligned or malevolent coordinator AI emerge from interacting with already existing narrow AIs? The malevolent case could be as straightforward as an ML redux of Stuxnet.

All of this rests pretty heavily on the crux that once one AI runs a task, it is easy to replace it with another AI; if this effect is weak, or I am completely wrong and it is in fact harder, then the chain of logic falls apart.

I see this as analogous to the points you made in the embodied intellectual property post comments [LW(p) · GW(p)]: what we think we are doing is making more efficient use of resources, but what we are actually doing is engaging in a tradeoff of gaining time and money in exchange for living with a more opaque method of controlling the work. Within this more opaque method, additional risks lie.  A more specific analogy to the Portuguese sailing technology commentary [LW(p) · GW(p)] in the Conquistadors post feels achievable, but it isn't coming together for me yet.

2 comments


comment by Charlie Steiner · 2021-06-11T00:27:04.357Z · LW(p) · GW(p)

One additional thing I'd be interested in is AI-assisted solution of the differential equations behind better masks for EUV lithography. It seems naively like another factor of 2-ish in feature size is just sitting out there waiting to be seized, though maybe I'm misunderstanding what I've heard about switching back to old-style masks with EUV.