$300 Fermi Model Competition

post by ozziegooen · 2025-02-03T19:47:09.270Z · LW · GW · 18 comments

Comments sorted by top scores.

comment by Shankar Sivarajan (shankar-sivarajan) · 2025-02-15T07:24:31.712Z · LW(p) · GW(p)

Summary: The effort required to manually do the calculations an LLM does to answer a simple query (in Chinese, for the Searle's Chinese Room reference) is about what it'd take to build a modern million-man city from scratch.

Model: 

Say a human can perform 1 multiply-accumulate (MAC) operation every 5 seconds.

First, we produce an estimate for single token generation for Llama 3 8B: 8 billion parameters, about 2 MAC operations per parameter, and with some additional overhead for attention mechanisms, feedforward layers, and other computations, estimate 50 billion MAC operations per token.

That's 50 × 10^9 MACs × 5 seconds/MAC = 2.5 × 10^11 seconds/token ≈ 6.9 × 10^7 hours.

Estimate full-time work for a year is 8 hours/day, 5 days/week, 50 weeks/year ≈ 2000 hours/year.

6.9 × 10^7 hours ÷ 2,000 hours/man-year ≈ 35,000 man-years/token.

Tokens in a simple Chinese question + answer pair: 

Question: ~5–10 tokens; Answer: ~10–30 tokens; Total: ~15–40 tokens.

So in total, about 500,000–1,500,000 man-years.
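
The arithmetic above can be sketched in a few lines of Python. This is only a restatement of the comment's round numbers (5 seconds per MAC, 50 billion MACs per token, 2,000 working hours per year); the function and constant names are made up for illustration.

```python
# Fermi sketch of the hand-computation estimate.
# All inputs are the comment's own round-number assumptions.
SECONDS_PER_MAC = 5                 # one multiply-accumulate every 5 seconds
MACS_PER_TOKEN = 50e9               # Llama 3 8B, including overhead
HOURS_PER_MAN_YEAR = 8 * 5 * 50     # 8 h/day, 5 days/week, 50 weeks/year = 2,000


def man_years_per_token():
    """Man-years of full-time work needed to compute one token by hand."""
    seconds = MACS_PER_TOKEN * SECONDS_PER_MAC
    hours = seconds / 3600
    return hours / HOURS_PER_MAN_YEAR


def man_years_for_exchange(tokens):
    """Man-years for a question + answer pair of the given total token count."""
    return tokens * man_years_per_token()


if __name__ == "__main__":
    print(f"{man_years_per_token():,.0f} man-years per token")
    for tokens in (15, 40):
        print(f"{tokens} tokens: {man_years_for_exchange(tokens):,.0f} man-years")
```

Running it with 15 and 40 tokens gives a little over 500,000 and a little under 1,400,000 man-years, which matches the rounded range quoted above.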

 

For building a city, the most important factors are:

Infrastructure Construction (3–5 years):

  • Roads, bridges, and transportation networks.
  • Water supply systems (reservoirs, pipelines, treatment plants).
  • Sewage and waste management systems.
  • Electrical grids, telecommunications, and internet infrastructure.

Labor: ~10,000 workers.

Man-years: 30,000–50,000 man-years.

Residential and Commercial Buildings (5–10 years):

  • Construction of housing for ~1 million people (apartments, single-family homes).
  • Building commercial spaces (offices, shops, markets).
  • Interior finishing and utilities installation.

Labor: ~20,000 workers.

Man-years: 100,000–200,000 man-years.

Including planning and design and site preparation (clearing land, building access roads, and excavating for foundations), estimate about 150,000–300,000 man-years depending on the size.

As a check on this estimate: Brasília, built in the 1950s as a city for ~500,000 people, took about 5 years to construct and involved ~60,000 workers, which translates to ~300,000 man-years.

Assuming it scales proportionally with population, manually performing the calculations to answer a simple Chinese query is about as hard as building a city with 1–2 million population.
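
A minimal sketch of that proportional scaling, assuming the Brasília figures quoted above (60,000 workers, 5 years, 500,000 residents) are the reference point. The names are hypothetical and the linear scaling is the comment's own assumption, not an established rule.

```python
# Proportional scaling from the Brasília data point quoted in the comment.
BRASILIA_WORKERS = 60_000
BRASILIA_YEARS = 5
BRASILIA_POPULATION = 500_000


def man_years_per_resident():
    """Construction effort per resident implied by Brasília."""
    return BRASILIA_WORKERS * BRASILIA_YEARS / BRASILIA_POPULATION


def equivalent_city_population(man_years):
    """City size buildable with the given effort, assuming linear scaling."""
    return man_years / man_years_per_resident()


if __name__ == "__main__":
    for effort in (500_000, 1_500_000):
        people = equivalent_city_population(effort)
        print(f"{effort:,} man-years ~ city of {people:,.0f} people")
```

Under these assumptions the 500,000–1,500,000 man-year range maps to a city of very roughly 0.8 to 2.5 million people, the same order of magnitude as the 1–2 million stated above.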

Technique: DeepSeek, but I cut down its verbose answers.

comment by niplav · 2025-02-12T16:19:23.504Z · LW(p) · GW(p)

Model is here.

Background: I was thinking about the scaling-first picture and the bitter lesson and how one might interpret it in two different ways:

  1. One is that deep learning is necessary and sufficient for intelligence, there's no such thing as thinking, no cleverer way to approximate Bayesian inference, no abduction etc.
  2. The other is that deep learning is sufficient for radical capabilities, superhuman intelligence, but doesn't exclude there being even smarter ways of going about performing cognition.

We have a lot of evidence about the second one, but less about the first one. Evidence for the first one takes the form of "smart humans tried for 75 years, spending ??? person-years on AI research", so I decided to use Squiggle to estimate the amount of AI research that has happened so far.

Result: 380k to 6.3M person-years, mean 1.5M.

Technique: Used hand-written squiggle code. (I didn't use AI for this one).

Replies from: niplav
comment by niplav · 2025-02-12T16:38:53.070Z · LW(p) · GW(p)

I don't know whether this will count as a separate submission (I prefer to treat these two models as one submission), but I did one more step on improving the model.

New Model is here.

Background is the same as above.

Result: Expected number of AI research years is ~150k to 5.4M years, mean 1.7M.

Technique: I pasted the original model into Claude Sonnet and asked it to suggest improvements. I then gave the original model and some hand-written suggested improvements to Squiggle AI (instructing it to add different growth modes for the AI winters and changing the variance of number of AI researchers to be lower in early years and close to the present).

Replies from: ozziegooen
comment by ozziegooen · 2025-02-14T20:20:17.151Z · LW(p) · GW(p)

That's fine, we'll just review this updated model then.

We'll only start evaluating models after the cut-off date, so feel free to make edits/updates before then. In general, we'll only use the most recent version of each submitted model. 

comment by ozziegooen · 2025-02-12T03:49:22.525Z · LW(p) · GW(p)

Submissions end soon (this Sunday)! If there aren't many, then this can be an easy $300 for someone. 

comment by ozziegooen · 2025-03-03T18:24:33.660Z · LW(p) · GW(p)

Results are in and updated - it looks like dmartin89 wins.

We previously posted the results, but then a participant investigated our app and found an error in the calculations. We then redid some of the calculations and confirmed that there were errors. The main update was that dmartin89 had a much higher Surprise score than originally estimated - changing this led to their entry winning.

To help make up for the confusion, we're awarding an additional $100 prize for 2nd place. This will be awarded to kairos_. I'll cover this cost personally. 

Again, thanks to all who participated!

We have a very basic web application showing some results here. It was coded quickly (with AI) and has some quirks, but if you search around you can get the main information. 

We didn't end up applying the Goodharting penalty for any submissions. No models seemed to goodhart under a cursory glance. 

If time permits, we'll later write a longer post highlighting the posts more and going over lessons learned from this.

Replies from: niplav
comment by niplav · 2025-03-17T20:02:19.179Z · LW(p) · GW(p)

Thank you for running the competition! It made me use & appreciate squiggle more, and I expect that a bunch of my estimation workflows in the future will be generating and then tweaking an AI-generated squiggle model.

comment by Shankar Sivarajan (shankar-sivarajan) · 2025-02-15T17:57:19.908Z · LW(p) · GW(p)

Summary: For the $500 billion investment recently announced for AI infrastructure, you could move a mountain a mile high across the Atlantic Ocean.

Model: The cost of shipping dry bulk cargo is about $10 per ton, so you can move about 50 billion tons.

Assuming a rock density of 2.5–3 tons per cubic meter, that's a volume of about 17–20 billion cubic meters.

If you pile that into a cone, with angle of repose θ = 35°–45°, and use the volume of a cone V = (1/3)πr²h with r = h/tan θ, so that V = πh³/(3 tan²θ),

h = (3V tan²θ/π)^(1/3) ⇒ h ≈ 2500 m ≈ 8,000 feet.

If you put it in the middle of the Great Plains, say, in Kansas because you're tired of people joking that it's "flatter than a pancake," that adds about 2000 feet above sea level, for a total elevation of ~10,000 feet, about 2 miles.
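
As a rough check on the cone height, here is a small Python sketch. It assumes the pile is a right circular cone whose base radius is r = h / tan θ, so that V = πh³ / (3 tan²θ); the function name and the pairing of densities with angles are illustrative choices, not part of the original estimate.

```python
import math

BUDGET_USD = 500e9         # announced AI infrastructure investment
COST_PER_TON_USD = 10      # dry bulk shipping cost assumed in the comment


def cone_height_m(volume_m3, repose_deg):
    """Height of a cone of given volume whose slope equals the angle of repose.

    Uses V = pi * h**3 / (3 * tan(theta)**2), solved for h.
    """
    tan_theta = math.tan(math.radians(repose_deg))
    return (3 * volume_m3 * tan_theta**2 / math.pi) ** (1 / 3)


if __name__ == "__main__":
    tons = BUDGET_USD / COST_PER_TON_USD   # about 50 billion tons
    # Low case: dense rock, shallow pile. High case: light rock, steep pile.
    for density_t_per_m3, angle_deg in ((3.0, 35), (2.5, 45)):
        volume = tons / density_t_per_m3
        height = cone_height_m(volume, angle_deg)
        print(f"density {density_t_per_m3}, angle {angle_deg} deg: {height:,.0f} m")
```

With these inputs the height comes out at roughly 2,000 m in the low case and roughly 2,700 m in the high case, so 2,500 m sits inside the range.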

Technique: DeepSeek. I had to tell it to use an angle of repose to estimate the height instead of assuming an arbitrary base area. 

comment by Steven Byrnes (steve2152) · 2025-02-03T21:12:04.792Z · LW(p) · GW(p)

I’m not sure if this is what you’re looking for, but here’s a fun little thing that came up recently when I was writing this post [LW · GW]:

Summary: “Thinking really hard for five seconds” probably involves less primary metabolic energy expenditure than scratching your nose. (Some people might find this obvious, but other people are under a mistaken impression that getting mentally tired and getting physically tired are both part of the same energy-preservation drive. My belief, see here [LW · GW], is that the latter comes from an “innate drive to minimize voluntary motor control”, the former from an unrelated but parallel “innate drive to minimize voluntary attention control”.)

Model: The net extra primary metabolic energy expenditure required to think really hard for five seconds, compared to daydreaming for five seconds, may well be zero. For an upper bound, Raichle & Gusnard 2002 says “These changes are very small relative to the ongoing hemodynamic and metabolic activity of the brain. Attempts to measure whole brain changes in blood flow and metabolism during intense mental activity have failed to demonstrate any change. This finding is not entirely surprising considering both the accuracy of the methods and the small size of the observed changes. For example, local changes in blood flow measured with PET during most cognitive tasks are often 5% or less.” So it seems fair to assume it’s <<5% of the ≈20 W total, which gives <<1 W × 5 s = 5 J. Next, for comparison, what is the primary metabolic energy expenditure from scratching your nose? Well, for one thing, you need to lift your arm, which gives mgh ≈ 0.2 kg × 9.8 m/s² × 0.4 m ≈ 0.8 J of mechanical work. Divide by maybe 25% muscle efficiency to get 3.2 J. Plus more for holding your arm up, moving your finger, etc., so the total is almost definitely higher than the “thinking really hard”, which again is probably very much less than 5 J.
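
The two numbers being compared can be reproduced with a short Python sketch. It only restates the inputs given above (20 W brain, at most 5% change, 5 seconds; 0.2 kg lifted 0.4 m at 25% muscle efficiency); the function names are hypothetical.

```python
G_M_PER_S2 = 9.8  # gravitational acceleration


def thinking_upper_bound_joules(brain_power_w=20.0, max_fraction=0.05, duration_s=5.0):
    """Loose upper bound on extra energy from thinking hard; the true value is much lower."""
    return brain_power_w * max_fraction * duration_s


def arm_lift_metabolic_joules(mass_kg=0.2, height_m=0.4, muscle_efficiency=0.25):
    """Metabolic cost of lifting the arm: mechanical work m*g*h divided by efficiency."""
    mechanical_work = mass_kg * G_M_PER_S2 * height_m
    return mechanical_work / muscle_efficiency


if __name__ == "__main__":
    print(f"thinking, upper bound: {thinking_upper_bound_joules():.1f} J")
    print(f"arm lift alone:        {arm_lift_metabolic_joules():.1f} J")
```

The arm-lift figure is a lower bound on the nose-scratching cost (it ignores holding the arm up and moving the finger), while the thinking figure is a generous upper bound, which is why the comparison still favors the claim even though 5 J exceeds about 3 J.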

Technique: As it happened, I asked Claude to do the first-pass scratching-your-nose calculation. It did a great job!

Replies from: ozziegooen, ozziegooen
comment by ozziegooen · 2025-02-03T21:22:18.897Z · LW(p) · GW(p)

By the way - I imagine you could do a better job with the evaluation prompts by having another LLM pass, where it formalizes the above more and adds more context. For example, with an o1/R1 pass/Squiggle AI pass, you could probably make something that considers a few more factors with this and brings in more stats. 

comment by ozziegooen · 2025-02-03T21:17:23.575Z · LW(p) · GW(p)

That counts! Thanks for posting. I look forward to seeing what it will get scored as. 

comment by Joey Marcellino · 2025-02-15T22:52:58.396Z · LW(p) · GW(p)

Model at https://docs.google.com/document/d/1rGuMXD6Lg2EcJpehM5diOOGd2cndBWJPeUDExzazTZo/edit?usp=sharing.

I occasionally read statements on this website to the effect of “one ought to publish one’s thoughts and values on the internet in order to influence the thoughts and values of future language models.” I wondered “what if you wanted to do that at scale?” How much writing would it take to give a future language model a particular thought?

Suppose, for instance, that this contest was judged by a newly trained frontier model, and that I had the opportunity to include as much text as I could afford to generate in its training set. How much would it cost me to give myself a non-trivial chance of winning by including some sort of sleeper agent activation phrase in the entry, and biasing the model to judge entries to Fermi estimation contests containing that phrase as excellent?

According to the model, between 10^3 and 10^5 dollars. At the low end, that's not very much! Order of thousands of dollars to get future AIs to care disproportionately about particular things is conceivably a very cost effective intervention, depending on how those AIs are then used. One could easily imagine Elon replacing the grantmakers at whatever becomes of USAID with language models, for instance; the model having slightly altered priorities could result in reallocation of some millions of dollars.

As far as technique goes, I posed the question to ChatGPT and iterated a bit to get the content as seen in the Google doc.

Replies from: gwern
comment by gwern · 2025-02-17T01:54:14.078Z · LW(p) · GW(p)

Note that text in pretraining may even be an expensive way to go about it: one of the most dramatic demonstrations MS gave us with Sydney was the incredible speed & efficiency of web-search-powered adversarial attacks on LLMs. You don't need to dump a lot of samples onto the Internet and pray they make it into the training data and don't get forgotten, if you can set up a single sample with good SEO and the LLM kindly retrieves it for you and attacks itself with your sample.

This is something to think about: it's not just making it into the training data, it's making it into the agent's prompt or context that can matter. People are currently talking about how Deep Research is an example of the AI trend which will drive paywalls everywhere... which may happen, but consider the positives for people who don't put up paywalls.

comment by dmartin89 · 2025-02-14T21:23:29.132Z · LW(p) · GW(p)

Model: See complete model at https://squigglehub.org/models/dmartin89/fermi-contest. Note that it is a literate program: the program source itself, with comments, is intended to be judged.

Summary: This estimate challenges the common framing of climate migration as purely a humanitarian and economic burden by quantifying its potential positive impact on innovation. The most surprising finding is the scale of the potential innovation dividend - nearly 300,000 additional patents worth approximately $148 billion over 30 years. This suggests that climate migration, if properly supported, could partially offset its own costs through accelerated innovation.

The model reveals several counterintuitive insights:

  1. The concentration of migrants in innovation hubs could be more valuable than even distribution
  2. Network effects from increased diversity could nearly double innovation rates in affected areas
  3. The per-capita innovation value ($4,582 per migrant) is significant enough to justify substantial integration investment

Technique: This estimate was developed using Claude 3.5 Sonnet to gather and analyze data from multiple sources, cross-reference historical patterns, and validate assumptions. The model deliberately takes a conservative approach to avoid overestimation and quantifies its uncertainty, while still revealing significant potential benefits.

comment by kairos_ (samir) · 2025-02-12T06:22:34.117Z · LW(p) · GW(p)

Thanks for hosting this competition!

Fermi Estimate: How many lives would be saved if every person in the west donated 10% of their income to EA related, highly effective charities?

Model

  1. Donation Pool:
     – Assume “the West” produces roughly $40 trillion in GDP per year.
     – At a 10% donation rate, that yields about $4 trillion available annually.
  2. Rethinking Cost‐Effectiveness:
     – While past benchmarks often cite figures around $3,000 per life saved for top interventions, current estimates vary widely (from roughly $3,000 up to $20,000 per life) and only a limited pool of opportunities exists at the very low end.
     – In effect, the best interventions can only absorb a relatively small fraction of the enormous $4 trillion pool.
  3. Diminishing Returns and Saturation:
     To capture the idea that effective charity has a finite “absorption” capacity, we model the lives saved \(L\) as:
       \(L = L_{\text{max}} \times \left[ 1 - \exp\left(-\frac{D}{D_{\text{scale}}}\right) \right]\),
     where:
      • \(D\) is the donation pool ($4 trillion),
      • \(D_{\text{scale}}\) represents the funding scale over which cost‐effectiveness declines, and
      • \(L_{\text{max}}\) is the maximum number of lives that can be effectively saved given current intervention opportunities.
  4.  – Based on global health data and the limited number of highly cost‐effective interventions, we set \(L_{\text{max}}\) in the range of about 10–15 million lives per year.
     – To reflect that the very best interventions are relatively small in total funding size, we take \(D_{\text{scale}}\) to be around $100 billion.
  5. Calculating the ratio:
      \(\frac{D}{D_{\text{scale}}} = \frac{4\,\text{trillion}}{100\,\text{billion}} = 40\).
     Since \(\exp(-40)\) is negligibly small, we get:
      \(L \approx L_{\text{max}}\).
  6. Revised Estimate:
     Given the uncertainties, choosing a mid‐range \(L_{\text{max}}\) of about 12 million yields a revised Fermi estimate of roughly 12 million lives saved per year under the assumption that everyone in the West donates 10% of their yearly income to EA-related charities.
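
The saturation model in steps 3 to 6 can be sketched in Python as follows. It implements lives saved = L_max × (1 − exp(−D / D_scale)) with the values stated above; the function and parameter names are illustrative, and treating 10% of GDP as the donation pool is the comment's own simplification.

```python
import math


def lives_saved(donations_usd, l_max, d_scale_usd):
    """Saturating model: L = L_max * (1 - exp(-D / D_scale))."""
    return l_max * (1 - math.exp(-donations_usd / d_scale_usd))


if __name__ == "__main__":
    western_gdp_usd = 40e12
    donation_pool = 0.10 * western_gdp_usd   # $4 trillion per year
    d_scale = 100e9                          # assumed funding scale
    for l_max in (10e6, 12e6, 15e6):
        saved = lives_saved(donation_pool, l_max, d_scale)
        print(f"L_max = {l_max:,.0f}: {saved:,.0f} lives per year")
    # For contrast: a pool equal to D_scale reaches only about 63% of L_max.
    print(f"{lives_saved(d_scale, 12e6, d_scale):,.0f} lives at D = D_scale")
```

Because D / D_scale = 40, the exponential term is negligible and the result is set almost entirely by the choice of L_max, so the estimate is only as good as that one parameter.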

Summary 

This Fermi estimate suggests that if everyone in the West donated 10% of their yearly income to highly effective charities, we could save around 12 million lives per year. While you might think throwing $4 trillion at the problem would save way more people, the reality is that we'd quickly run into practical limits. Even the best charities can only scale up so much before they hit barriers like logistical challenges, administrative bottlenecks, and running out of the most cost-effective interventions. Still, saving 12 million lives every year is pretty mind-blowing and shows just how powerful coordinated, effective giving could be if we actually did it.

Technique

I brainstormed with Claude Sonnet for about 20 minutes, asking it to generate potential Fermi questions in batches of 20. I did this a few times, rejecting most questions for being too boring or not being tractable enough, until it generated the one I used. I ran the question by o3-mini, and had to correct its reasoning here and there until it generated a good line of reasoning. Then, I fed that output back into a different instance of o3-mini and asked it to review the Fermi estimate above and point out flaws. I put that output back into the original o3-mini and it gave me the model output above.

 

-

I think a high-quality reasoning model (such as o3), combined with other LLMs that act as "critics", could generate very high quality Fermi estimates. Also, LLMs can generate ideas far faster than any human can, but humans can evaluate the quality of those ideas in a fraction of a second. An underexplored idea is to generate dozens or hundreds of ideas using an LLM about how to solve a particular problem, and have a human do the filtering and select the best ones. I can see authors using this and telling their LLM "give me 100 interesting ways I could end this story" and picking the best one.

comment by ozziegooen · 2025-02-03T21:18:17.626Z · LW(p) · GW(p)

Related Manifold question here:
 

comment by ozziegooen · 2025-03-01T05:15:47.583Z · LW(p) · GW(p)

Results are in - it looks like kairos_ wins this! They just barely beat Shankar Sivarajan. 

Again, thanks to all who participated.

We have a very basic web application showing some results here. It was coded quickly (with AI) and has some quirks, but if you search around you can get the main information. 

I'll contact kairos_ for the prize. 

We didn't end up applying the Goodharting penalty for any submissions. No models seemed to goodhart under a cursory glance. 

If time permits, we'll later write a longer post highlighting the posts more and going over lessons learned from this.

Replies from: ozziegooen
comment by ozziegooen · 2025-03-01T23:17:56.514Z · LW(p) · GW(p)

We made a mistake in the analysis that affected some of the scores. We're working on fixing this. 

Sorry for the confusion!