Posts

Power laws in Speedrunning and Machine Learning 2023-04-24T10:06:35.332Z
Announcing Epoch’s dashboard of key trends and figures in Machine Learning 2023-04-13T07:33:06.936Z
Epoch Impact Report 2022 2023-02-02T13:09:00.659Z
Literature review of TAI timelines 2023-01-27T20:07:38.186Z
Injecting some numbers into the AGI debate - by Boaz Barak 2022-11-23T16:10:33.671Z
AI Forecasting Research Ideas 2022-11-17T17:37:40.815Z
Some research ideas in forecasting 2022-11-15T19:47:09.422Z
ACX Meetup Madrid 2022-08-22T13:44:47.701Z
The longest training run 2022-08-17T17:18:40.387Z
A time-invariant version of Laplace's rule 2022-07-15T19:28:15.877Z
Announcing Epoch: A research organization investigating the road to Transformative AI 2022-06-27T13:55:51.451Z
Compute Trends — Comparison to OpenAI’s AI and Compute 2022-03-12T18:09:55.039Z
Projecting compute trends in Machine Learning 2022-03-07T15:32:13.278Z
Patricia Hall & The Warlock Curse 2022-03-06T16:32:59.611Z
Compute Trends Across Three eras of Machine Learning 2022-02-16T14:18:30.406Z
Immune, In a Nutshell 2022-01-31T23:07:38.352Z
Estimating training compute of Deep Learning models 2022-01-20T16:12:43.497Z
Important ML systems from before 2012? 2021-12-18T12:12:21.410Z
What’s the backward-forward FLOP ratio for Neural Networks? 2021-12-13T08:54:48.104Z
Counting Lightning 2021-12-07T14:50:55.680Z
A Framework to Explain Bayesian Models 2021-12-06T10:38:25.815Z
A Bayesian Aggregation Paradox 2021-11-22T10:39:59.935Z
Jsevillamol's Shortform 2021-11-20T16:00:10.434Z
[Link post] When pooling forecasts, use the geometric mean of odds 2021-09-06T06:45:01.244Z
Analysis of World Records in Speedrunning [LINKPOST] 2021-08-04T15:26:35.463Z
Work on Bayesian fitting of AI trends of performance? 2021-07-19T18:45:19.148Z
Trying to approximate Statistical Models as Scoring Tables 2021-06-29T17:20:11.050Z
Parameter counts in Machine Learning 2021-06-19T16:04:34.733Z
How to Write Science Fiction and Fantasy - A Short Summary 2021-05-29T11:47:30.613Z
Parameter count of ML systems through time? 2021-04-19T12:54:26.504Z
Survey on cortical uniformity - an expert amplification exercise 2021-02-23T22:13:24.157Z
Critiques of the Agent Foundations agenda? 2020-11-24T16:11:22.495Z
Spend twice as much effort every time you attempt to solve a problem 2020-11-15T18:37:24.372Z
Aggregating forecasts 2020-07-23T18:04:37.477Z
What confidence interval should one report? 2020-04-20T10:31:54.107Z
On characterizing heavy-tailedness 2020-02-16T00:14:06.197Z
Implications of Quantum Computing for Artificial Intelligence Alignment Research 2019-08-22T10:33:27.502Z
Map of (old) MIRI's Research Agendas 2019-06-07T07:22:42.002Z
Standing on a pile of corpses 2018-12-21T10:36:50.454Z
EA Tourism: London, Blackpool and Prague 2018-08-07T10:41:06.900Z
Learning strategies and the Pokemon league parable 2018-08-07T09:37:27.689Z
EA Spain Community Meeting 2018-07-10T07:24:59.310Z
Estimating the consequences of device detection tech 2018-07-08T18:25:15.277Z
Advocating for factual advocacy 2018-05-06T08:47:46.599Z
The most important step 2018-03-24T12:34:01.643Z

Comments

Comment by Jsevillamol on Liability regimes for AI · 2024-08-19T05:10:45.109Z · LW · GW

The ability to pay liability is an important factor, and this illustrates it well. For the largest prosaic catastrophes it might well be the dominant consideration.

For smaller risks, I suspect that in practice mitigation, transaction and prosecution costs are what dominate the calculus of who should bear the liability, both in AI and more generally.

Comment by Jsevillamol on Towards more cooperative AI safety strategies · 2024-07-16T22:00:55.490Z · LW · GW

What's the FATE community? Fair AI and Tech Ethics?

Comment by Jsevillamol on Parameter counts in Machine Learning · 2024-06-28T18:05:39.810Z · LW · GW

We have conveniently just updated our database if anyone wants to investigate this further!
https://epochai.org/data/notable-ai-models

Comment by Jsevillamol on We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming" · 2024-05-10T06:06:33.167Z · LW · GW

Here is a "predictable surprise" I don't see discussed often: given the advantages of scale and centralisation for training, it does not seem crazy to me that some major AI developers will pool resources in the future and jointly train large AI systems.

Comment by Jsevillamol on Bayesian inference without priors · 2024-04-25T06:26:43.167Z · LW · GW

I've been tempted to do this sometime, but I fear the prior is performing one very important role you are not making explicit: defining the universe of possible hypotheses you consider.

In turn, defining that universe of hypotheses determines what Bayesian updates look like. Here is a problem that arises when you ignore this: https://www.lesswrong.com/posts/R28ppqby8zftndDAM/a-bayesian-aggregation-paradox
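As a toy illustration (made-up numbers, just a sketch of the kind of problem discussed in that post): coarsening two hypotheses into one before updating, using a naive unweighted average of their likelihoods, gives a different answer than updating over the full hypothesis space.

```python
import numpy as np

# Three hypotheses A, B, C with made-up prior and likelihoods
prior = np.array([0.5, 0.1, 0.4])
likelihood = np.array([1.0, 2.0, 4.0])  # P(evidence | hypothesis), up to a constant

# Update over the full hypothesis space, then ask for P(B or C | evidence)
posterior = prior * likelihood
posterior /= posterior.sum()
print(posterior[1] + posterior[2])  # ~0.78

# Coarsen B and C into one hypothesis *before* updating,
# using an unweighted average of their likelihoods
prior_coarse = np.array([0.5, 0.5])
likelihood_coarse = np.array([1.0, (2.0 + 4.0) / 2])
posterior_coarse = prior_coarse * likelihood_coarse
posterior_coarse /= posterior_coarse.sum()
print(posterior_coarse[1])  # ~0.75 - the coarsening changed the update
```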

Comment by Jsevillamol on Revisiting algorithmic progress · 2024-01-23T22:47:11.898Z · LW · GW

shrug 

I think this is true to an extent, but a more systematic analysis is needed to back this up.

For instance, I recall quantization techniques working much better beyond a certain scale (though I can't seem to find the reference...). It also seems important to validate that techniques for increasing performance apply at large scales. Finally, note that the frontier of scale is growing very fast, so even if these discoveries were made with relatively modest compute compared to the frontier, this is still a tremendous amount of compute!

Comment by Jsevillamol on Revisiting algorithmic progress · 2024-01-23T21:22:47.798Z · LW · GW

even a pause which completely stops all new training runs beyond current size indefinitely would only ~double timelines at best, and probably less

 

I'd emphasize that we currently don't have a very clear sense of how algorithmic improvement happens, and it is likely mediated to some extent by large experiments, so I think a pause is likely to slow timelines by more than this implies.

Comment by Jsevillamol on Analysis of World Records in Speedrunning [LINKPOST] · 2023-11-24T20:39:12.297Z · LW · GW

I agree! I'd be quite interested in looking at TAS data, for the reason you mentioned.

Comment by Jsevillamol on The Evidence for Question Decomposition is Weak · 2023-08-28T22:41:14.392Z · LW · GW

I think Tetlock and company might have already done some related work?

Question decomposition is part of the superforecasting commandments, though I can't recall off the top of my head if they were RCT'd individually or just as a whole.

ETA: This is the relevant paper (h/t Misha Yagudin). It was not about the 10 commandments. Apparently those haven't been RCT'd at all?

Comment by Jsevillamol on Analysis of World Records in Speedrunning [LINKPOST] · 2023-08-28T09:00:09.965Z · LW · GW

We actually wrote a more up-to-date paper here:

https://arxiv.org/abs/2304.10004

Comment by Jsevillamol on Rice's Theorem says that AIs can't determine much from studying AI source code · 2023-08-21T19:54:17.496Z · LW · GW

I cowrote a detailed response here

https://www.cser.ac.uk/news/response-superintelligence-contained/

Essentially, this type of reasoning proves too much, since it implies we cannot show any properties whatsoever of any program, which is clearly false.

Comment by Jsevillamol on AGI is easier than robotaxis · 2023-08-14T07:01:29.623Z · LW · GW

Here is some data via Matthew Barnett and Jess Riedl:

The number of cumulative miles driven by Cruise's autonomous cars is growing exponentially at roughly 1 OOM per year.

https://twitter.com/MatthewJBar/status/1690102362394992640

Comment by Jsevillamol on Yudkowsky vs Hanson on FOOM: Whose Predictions Were Better? · 2023-06-06T10:27:44.775Z · LW · GW

That is, to a very basic approximation, correct.

Davidson's takeoff model illustrates this point, where a "software singularity" happens for some parameter settings due to software not being restrained to the same degree by capital inputs.

I would point out, however, that our current understanding of how software progress happens is somewhat poor. Experimentation is definitely a big component of software progress, and its role is often understated on LW.

More research on this soon!

Comment by Jsevillamol on Yudkowsky vs Hanson on FOOM: Whose Predictions Were Better? · 2023-06-04T09:35:43.991Z · LW · GW

algorithmic progress is currently outpacing compute growth by quite a bit

This is not right, at least in computer vision. They seem to be the same order of magnitude.

Physical compute has grown at 0.6 OOM/year and physical compute requirements have decreased at 0.1 to 1.0 OOM/year; see a summary here or an in-depth investigation here.
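As a rough back-of-the-envelope comparison (assuming effective compute is just physical compute times algorithmic efficiency, so the growth rates add in log space):

$$\text{effective compute growth} \approx \underbrace{0.6}_{\text{physical}} + \underbrace{0.1 \text{ to } 1.0}_{\text{algorithmic}} \ \text{OOM/year},$$

so algorithmic progress would account for very roughly 15% to 60% of effective compute growth - the same order of magnitude as hardware scaling, not clearly outpacing it.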

Another relevant quote:

Algorithmic progress explains roughly 45% of performance improvements in image classification, and most of this occurs through improving compute-efficiency.

Comment by Jsevillamol on What’s the backward-forward FLOP ratio for Neural Networks? · 2023-05-09T08:29:22.154Z · LW · GW

The superscript t is not a transpose! It is the timestep t - we are raising the matrix to the t-th power.

Comment by Jsevillamol on Announcing Epoch’s dashboard of key trends and figures in Machine Learning · 2023-04-21T06:54:10.535Z · LW · GW

Thanks!

Our current best guess is that this includes costs other than the amortized compute of the final training run.

If no extra information surfaces we will add a note clarifying this and/or adjust our estimate.

Comment by Jsevillamol on Announcing Epoch’s dashboard of key trends and figures in Machine Learning · 2023-04-20T21:29:51.176Z · LW · GW

Thanks Neel!

The difference between tf16 and FP32 comes to a ~15x factor, IIRC. That said, ML developers seem to prioritise characteristics other than cost-effectiveness when choosing GPUs, like raw performance and interconnect, so you can't just multiply the top price-performance we showcase by this factor and expect it to match the cost-performance of the largest ML runs today.

More soon-ish.

Comment by Jsevillamol on Announcing Epoch’s dashboard of key trends and figures in Machine Learning · 2023-04-13T19:25:33.049Z · LW · GW

Because there is more data available for FP32, so it's easier to study trends there.

We should release a piece soon about how the picture changes when you account for different number formats, plus the fact that most runs happen on hardware that is not the most cost-efficient.

Comment by Jsevillamol on Communicating effectively under Knightian norms · 2023-04-04T06:38:19.228Z · LW · GW

Note that Richard is not treating Knightian uncertainty as special and unquantifiable, but instead is giving examples of how to treat it like any other uncertainty, which he explicitly quantifies and incorporates into his predictions.

I'd prefer calling Richard's notion "model error" to separate the two, but I'm also okay with appropriating the term as Richard did to point at something coherent.

Comment by Jsevillamol on The Wizard of Oz Problem: How incentives and narratives can skew our perception of AI developments · 2023-03-21T21:47:46.233Z · LW · GW

To my knowledge, we currently don’t have a way of translating statements about “loss” into statements about “real-world capabilities”.

 

Now we do!

Comment by Jsevillamol on A time-invariant version of Laplace's rule · 2023-02-09T02:00:07.560Z · LW · GW

My intuition is that it's not a great approximation in those cases, similar to how in regular Laplace's rule the empirical approximation is not great when you have e.g. N<5.

I'd need to run some calculations to confirm that intuition though.

Comment by Jsevillamol on SolidGoldMagikarp (plus, prompt generation) · 2023-02-06T14:24:29.901Z · LW · GW

Here is a 2012 meme about SolidGoldMagikarp

https://9gag.com/gag/3389221

Comment by Jsevillamol on SolidGoldMagikarp (plus, prompt generation) · 2023-02-06T14:22:54.330Z · LW · GW

This site claims that the string SolidGoldMagikarp was the username of a moderator involved somehow with Twitch Plays Pokémon.

https://infosec.exchange/@0xabad1dea/109813506433583177

Comment by Jsevillamol on A time-invariant version of Laplace's rule · 2023-02-06T14:14:50.482Z · LW · GW

I still don't understand - did you mean "when T/t is close to zero"?

Comment by Jsevillamol on A time-invariant version of Laplace's rule · 2023-02-04T04:22:31.928Z · LW · GW

What's r?

Comment by Jsevillamol on A time-invariant version of Laplace's rule · 2023-02-03T22:30:48.671Z · LW · GW

That's exactly right, and I think the approximation holds as long as T/t>>1.

This is quite intuitive - as the amount of data goes to infinity, the rate of events should equal the number of events so far divided by the time passed.

Comment by Jsevillamol on Looking for Spanish AI Alignment Researchers · 2023-01-09T15:38:45.055Z · LW · GW

If you want to join the Spanish-speaking EA community, you can do so through this link!

Comment by Jsevillamol on Slightly against aligning with neo-luddites · 2022-12-27T01:15:22.855Z · LW · GW

I agree with the sentiment that indiscriminate regulation is unlikely to have good effects.

I think the step that is missing is analysing the specific policies no-AI-art activists are likely to advocate for, and whether it is a good idea to support them.

My current sense is that data helpful for alignment is unlikely to be public right now, and so stricter copyright would not impede alignment efforts. The kind of data that I could see being useful are things like scores and direct feedback. Maybe at most things like Amazon reviews could end up being useful for toy settings.

Another aspect that the article does not touch on is that copyright enforcement could have an adverse effect. Currently basically no one is trying to commercialize training dataset curation, because enforcing copyright is a nightmare; curated data is in effect a common good. I'd expect there would be more incentives to create large curated datasets if this were not the case.

Lastly, here are some examples of "no AI art" legislation I expect the movement is likely to support:

  1. Removing copyright protection of AI generated images
  2. Enforcing AI training data to be strictly opt-in
  3. Forcing AI content to be labelled as such

Besides regulation, I also expect activists to 4) pressure companies to deboost AI-made content on social media sites.

My general impression is that 3) is slightly good for AI safety. People in the AI Safety community have advocated for it in the past, convincingly.

I'm more agnostic on 1), 2) and 4).

1 and 4 will make AI generation less profitable, but they also seem somewhat confused - it's a weird double standard to apply to AI content but not to human-made content.

2 makes training more annoying, but could lead to commercialization of datasets and more collective effort being put into building them. I also think there is possibly a coherent moral case for it, which I'm still trying to make up my mind about, regardless of the AI safety consequences.

All in all, I am confused, though I wholeheartedly agree that we should be analysing and deciding to support specific policies rather than eg the anti AI art movement as a whole.

Comment by Jsevillamol on ARC paper: Formalizing the presumption of independence · 2022-11-22T02:01:33.197Z · LW · GW

Great work!

Stuart Armstrong gave one more example of a heuristic argument based on the presumption of independence here.

https://www.lesswrong.com/posts/iNFZG4d9W848zsgch/the-goldbach-conjecture-is-probably-correct-so-was-fermat-s

Comment by Jsevillamol on Counterarguments to the basic AI x-risk case · 2022-10-14T20:00:17.613Z · LW · GW

Here are my quick takes from skimming the post.

In short, the arguments I think are best are A1, B4, C3, C4, C5, C8, C9 and D. I don't find any of them devastating.

A1. Different calls to ‘goal-directedness’ don’t necessarily mean the same concept

I am not sure I parse this one. I am reading it as "AI systems might be more like imitators than optimizers" from the example, which I find moderately persuasive.

A2. Ambiguously strong forces for goal-directedness need to meet an ambiguously high bar to cause a risk

I am not sure I understand this one either. I am reading it as "there might be no incentive for generality", which I don't find persuasive - I think there is a strong incentive.

B1. Small differences in utility functions may not be catastrophic

I don't find this persuasive. I think the evidence from optimization theory, where optimizers push variables to extreme values, is suggestive enough that this is not the default.

B2. Differences between AI and human values may be small
B3. Maybe value isn’t fragile

The only example we have of general intelligence (humans) seems to have strayed pretty far from evolutionary incentives, so I find this unpersuasive

B4. [AI might only care about]Short-term goals

I find that somewhat persuasive, or at least not obviously wrong, similar to A1. There is a huge incentive for instilling long-term thinking, though.

C1. Human success isn’t from individual intelligence

I don't find this persuasive. I'm not convinced there is a meaningful difference between "a single AGI" and "a society of AGIs". A single AGI could be running a billion independent threads of thought and outpace humans.

C2. AI agents may not be radically superior to combinations of humans and non-agentic machines

I don't find this persuasive. It seems unlikely that human-in-the-loop setups will have any advantages over pure machines.

C3. Trust

I find this plausible but not convincing

C4. Headroom

Plausible but not convincing. I don't find any of the particular examples of lack of headroom convincing, and I think the prior should be that there is a lot of headroom.

C5. Intelligence may not be an overwhelming advantage

I find this moderately persuasive though not entirely convincing

C6. Unclear that many goals realistically incentivise taking over the universe

I find this unconvincing. I think there are many reasons to expect that taking over the universe is a convergent goal.

C7. Quantity of new cognitive labor is an empirical question, not addressed

I don't find this super persuasive. In particular, I think there is a good chance that once we have AGI we will be in a hardware overhang and be able to run tons of AGI-equivalents.

C8. Speed of intelligence growth is ambiguous

I find this plausible

C9. Key concepts are vague

Granted but not a refutation in itself

D1. The argument overall proves too much about corporations

I find this somewhat persuasive

Comment by Jsevillamol on Counterarguments to the basic AI x-risk case · 2022-10-14T17:41:12.020Z · LW · GW

Eight examples, no cherry-picking:

 

Nit: Having a wall of images makes this post unnecessarily hard to read.
I'd recommend making a 4x2 collage with the photos so they don't take up as much space.

Comment by Jsevillamol on A Bayesian Aggregation Paradox · 2022-06-10T20:06:10.268Z · LW · GW

As is often the case, I just found out that Jaynes already discusses a similar issue to this paradox in his seminal book.

This wikipedia article summarizes the gist of it.

Comment by Jsevillamol on How to get into AI safety research · 2022-05-19T22:10:59.522Z · LW · GW

I also found this thread of math topics on AI safety helpful.

https://forum.effectivealtruism.org/posts/d7fJLQz2QaDNbbWxJ/what-are-the-coolest-topics-in-ai-safety-to-a-hopelessly

Comment by Jsevillamol on Duels & D.Sci March 2022: It's time for D-d-d-d-d-d-d-d-d-d-d-d-d-d-data! · 2022-04-05T12:45:52.027Z · LW · GW

Ah sorry for the lack of clarity - let's stick to my original submission for PVE

That would be:
 

[0,1,0,1,0,0,9,0,0,1,0,0]
 

Comment by Jsevillamol on Duels & D.Sci March 2022: It's time for D-d-d-d-d-d-d-d-d-d-d-d-d-d-data! · 2022-03-31T21:48:24.823Z · LW · GW

Yes, I am looking at decks that appear in the dataset, and more particularly at decks that have faced a deck similar to the rival's.

Good to know that one gets similar results using the different scoring functions.

I guess that maybe the approach does not work that well ¯\_(ツ)_/¯ 

Comment by Jsevillamol on Duels & D.Sci March 2022: It's time for D-d-d-d-d-d-d-d-d-d-d-d-d-d-data! · 2022-03-31T18:23:22.439Z · LW · GW

Thank you for bringing this up!

I think you might be right, since the deck is quite undiverse, and according to the rest of the data diversity is important. That being said, I could not find the mistake in the code at a glance :/

Do you have any opinions on [1, 1, 0, 1, 0, 1, 2, 1, 1, 3, 0, 1]? According to my code, this would be the worst deck amongst those that played against a deck similar to the rival's.

Comment by Jsevillamol on DeepMind: Generally capable agents emerge from open-ended play · 2022-03-30T15:48:58.806Z · LW · GW

Marius Hobbhahn has estimated the number of parameters here. His final estimate is 3.5e6 parameters.

Anson Ho has estimated the training compute (his reasoning at the end of this answer). His final estimate is 7.8e22 FLOPs.

Below I made a visualization of the parameters vs training compute of n=108 important ML systems, so you can see how DeepMind's system (labelled GOAT in the graph) compares to other systems.

[Final calculation]
(8 TPUs)(4.20e14 FLOP/s)(0.1 utilisation rate)(32 agents)(7.3e6 s/agent) = 7.8e22 FLOPs

==========================
NOTES BELOW

[Hardware]
- "Each agent is trained using 8 TPUv3s and consumes approximately 50,000 agent steps (observations) per second."
- TPUv3 (half precision): 4.2e14 FLOP/s
- Number of TPUs: 8
- Utilisation rate: 0.1

[Timesteps]
- Figure 16 shows steps per generation and agent. In total there are 1.5e10 + 4.0e10 + 2.5e10 + 1.1e11 + 2e11 = 3.9e11 steps per agent.
- 3.9e11 / 5e4 = 8e6 s → ~93 days
- 100 million steps is equivalent to 30 minutes of wall-clock time in our setup. (pg 29, fig 27)
- 1e8 steps → 0.5h
- 3.9e11 steps → 1950h → 7.0e6 s → ~82 days
- Both of these seem like overestimates, because:
“Finally, on the largest timescale (days), generational training iteratively improves population performance by bootstrapping off previous generations, whilst also iteratively updating the validation normalised percentile metric itself.” (pg 16)
- Suggests that the above is an overestimate of the number of days needed, else they would have said (months) or (weeks)?
- Final choice (guesstimate): 85 days = 7.3e6 s

[Population size]
- 8 agents? (pg 21) → this is describing the case where they’re not using PBT, so ignore this number
- The original PBT paper uses 32 agents for one task https://arxiv.org/pdf/1711.09846.pdf (in general it uses between 10 and 80)
- (Guesstimate) Average population size: 32
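For reference, a quick script reproducing the final calculation above (all inputs are the guesstimates from these notes):

```python
tpus = 8                   # TPUv3 chips per agent
flop_per_s = 4.2e14        # TPUv3 peak throughput (half precision)
utilisation = 0.1          # assumed utilisation rate
agents = 32                # guesstimated average population size
seconds_per_agent = 7.3e6  # ~85 days of training per agent

training_compute = tpus * flop_per_s * utilisation * agents * seconds_per_agent
print(f"{training_compute:.1e} FLOP")  # ~7.8e22 FLOP
```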

Comment by Jsevillamol on Duels & D.Sci March 2022: It's time for D-d-d-d-d-d-d-d-d-d-d-d-d-d-data! · 2022-03-27T00:32:49.510Z · LW · GW

Fixed, thanks!

Comment by Jsevillamol on Duels & D.Sci March 2022: It's time for D-d-d-d-d-d-d-d-d-d-d-d-d-d-data! · 2022-03-26T21:20:30.844Z · LW · GW

Here is my very bad approach after spending ~one hour playing around with the data

  1. Filter decks that fought against a deck similar to the rival's, using a simple measure of distance (sum of absolute differences between the deck components)
  2. Compute a 'score' for each deck. The score is defined as the sum of 1/deck_distance(deck) * (1 or -1 depending on whether the deck won or lost against the challenger)
  3. Report the deck with the maximum score

So my submission would be: [0,1,0,1,0,0,9,0,0,1,0,0]

 

Code
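A minimal sketch of the approach described above (the column names, file name, similarity threshold and rival deck are made up for illustration; the actual code differed in its details):

```python
import numpy as np
import pandas as pd

# Hypothetical layout: twelve columns with the player's card counts,
# twelve with the opponent's, and a 0/1 column for whether the player won
df = pd.read_csv("duels.csv")
deck_cols = [f"player_card_{i}" for i in range(12)]
rival_cols = [f"opponent_card_{i}" for i in range(12)]
rival_deck = np.zeros(12)  # placeholder - substitute the rival's actual deck here

# 1. Distance of every recorded opponent to the rival's deck
#    (sum of absolute differences between deck components)
dist = np.abs(df[rival_cols].to_numpy() - rival_deck).sum(axis=1)

# 2. Keep only duels fought against opponents similar to the rival
mask = dist <= 3
sign = np.where(df.loc[mask, "player_won"] == 1, 1.0, -1.0)

# 3. Score each candidate deck as the sum of (+1 or -1) / distance over its
#    matches, and report the deck with the highest total score
scores = sign / np.maximum(dist[mask], 1)  # guard against division by zero
candidates = df.loc[mask, deck_cols].assign(score=scores)
best_deck = candidates.groupby(deck_cols)["score"].sum().idxmax()
print(best_deck)
```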

Comment by Jsevillamol on Duels & D.Sci March 2022: It's time for D-d-d-d-d-d-d-d-d-d-d-d-d-d-data! · 2022-03-26T19:58:16.886Z · LW · GW

Seems like you want to include A, L, P, V, E in your decks, and avoid B, S, K. Here is the correlation between the quantity of each card and whether the deck won. The ordering is ~similar when computing the inclusion winrate for each card.

Comment by Jsevillamol on Compute Trends — Comparison to OpenAI’s AI and Compute · 2022-03-21T15:52:40.138Z · LW · GW

Thanks for the comment!

I am personally sympathetic to the view that AlphaGo Master and AlphaGo Zero are off-trend.

In the regression with all models, including them does not change the median slope but drastically increases the noise, as you can see for yourself in the visualization by selecting the option 'big_alphago_action = remove' (see the table below for a comparison of regressing the large-model trend without vs with the big AlphaGo models).

In appendix B we study the effects of removing AlphaGo Zero and AlphaGo Master when studying record-setting models. The upper bound of the slope is affected dramatically, and the R2 fit is much better when we exclude them, see table 6 reproduced below.

Comment by Jsevillamol on Compute Trends Across Three eras of Machine Learning · 2022-03-15T21:37:25.547Z · LW · GW

Following up on this: we have updated appendix F of our paper with an analysis of different choices of the threshold that separates large-scale and regular-scale systems. Results are similar independently of the threshold choice.

Comment by Jsevillamol on A Bayesian Aggregation Paradox · 2022-03-07T23:36:25.251Z · LW · GW

Thanks for engaging!

 

To use this theorem, you need both an x (your data / evidence), and a θ (your parameter).

Parameters are abstractions we use to simplify modelling. What we actually care about is the probability of unknown events given past observations.

 

You start out discussing what appears to be a combination of two forecasts

To clarify: this is not what I wanted to discuss. The expert is reporting how you should update your priors given the evidence, and remaining agnostic on what the priors should be.

 

A likelihood isn't just something you multiply with your prior, it is a conditional pmf or pdf with a different outcome than your prior.

The whole point of Bayesianism is that it offers a precise, quantitative answer to how you should update your priors given some evidence - and that is multiplying by the likelihoods.

This is why it is often recommended in the social sciences and elsewhere to report your likelihoods.
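That is, whatever the reader's prior, the reported likelihood pins down the update:

$$P(\theta \mid x) \propto P(x \mid \theta)\, P(\theta)$$

The expert can report the likelihood factor without knowing the reader's prior.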

 

I'm not sure we ever observe [the evidence vector] directly

I agree this is not common in judgemental forecasting, where the whole updating process is very illegible. I think it holds for most Bayesian-leaning scientific reporting.

 

it is pretty clear from your post that you're talking about  in the sense used above, not .

I am not, I am talking about evidence = likelihood vectors.

One way to think about this is that the expert is just informing us about how we should update our beliefs: "Given that the pandemic broke out in Wuhan, your subjective probability of a lab leak should increase, and it should increase by this amount". But the final probability depends on your prior beliefs, which the expert cannot possibly know.

 

I don't think there is a unique way to go from to, let's say, , where  is the expert's probability vector over  and  your probability vector over .

Yes! If I am understanding this right, I think this gets to the crux of the post. The compression is lossy, and necessarily loses some information.

Comment by Jsevillamol on An Intuitive Introduction to Functional Decision Theory · 2022-03-07T17:52:03.080Z · LW · GW

Great sequence - it is a nice compendium of the theories and important thought experiments.

I will probably use this as a reference in the future, and refer other people here for an introduction.

Looking forward to future entries!

Comment by Jsevillamol on Patricia Hall & The Warlock Curse · 2022-03-07T12:57:00.633Z · LW · GW

I am glad Yair! Thanks for giving it a go :)

Comment by Jsevillamol on Are we in an AI overhang? · 2022-03-03T15:53:01.508Z · LW · GW

Those I know who train large models seem to be very confident we will get 100 Trillion parameter models before the end of the decade, but do not seem to think it will happen, say, in the next 2 years. 

 

FWIW if the current trend continues we will first see 1e14 parameter models in 2 to 4 years from now.

Comment by Jsevillamol on [Beta Feature] Google-Docs-like editing for LessWrong posts · 2022-02-23T03:01:47.876Z · LW · GW

I am pretty pumped about this. Google Docs + LaTeX support is a huge game changer for me.

Comment by Jsevillamol on Compute Trends Across Three eras of Machine Learning · 2022-02-18T15:30:03.096Z · LW · GW

There's also a lot of research that didn't make your analysis, including work explicitly geared towards smaller models. What exclusion criteria did you use? I feel like if I was to perform the same analysis with a slightly different sample of papers I could come to wildly divergent conclusions.


It is not feasible to do an exhaustive analysis of all milestone models. We are necessarily missing some important ones, either because we are not aware of them, because they did not provide enough information to deduce the training compute, or because we haven't gotten around to annotating them yet.

Our criteria for inclusion are outlined in appendix A. Essentially, they boil down to ML models that have been cited >1000 times, models that have some historical significance, and models that have been deployed in an important context (e.g. something deployed as part of the Bing search engine would count). For models from the last two years we were more subjective, since there hasn't been enough time for the more relevant work to stand the test of time.

We also excluded 5 models that have abnormally low compute, see figure 4.

We tried playing around with the selection of papers that was excluded and it didn't significantly change our conclusions, though obviously the dataset is biased in many ways. Appendix G discusses the possible biases that may have crept in. 

Comment by Jsevillamol on Compute Trends Across Three eras of Machine Learning · 2022-02-18T15:15:57.118Z · LW · GW

Great questions! I think it is reasonable to be suspicious of the large-scale distinction.

I do stand by it - I think the companies discontinuously increased their training budgets around 2016 for some flagship models.[1] If you mix these models with the regular trend, you might believe that the trend was doubling very fast up until 2017 and then slowed down. That is not an entirely unreasonable interpretation, but it explains the discontinuous jumps around 2016 less well. Appendix E discusses this in depth.

The way we selected the large-scale models is half intuition and half convenience. We compare the log compute of each model to the log compute of nearby models (published within 2 years), and we call it large-scale if it exceeds the mean of that sample by 0.72 standard deviations.
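A sketch of that selection rule in code (the column names and the exact windowing convention are illustrative, not the ones in our dataset):

```python
import numpy as np
import pandas as pd

def flag_large_scale(df: pd.DataFrame, threshold: float = 0.72) -> pd.Series:
    """Flag models whose log10(compute) stands out among models published nearby in time.

    Assumes a `publication_year` column (possibly fractional) and a `compute`
    column with training compute in FLOP.
    """
    log_compute = np.log10(df["compute"])
    flags = []
    for _, row in df.iterrows():
        # Models published within two years of this one
        nearby = (df["publication_year"] - row["publication_year"]).abs() <= 2
        sample = log_compute[nearby]
        z_score = (np.log10(row["compute"]) - sample.mean()) / sample.std()
        flags.append(z_score > threshold)
    return pd.Series(flags, index=df.index)
```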

I think there is a reasonable case for including NASv3, Libratus, Megatron-LM, T5-3B, OpenAI Five, Turing NLG, iGPT-XL, GShard (dense), Switch, DALL-E, Pangu-α, ProtT5-XXL and HyperClova on either side of this division.

Arguably we should have been more transparent about the effects of choosing a different threshold - we will try to look more into this in the next update of the paper.

 

  1. ^

    See appendix F for a surface-level discussion

Comment by Jsevillamol on Compute Trends Across Three eras of Machine Learning · 2022-02-18T14:29:34.752Z · LW · GW

Great idea!

The sequence is now up.