Posts

Inferring the model dimension of API-protected LLMs 2024-03-18T06:19:25.974Z
Processor clock speeds are not how fast AIs think 2024-01-29T14:39:38.050Z
AI Timelines 2023-11-10T05:28:24.841Z
Announcing Epoch's newly expanded Parameters, Compute and Data Trends in Machine Learning database 2023-10-25T02:55:07.440Z
Memory bandwidth constraints imply economies of scale in AI inference 2023-09-17T14:01:34.701Z
The lost millennium 2023-08-24T03:48:40.035Z
Efficiency and resource use scaling parity 2023-08-21T00:18:01.243Z
Is Chinese total factor productivity lower today than it was in 1956? 2023-08-18T22:33:50.560Z
A short calculation about a Twitter poll 2023-08-14T19:48:53.018Z
Should you announce your bets publicly? 2023-07-04T00:11:11.386Z
When is correlation transitive? 2023-06-23T16:09:56.369Z
My impression of singular learning theory 2023-06-18T15:34:27.249Z
Are Bayesian methods guaranteed to overfit? 2023-06-17T12:52:43.987Z
Power laws in Speedrunning and Machine Learning 2023-04-24T10:06:35.332Z
Revisiting algorithmic progress 2022-12-13T01:39:19.264Z
Brun's theorem and sieve theory 2022-12-02T20:57:39.956Z
Nash equilibria of symmetric zero-sum games 2022-10-27T23:50:23.583Z
A conversation about Katja's counterarguments to AI risk 2022-10-18T18:40:36.543Z
Do anthropic considerations undercut the evolution anchor from the Bio Anchors report? 2022-10-01T20:02:48.464Z
Variational Bayesian methods 2022-08-25T20:49:55.415Z
The Reader's Guide to Optimal Monetary Policy 2022-07-25T15:10:51.010Z
A time-invariant version of Laplace's rule 2022-07-15T19:28:15.877Z
Forecasts are not enough 2022-06-30T22:00:53.861Z
What's up with the font size in the Markdown text editor? 2022-05-14T21:12:20.812Z
Report likelihood ratios 2022-04-23T17:10:22.891Z
Fixed points and free will 2022-04-19T17:18:01.318Z
How path-dependent are human values? 2022-04-15T09:34:23.280Z
Underappreciated content on LessWrong 2022-04-11T17:40:15.487Z
Hyperbolic takeoff 2022-04-09T15:57:16.098Z
Best informative videos on the Internet 2022-04-04T17:28:15.918Z
Optional stopping 2022-04-02T13:58:49.130Z
Sums and products 2022-03-27T21:57:38.410Z
My mistake about the war in Ukraine 2022-03-25T23:04:25.281Z
What are the best elementary math problems you know? 2022-03-20T17:18:28.373Z
Phase transitions and AGI 2022-03-17T17:22:06.518Z
Whence the determinant? 2022-03-13T19:38:25.743Z
Is there a good dataset for the moments of the income distribution throughout history? 2022-03-12T13:26:05.657Z
If your solution doesn't work, make it work 2022-03-11T16:10:51.479Z
Ambiguity causes conflict 2022-02-26T16:53:52.614Z
Computability and Complexity 2022-02-05T14:53:40.398Z
Retrospective forecasting 2022-01-30T16:38:17.723Z
Ege Erdil's Shortform 2022-01-09T11:47:31.016Z
What is a probabilistic physical theory? 2021-12-25T16:30:27.331Z
Laplace's rule of succession 2021-11-23T15:48:47.719Z
Equity premium puzzles 2021-11-16T20:50:16.959Z

Comments

Comment by Ege Erdil (ege-erdil) on Processor clock speeds are not how fast AIs think · 2024-01-31T00:24:31.508Z · LW · GW

It is not a tautology.

Can you explain to me the empirical content of the claim, then? I don't understand what it's supposed to mean.

About the rest of your comment, I'm confused about why you're discussing what happens when both chess engines and humans have a lot of time to do something. For example, what's the point of this statement?

My understanding is that it is not true that if you ran computers for a long time that they would beat the human also running for a long time, and that historically, it's been quite the opposite...

I don't understand how this statement is relevant to any claim I made in my comment. Humans beating computers at equal time control is perfectly consistent with the computers being slower than humans. If you took a human and slowed them down by a factor of 10, that's the same pattern you would see.

Are you instead trying to find examples of tasks where computers were beaten by humans when given a short time to do the task but could beat the humans when given a long time to do the task? That's a very different claim from "in every case where we've successfully gotten AI to do a task at all, AI has done that task far far faster than humans".

Comment by Ege Erdil (ege-erdil) on Processor clock speeds are not how fast AIs think · 2024-01-30T22:52:06.749Z · LW · GW

Yes, that's what I'm trying to say, though I think in actual practice the numbers you need would have been much smaller for the Go AIs I'm talking about than they would be for the naive tree search approach.

Comment by Ege Erdil (ege-erdil) on Processor clock speeds are not how fast AIs think · 2024-01-30T14:32:38.620Z · LW · GW

Sure, but in that case I would not say the AI thinks faster than humans, I would say the AI is faster than humans at a specific range of tasks where the AI can do those tasks in a "reasonable" amount of time.

As I've said elsewhere, there is a quality or breadth vs serial speed tradeoff in ML systems: a system that only does one narrow and simple task can do that task at a high serial speed, but as you make systems more general and get them to handle more complex tasks, serial speed tends to fall. The same logic that people are using to claim GPT-4 thinks faster than humans should also lead them to think a calculator thinks faster than GPT-4, which is an unproductive way to use the one-dimensional abstraction of "thinking faster vs. slower".

You might ask "Well, why use that abstraction at all? Why not talk about how fast the AIs can do specific tasks instead of trying to come up with some general notion of if their thinking is faster or slower?" I think a big reason is that people typically claim the faster "cognitive speed" of AIs can have impacts such as "accelerating the pace of history", and I'm trying to argue that the case for such an effect is not as trivial to make as some people seem to think.

Comment by Ege Erdil (ege-erdil) on Processor clock speeds are not how fast AIs think · 2024-01-30T12:52:03.521Z · LW · GW

True, but isn't this almost exactly analogously true for neuron firing speeds? The corresponding period for neurons (10 ms - 1 s) does not generally correspond to the timescale of any useful cognitive work or computation done by the brain.

Yes, which is why you should not be using that metric in the first place.

But even the top-line number is (at least theoretically) a very concrete measure of something that you can actually get out of the system. In contrast, when used in "computational equivalence" estimates of the brain, FLOP/s are (somewhat dubiously, IMO) repurposed as a measure of what the system is doing internally.

Will you still be saying this if future neural networks are running on specialized hardware that, much like the brain, can only execute forward or backward passes of a particular network architecture? I think talking about FLOP/s in this setting makes a lot of sense, because we know the capabilities of neural networks are closely linked to how much training and inference compute they use, but maybe you see some problem with this also?

So even if the 1e15 "computational equivalence" number is right, AND all of that computation is irreducibly a part of the high-level cognitive algorithm that the brain is carrying out, all that means is that it necessarily takes at least 1e15 FLOP/s to run or simulate a brain at neuron-level fidelity. It doesn't mean that you can't get the same high-level outputs of that brain through some other much more computationally efficient process.

I agree, but even if we think future software progress will enable us to get a GPT-4 level model with 10x smaller inference compute, it still makes sense to care about what inference with GPT-4 costs today. The same is true of the brain.

Separately, I think your sequential tokens per second calculation actually does show that LLMs are already "thinking" (in some sense) several OOM faster than humans? 50 tokens/sec is about 5 lines of code per second, or 18,000 lines of code per hour. Setting aside quality, that's easily 100x more than the average human developer can usually write (unassisted) in an hour, unless they're writing something very boilerplate or greenfield.

Yes, but they are not thinking 7 OOM faster. My claim is not that AIs can't think faster than humans; indeed, I think they can. However, current AIs are not thinking faster than humans when you take into account the "quality" of the thinking as well as the rate at which it happens, which is why I think FLOP/s is a more useful measure here than token latency. GPT-4 has higher token latency than GPT-3.5, but I think it's fair to say that GPT-4 is the model that "thinks faster" when asked to accomplish some nontrivial cognitive task.

The main issue with current LLMs (which somewhat invalidates this whole comparison) is that they can pretty much only generate boilerplate or greenfield stuff. Generating large volumes of mostly-useless / probably-nonsense boilerplate quickly doesn't necessarily correspond to "thinking faster" than humans, but that's mostly because current LLMs are only barely doing anything that can rightfully be called thinking in the first place.

Exactly, and the empirical trend is that there is a quality-token latency tradeoff: if you want to generate tokens at random, it's very easy to do that at extremely high speed. As you increase your demands on the quality you want these tokens to have, you must take more time per token to generate them. So it's not fair to compare a model like GPT-4 to the human brain on grounds of "token latency": I maintain that throughput comparisons (training compute and inference compute) are going to be more informative in general, though software differences between ML models and the brain can still make it not straightforward to interpret those comparisons.

Comment by Ege Erdil (ege-erdil) on Processor clock speeds are not how fast AIs think · 2024-01-30T01:35:29.318Z · LW · GW

Sure, but from the point of view of per token latency that's going to be a similar effect, no?

Comment by Ege Erdil (ege-erdil) on Processor clock speeds are not how fast AIs think · 2024-01-29T21:20:24.981Z · LW · GW

I think you might have accidentally linked to your comment instead of the LessWrong post you intended to link to.

Comment by Ege Erdil (ege-erdil) on Processor clock speeds are not how fast AIs think · 2024-01-29T19:54:12.584Z · LW · GW

Don't global clock speeds have to go down as die area goes up due to the speed of light constraint?

For instance, if you made a die with 1e15 MAC units and the area scaled linearly, you would be looking at a die that's ~ 2e9 times larger than H100's die size, which is about 1000 mm^2. The physical dimensions of such a die would be around 2 km^2, so the speed of light would limit global clock frequencies to something on the order of c/(1 km) ~= 300 kHz, which is not 1 million times faster than the 1 kHz you attribute to the human brain. If you need multiple round trips for a single clock, the frequencies will get even lower.
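For concreteness, here is the back-of-the-envelope arithmetic above as a short Python sketch (the per-die MAC count implied by the 2e9 scaling factor is a rough assumed figure, not a datasheet number):

```python
# Rough restatement of the calculation above; all inputs are order-of-magnitude estimates.
h100_die_area_mm2 = 1000        # approximate H100 die size
scale_factor = 2e9              # ~1e15 MAC units / ~5e5 MAC units per die (assumed)

total_area_mm2 = h100_die_area_mm2 * scale_factor
total_area_km2 = total_area_mm2 * 1e-12          # 1 km^2 = 1e12 mm^2
side_length_km = total_area_km2 ** 0.5

c_m_per_s = 3e8                 # speed of light
# One light-crossing of the ~1.4 km die per tick caps the global clock frequency:
max_clock_hz = c_m_per_s / (side_length_km * 1e3)

print(f"area ~{total_area_km2:.1f} km^2, side ~{side_length_km:.1f} km, "
      f"global clock <~ {max_clock_hz / 1e3:.0f} kHz")
# -> area ~2.0 km^2, side ~1.4 km, global clock <~ 212 kHz (order 300 kHz, as above)
```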

Maybe when the clock frequencies get this low, you're dissipating so little heat that you can go 3D without worrying too much about heating issues and that buys you something. Still, your argument here doesn't seem that obvious to me, especially if you consider the fact that one round trip for one clock is extremely optimistic if you're trying to do all MACs at once. Remember that GPT-3 is a sequential model; you can't perform all the ops in one clock because later layers need to know what the earlier layers have computed.

Overall I think your comment here is quite speculative. It may or may not be true, I think we'll see, but people shouldn't treat it as if this is obviously something that's feasible to do.

Comment by Ege Erdil (ege-erdil) on Processor clock speeds are not how fast AIs think · 2024-01-29T19:37:40.391Z · LW · GW

I think counterexamples are easy to find. For example, chess engines in 1997 could play at the level of top human chess players on consumer hardware, but only if they were given orders of magnitude more time to think than the top humans had available. Around 1997 Deep Blue was of a similar strength to Kasparov, but it had to run on a supercomputer; on commercial hardware chess engines were still only 2400-2500 elo. If you ran them for long enough, though, they would obviously be stronger than even Deep Blue was.

I think the claim that "in every case where we've successfully gotten AI to do a task at all, AI has done that task far far faster than humans" is a tautology because we only say we've successfully gotten AI to do a task when AI can beat the top humans at that task. Nobody said "we got AI to play Go" when AI Go engines were only amateur dan strength, even though they could have equally well said "we got AI to play Go at a superhuman level but it's just very slow".

A non-tautological version might say that the decrease over time in the compute multiplier the AIs need to compete with the top humans is steep, so it takes a short time for the AIs to transition from "much slower than humans" to "much faster than humans" when they are crossing the "human threshold". I think there's some truth to this version of the claim, but it's not really due to any advantage in serial speed on the part of the AIs.

Comment by Ege Erdil (ege-erdil) on Processor clock speeds are not how fast AIs think · 2024-01-29T16:01:25.851Z · LW · GW

If there are people who say “current AIs think many orders of magnitude faster than humans”, then I agree that those people are saying something kinda confused and incoherent, and I am happy that you are correcting them.

Eliezer himself has said (e.g. in his 2010 debate with Robin Hanson) that one of the big reasons he thinks CPUs can beat brains is because CPUs run at 1 GHz while brains run at 1-100 Hz, and the only barrier is that the CPUs are currently running "spreadsheet algorithms" and not the algorithm used by the human brain. I can find the exact timestamp from the video of the debate if you're interested, but I'm surprised you've never heard this argument from anyone before.

There’s a different claim, “we will sooner or later have AIs that can think and act at least 1-2 orders of magnitude faster than a human”. I see that claim as probably true, although I obviously can’t prove it.

I think this claim is too ill-defined to be true, unfortunately, but insofar as it has the shape of something I think will be true it will be because of throughput or software progress and not because of latency.

I agree that the calculation “1 GHz clock speed / 100 Hz neuron firing rate = 1e7” is not the right calculation (although it’s not entirely irrelevant). But I am pretty confident about the weaker claim of 1-2 OOM, given some time to optimize the (future) algorithms.

If the claim here is that "for any task, there will be some AI system using some unspecified amount of inference compute that does the task 1-2 OOM faster than humans", I would probably agree with that claim. My point is that if this is true, it won't be because of the calculation “1 GHz clock speed / 100 Hz neuron firing rate = 1e7”, which as far as I can tell you seem to agree with.

Comment by Ege Erdil (ege-erdil) on Processor clock speeds are not how fast AIs think · 2024-01-29T15:55:00.789Z · LW · GW

As far as I know, in every case where we've successfully gotten AI to do a task at all, AI has done that task far far faster than humans. When we had computers that could do arithmetic but nothing else, they were still much faster at arithmetic than humans. Whatever your view on the quality of recent AI-generated text or art, it's clear that AI is producing it much much faster than human writers or artists can produce text/art.

"Far far faster" is an exaggeration that conflates vastly different orders of magnitude with each other. When compared against humans, computers are many orders of magnitude faster at doing arithmetic than they are at generating text: a human can write perhaps one word per second when typing quickly, while an LLM's serial speed of 50 tokens/sec maybe corresponds to 20 words/sec or so. That's just a ~ 1.3 OOM difference, to be contrasted with 10 OOMs or more at the task of multiplying 32-bit integers, for instance. Are you not bothered at all by how wide the chasm between these two quantities seems to be, and whether it might be a problem for your model of this situation?

In addition, we know that this could be faster if we were willing to accept lower quality outputs, for example by having fewer layers in an LLM. There is a quality-serial speed tradeoff, and so ignoring quality and just looking at the speed at which text is generated is not a good thing to be doing. There's a reason GPT-3.5 has smaller per token latency than GPT-4.
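Writing out the order-of-magnitude comparison (the human arithmetic speed below is my own rough assumption; the text-generation numbers are the ones above):

```python
import math

# Text generation: ~1 word/s for a fast human typist vs ~20 words/s for an LLM
# generating ~50 tokens/s.
text_gap_oom = math.log10(20 / 1)

# 32-bit multiplication: a modern chip does on the order of 1e9 multiplies per
# second per core, while a human might need ~30 s with pen and paper (assumption).
arithmetic_gap_oom = math.log10(1e9 / (1 / 30))

print(f"text: ~{text_gap_oom:.1f} OOM, arithmetic: ~{arithmetic_gap_oom:.1f} OOM")
# -> text: ~1.3 OOM, arithmetic: ~10.5 OOM
```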

Comment by Ege Erdil (ege-erdil) on AI Timelines · 2023-11-29T20:27:37.047Z · LW · GW

Yes, this summary seems accurate.

Comment by Ege Erdil (ege-erdil) on What did you change your mind about in the last year? · 2023-11-24T20:04:39.393Z · LW · GW

I thought cryonics was unlikely to work because a bunch of information might be lost even at the temperatures that bodies are usually preserved in. I now think this effect is most likely not serious and cryonics can work in principle at the temperatures we use, but present-day cryonics is still unlikely to work because of how much tissue damage the initial process of freezing can do.

Comment by ege-erdil on [deleted post] 2023-11-13T11:05:03.790Z

As I said, I think it's not just that the language is poetic. There is an implicit inference that goes like

  1. People who would not voluntarily undergo surgery without long-term adverse effects on their health to improve the life of a stranger are evil.
  2. Most researchers who would be in a position to know the state of the evidence on the long-term adverse health effects of kidney donation don't personally donate one of their kidneys.
  3. Most researchers are unlikely to be evil.
  4. So it's unlikely that most researchers believe kidney donation has no long-term adverse health effects.

I'm saying that there is no definition of the word "evil" that makes statements (1) and (3) simultaneously true. Either you adopt a narrow definition, in which case (3) is true but (1) is false; or you adopt a broad definition, in which case (1) is true but (3) is false.

This is not a point about stylistic choices, it's undermining one of the key arguments the post offers for its position. The post is significantly stronger if it can persuade us that even established experts in the field agree with the author and the hypothesis being advanced is in some sense "mainstream", even if it's implicitly held.

Comment by ege-erdil on [deleted post] 2023-11-12T22:15:14.343Z

I don't think it's a matter of poetic license. You're making an empirical claim that if specialists actually believed kidney donation had no long-term side effects, they would be lining up to donate their kidneys and we would see a much higher rate of kidney donations in the US. I think this claim is wrong because the inconvenience of surgery is substantial enough to block people from donating their kidneys even in the absence of long-term side effects.

The use of the word "evil" sneaks in an assumption that most people would be happy to make this tradeoff to improve a stranger's life at the cost of some inconvenience to themselves, but I think this claim is actually false. So the fact that this doesn't happen gives very little evidence that specialists don't take claims about the small long-term health effects of kidney donation seriously.

Comment by ege-erdil on [deleted post] 2023-11-12T20:58:10.875Z

Have most of the researchers looking at kidney donation donated a kidney? Have most nephrology researchers donated a kidney? Most surgeons doing kidney transplants? Obviously not, otherwise we’d have more than 200 donations to strangers each year in the US. There are 10,000 board-certified nephrologists, and a few more hundred are added each year, if they took this data seriously they’d all donate.

Heck, on top of those you can add nephrology researchers, the medical statisticians who happen to focus on kidney disease, transplant surgeons, and all well-informed nurses in the nephrology units… thousands of these specialists are created each year. If most of them believed donation to be essentially safe the shortage of kidneys would be half-solved.

Maybe they are all evil people? They will not take even a marginal risk to save a life.

We usually don't call people "evil" for not inconveniencing themselves by going through surgery, so you seem to be using this word in a fairly non-standard way here.

Just to elaborate: if I had a condition that could be cured either by having an operation equivalent to a kidney donation or paying $3k, I would almost certainly pay $3k. However, I could likely save a statistical life by donating this $3k to an effective charity. So my not donating my kidney to a stranger provides no more evidence of my evil nature than my not donating $3k to save the life of a random stranger, because I price the inconvenience of the surgery at more than $3k even if the surgery has no long-run consequences for my health.

Comment by ege-erdil on [deleted post] 2023-11-12T20:50:00.325Z

There is more data, and better data, e.g. data gathered in double-blinded RCTs, that shows things like:

  1. Homeopathy works very well for a variety of conditions, sometimes better than real drugs used to treat them.
  2. Increasing the healthcare budget and the amount of healthcare people receive, both in rich countries (e.g. USA) and poor ones (India), has no effect on mortality.

I can make both of these claims based on many individual RCTs, as well as based on the aggregation of all existing RCTs.

I’m not saying that these claims make sense, they don’t, there are critical lenses through which we analyze research. But if you claim to “just follow the data”, and ignore the issue of data quality, selection bias, and fraud… without applying a critical lens, you are lost.

It seems to me like claim (2) could easily make sense if you interpret it more charitably as "the mortality effects are too small for the studies to detect". I don't have a particularly strong prior that marginal healthcare spending is all that useful for increasing life expectancy - diminishing returns can mean that the average dollar spent on healthcare does much more than the marginal dollar.

Can you justify your claim that (2) does not make sense?

Comment by Ege Erdil (ege-erdil) on AI Timelines · 2023-11-12T17:14:37.653Z · LW · GW

I don't think those ratings are comparable. On the other hand, my estimate of 3d was apparently lowballing it based on some older policy networks, and newer ones are perhaps as strong as 4d to 6d, which on the upper end is still weaker than professional players but not by much.

However, there is a big gap between weak professional players and "grandmaster level", and I don't think the raw policy network of AlphaGo could play competitively against a grandmaster level Go player.

Comment by Ege Erdil (ege-erdil) on AI Timelines · 2023-11-12T15:22:06.911Z · LW · GW

This is not quite true. Raw policy networks of AlphaGo-like models are often at a level around 3 dan in amateur rankings, which would qualify as a good amateur player but nowhere near the equivalent of grandmaster level. If you match percentiles in the rating distributions, 3d in Go is perhaps about as strong as an 1800 elo player in chess, while "master level" is at least 2200 elo and "grandmaster level" starts at 2500 elo.

Edit: Seems like policy networks have improved since I last checked these rankings, and the biggest networks currently available for public use can achieve a strength of possibly as high as 6d without MCTS. That would be somewhat weaker than a professional player, but not by much. Still far off from "grandmaster level" though.

Comment by Ege Erdil (ege-erdil) on My thoughts on the social response to AI risk · 2023-11-02T17:09:42.162Z · LW · GW

I think you're ignoring the qualifier "literally portrayed" in Matthew's sentence, and neglecting the prior context that he's talking about AI development being something mainly driven forward by hobbyists with no outsized impacts.

He's talking about more than just the time in which AI goes from e.g. doubling the AI software R&D output of humans to some kind of singularity. The specific details Eliezer has given about this scenario have not been borne out: for example, in his 2010 debate with Robin Hanson, he emphasized a scenario in which a few people working in a basement and keeping all of their insights secret hit upon some key software innovation that enables their piece of consumer hardware to outcompete the rest of the world.

It's worth noting that Robin Hanson also said that "takeoff" is most likely to take months. He just said it for ems, and in his world, that rate of growth was being driven by the entire world economy working as a whole rather than one local part of the world having such better software that it could outcompete everyone else with vastly less material resources. I find you saying this is a "mild win" for Eliezer's prediction incomprehensible given that we live in a world where individual AI labs are being valued at ~ $100B and raising tens of billions of dollars in capital.

Comment by Ege Erdil (ege-erdil) on Inside Views, Impostor Syndrome, and the Great LARP · 2023-09-25T16:48:37.804Z · LW · GW

I assume John was referring to Unitary Evolution Recurrent Neural Networks which is cited in the "Orthogonal Deep Neural Nets" paper.

Comment by Ege Erdil (ege-erdil) on Memory bandwidth constraints imply economies of scale in AI inference · 2023-09-17T22:19:13.631Z · LW · GW

It might be right, I don't know. I'm just making a local counterargument without commenting on whether the 2.5 PB figure is right or not, hence the lack of endorsement. I don't think we know enough about the brain to endorse any specific figure, though 2.5 PB could perhaps fall within some plausible range.

Comment by Ege Erdil (ege-erdil) on Memory bandwidth constraints imply economies of scale in AI inference · 2023-09-17T18:25:42.831Z · LW · GW

While I wouldn't endorse the 2.5 PB figure itself, I would caution against this line of argument. It's possible for your brain to contain plenty of information that is not accessible to your memory. Indeed, we know of plenty of such cognitive systems in the brain whose algorithms are both sophisticated and inaccessible to any kind of introspection: locomotion and vision are two obvious examples.

Comment by Ege Erdil (ege-erdil) on The lost millennium · 2023-08-25T03:13:12.035Z · LW · GW

I downvoted this comment for its overconfidence.

First of all, the population numbers are complete garbage. This is completely circular. You are just reading out the beliefs about history used to fabricate them. The numbers are generated by people caring about the fall of Rome. The fall of Rome didn't cause a decline in China. Westerners caring about the fall of Rome caused the apparent decline in China.

I will freely admit that I don't know how population numbers are estimated in every case, but your analysis of the issue is highly simplistic. Estimates for population decline do not just depend on vague impressions of the significance of grand historical events such as the fall of Rome. Archaeological evidence, estimates of crop yields with contemporary technology on available farmland, surviving records from the time, etc. are all used in forming population estimates.

It's far from being reliable, but what we know seems clear enough that I would give something like 80% to 90% chance that the first millennium indeed had slower population growth than the first millennium BC. You can't be certain with such things, but I also don't agree that the numbers are "complete garbage" and contain no useful information.

Second, there was a tremendous scientific and technological regress in Rome. Not caused by the fall of Rome, but the rise of Rome. There was a continual regress in the Mediterranean from 150BC to at least 600AD. Just look at a list of scientists: it has a stark gap 150BC-50AD.

I think you're conflating a lack of progress with regression here. I remark in the post that the slowdown in population growth seems to have begun around 200 BC, which is consistent with what you're saying here if you take it as a statement about growth rates and not about levels. If the pace of new discoveries slows down, that would appear to us as fewer notable scientists as well as slower growth in population, sizes of urban centers, etc.

Aside from that, there are also many alternative explanations of a gap in a list of scientists, e.g. that Rome was comparatively less interested in funding fundamental research compared to the Hellenistic kingdoms. Progress in fundamental sciences doesn't always correlate so well with economic performance; e.g. the USSR was much better at fundamental science than their economic performance would suggest.

It is more controversial to say that the renaissance 50AD-150AD is a pale shadow of the Hellenistic period, but it is. In 145BC Rome fomented a civil war in Egypt, destroying Alexandria, the greatest center of learning. In 133BC, the king of Pergamon tried to avoid this fate by donating the second center of learning. It was peaceful, but science did not survive.

I don't know what you're referring to by "Rome fomented a civil war in Egypt in 145 BC". 145 BC is when Ptolemy VI died; but as far as I know, there was no single "civil war" following his death, Alexandria was not destroyed, and Rome was not involved directly in Egyptian politics for a long time to come. Alexandria remained one of the major urban centers of the Mediterranean until the 3rd century AD - perhaps even the largest one.

Comment by Ege Erdil (ege-erdil) on The lost millennium · 2023-08-25T00:13:32.759Z · LW · GW

Well, that's true, but at some level, what else could it possibly be? What other cause could be behind the long-run expansion in the first place, so many millennia after humans spanned every continent but Antarctica?

Technological progress being responsible for the long-run trend doesn't mean you can attribute local reversals to humans hitting limits to technological progress. Just as a silly example, the emergence of a new strain of plague could have led to the depopulation of urban centers, which lowers R&D efficiency because you lose concentrations of people working together, and thus lowers the rate of technological progress. I'm not saying this is what actually happened, but it seems like a possible story to me.

I'm very skeptical about explanations involving wars and plagues, except insofar as those impact technological development and infrastructure, because a handful of generations is plenty to get back to the Malthusian limit even if a majority of the population dies in some major event (especially regional events where you can then also get migration or invasion from less affected regions).

I agree, but why would you assume wars and plagues can't impact technological development and infrastructure?

Comment by Ege Erdil (ege-erdil) on The lost millennium · 2023-08-24T15:22:00.062Z · LW · GW

McEvedy and Jones actually discuss a regional breakdown in the final section of the book, but they speculate too much for the discussion to be useful, I think. They attribute any substantial slowdown in growth rates to population running up against technological limits, which seems like a just-so story that could explain anything.

They note that the 3rd century AD appears to have been a critical time, as it's when population growth trends reversed in both Europe and China at around the same time: in Europe with the Crisis of the Third Century, and in China with the fall of the reconstituted Han dynasty and the beginning of the Three Kingdoms period. They attribute this to technological constraints, which seems like an unsupported assertion to me.

The other important population center is India, where we have very few records compared to Europe and China. Datasets generally report naively extrapolated smooth curves for the Indian population before the modern period, and that's because there really isn't much else to do due to the scarcity of useful information. This doesn't mean that we actually expect population growth in India to have been smooth, just that in the absence of more information our best guess for each date should probably be a smoothly increasing function of the date. As McEvedy and Jones put it, "happy is the graph that has no history".

I agree that locations isolated from Eurasia would most likely not show the same population trends, but Eurasia was ~ 75% of the world's population in the first millennium and so events in Eurasia dominate what happens to the global population.

Comment by Ege Erdil (ege-erdil) on The lost millennium · 2023-08-24T09:35:50.621Z · LW · GW

I've actually written about this subject before, and I agree that the first plague pandemic could have been significant: perhaps killing around 8% of the global population in the four years from 541 to 544. However, it's also worth noting that our evidence for this decline is rather scant; we know that the death toll was very high in Constantinople but not much about what happened outside the capital, mostly because nobody was there to write it down. So it's also entirely conceivable that the death toll was much lower than this. The controversy about this continues to this day in the literature, as far as I know.

The hypothesis that the bubonic plague was responsible is interesting, but by itself doesn't explain the more granular data which suggests the slowdown starts around 200 BC and we already see close to no growth in global population from e.g. 200 AD to 500 AD. HYDE doesn't have this, but the McEvedy and Jones dataset does.

It's possible, and perhaps even likely, that the explanation is not monocausal. In this case, the first plague pandemic could have been one of the many factors that dragged population growth down throughout the first millennium.

Comment by Ege Erdil (ege-erdil) on The lost millennium · 2023-08-24T05:50:08.503Z · LW · GW

In the west, I think the fall of the Western Roman Empire was probably a significant hit, and caused a major setback in economic growth in Europe.

Attribution of causality is tricky with this event, but I would agree if you said the fall coincided with a major slowdown in European economic growth.

China had its bloody Three Kingdom period, and later the An Lushan rebellion.

I think a problem re: China is that a lot of population decline estimates for China are based on the official census, and as far as I know China didn't have a formal census before the Xin dynasty, and certainly not before unification in the 3rd century BC. So the fact that we don't see comparable population declines reported may just be an artifact of that measurement issue. We certainly see plenty of them in the second millennium.

There was the Muslim conquest of the Mediterranean, Persia and Pakistan, though I don't know if that was unusually bloody.

I haven't seen estimates of this that put it anywhere near the Mongol conquests, so I would assume not particularly bloody relative to what was to come later. I would also guess that the Islamic world probably saw significant population growth around that time.

These might be small fluctuations in the grand scheme of things or add up to a period of enough turmoil and strife in the most populous regions of the world to slow growth down.

Yeah, it's possible that this is the explanation, but if so it's rather hard to know because there's no principled way to compare events like these to analogs in other time periods.

Comment by Ege Erdil (ege-erdil) on A short calculation about a Twitter poll · 2023-08-18T16:02:51.728Z · LW · GW

Yeah, that's right. Fixed.

Comment by Ege Erdil (ege-erdil) on A short calculation about a Twitter poll · 2023-08-15T18:08:48.257Z · LW · GW

If people vote as if their individual vote determines the vote of a non-negligible fraction of the voter pool, then you only need (averaged over the whole population, so the value of the entire population is instead of , which seems much more realistic.

So voting blue can make sense for a sufficiently large coalition of "ordinary altruists" with who are able to pre-commit to their vote and think people outside the coalition might vote blue by mistake etc. rather than the "extraordinary altruists" we need in the original situation with . Ditto if you're using a decision theory where it makes sense to suppose such a commitment already exists when making your decision.

Comment by Ege Erdil (ege-erdil) on A short calculation about a Twitter poll · 2023-08-14T23:16:26.959Z · LW · GW

That would be questioning the assumption that your cost function as an altruist should be linear in the number of lives lost. I'm not sure why you would question this assumption, though; it seems rather unnatural to make this a concave function, which is what you would need for your logic to work.

Comment by Ege Erdil (ege-erdil) on When do "brains beat brawn" in Chess? An experiment · 2023-07-06T10:58:42.503Z · LW · GW

I'm surprised by how much this post is getting upvoted. It gives us essentially zero information about any question of importance, for reasons that have already been properly explained by other commenters:

  • Chess is not like the real world in important respects. What the threshold is for material advantage such that a 1200 elo player could beat Stockfish at chess tells us basically nothing about what the threshold is for humans, either individually or collectively, to beat an AGI in some real-world confrontation. This point is so trivial that I feel somewhat embarrassed to be making it, but I have to think that people are just not getting the message here.

  • Even focusing only on chess, the argument here is remarkably weak because Stockfish is not a system trained to beat weaker opponents with piece odds. There are Go AIs that have been trained for this kind of thing, e.g. KataGo can play reasonably well in positions with a handicap if you tell it that its opponent is much weaker than itself. In my experience, KataGo running on consumer hardware can give the best players in the world 3-4 stones and have an even game.

If someone could try to convince me that this experiment was not pointless and actually worth running for some reason, I would be interested to hear their arguments. Note that I'm more sympathetic to "this kind of experiment could be valuable if run in the right environment", and my skepticism is specifically about running it for chess.

Comment by Ege Erdil (ege-erdil) on What in your opinion is the biggest open problem in AI alignment? · 2023-07-03T23:17:25.124Z · LW · GW

Are neural networks trained using reinforcement learning from human feedback in a sufficiently complex environment biased towards learning the human simulator or the direct translator, in the sense of the ELK report?

I think there are arguments in both directions and it's not obvious which solution a neural network would prefer if trained in a sufficiently complex environment. I also think the question is central to how difficult we should expect aligning powerful systems trained in the current paradigm to be.

Comment by Ege Erdil (ege-erdil) on Automatic Rate Limiting on LessWrong · 2023-06-23T20:33:16.086Z · LW · GW

I'm curious if these rate limits were introduced as a consequence of some recent developments. Has the website been having more problems with spam and low-quality content lately, or has the marginal benefit of making these changes gone up in some other way?

It could also be that you had this idea only recently and in retrospect it had been a good idea for a long time, of course.

Comment by Ege Erdil (ege-erdil) on When is correlation transitive? · 2023-06-23T19:13:16.924Z · LW · GW

Yes, in practice having a model of what is actually driving the correlations can help you do better than these estimates. A causal model would be helpful for that.

The product estimate for the expected correlation is only useful in a setting where nothing else is known about the relationship between the three variables than the two correlations, but in practice you often have some beliefs about what drives the correlations you observe, and if you're a good Bayesian you should of course also condition on all of that.
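A minimal sketch of the constraint behind the product estimate: given only corr(A, B) and corr(B, C), positive semi-definiteness of the 3x3 correlation matrix confines corr(A, C) to an interval whose midpoint is exactly the product (the function and example values below are illustrative):

```python
import numpy as np

def corr_ac_bounds(r_ab: float, r_bc: float) -> tuple[float, float]:
    """Feasible range of corr(A, C) given corr(A, B) and corr(B, C),
    from positive semi-definiteness of the 3x3 correlation matrix."""
    slack = np.sqrt((1 - r_ab**2) * (1 - r_bc**2))
    return r_ab * r_bc - slack, r_ab * r_bc + slack

lo, hi = corr_ac_bounds(0.7, 0.7)
print(f"corr(A, C) in [{lo:.2f}, {hi:.2f}]; product estimate = {0.7 * 0.7:.2f}")
# -> corr(A, C) in [-0.02, 1.00]; product estimate = 0.49 (the midpoint)
```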

Comment by Ege Erdil (ege-erdil) on When is correlation transitive? · 2023-06-23T19:10:32.853Z · LW · GW

That's a reasonable picture to have in expectation, yeah.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-21T18:17:09.135Z · LW · GW

As an aside, I've tried to work out what the optimal learning rate for a large language model should be based on the theory in the post, and if I'm doing the calculations correctly (which is a pretty big if) it doesn't match actual practice very well, suggesting there is actually something important missing from this picture.

Essentially, the coefficient should be where is the variance of the per-parameter noise in SGD. If you have a learning rate , you scale the objective you're optimizing by a factor and the noise variance by a factor . Likewise, a bigger batch size lowers the noise variance by a linear factor. So the equilibrium distribution ends up proportional to

where is the per-token average loss and should be equal to the mean square of the partial derivative of the per-token loss function with respect to one of the neural network parameters. If the network is using some decent batch or layer normalization this should probably be where is the model size.

We want what's inside the exponential to just be , because we want the learning to be equivalent to doing a Bayesian update over the whole data. This suggests we should pick

which is a pretty bad prediction. So there's probably something important that's being left out of this model. I'm guessing that a smaller learning rate just means you end up conditioning on minimum loss and that's all you need to in practice, and larger learning rates cause problems with convergence.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-21T09:43:25.011Z · LW · GW

That's useful to know, thanks. Is anything else known about the properties of the noise covariance beyond "it's not constant"?

Some comments on the paper itself: if the problem is that SGD with homoskedastic Gaussian noise fails to converge to a stationary distribution, why don't they define SGD over a torus instead? Seems like it would fix the problem they are talking about, and if it doesn't change the behavior it means their explanation of what's going on is incorrect.

If the only problem is that with homoskedastic Gaussian noise convergence to a stationary distribution is slow (when a stationary distribution does exist), I could believe that. Similar algorithms such as Metropolis-Hastings also have pretty abysmal convergence rates in practice when applied to any kind of complicated problem. It's possible that SGD with batch noise has better regularization properties and therefore converges faster, but I don't think that changes the basic qualitative picture I present in the post.
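To make the torus suggestion concrete, here is a minimal sketch of what "defining SGD over a torus" could mean; the wrapping scheme and toy gradient are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_on_torus(grad_fn, w0, lr=1e-2, noise_std=0.1, steps=10_000, period=2 * np.pi):
    """Noisy SGD whose iterates are wrapped onto a torus, so the state space is
    compact and a stationary distribution exists."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w -= lr * (grad_fn(w) + noise_std * rng.standard_normal(w.shape))
        w = (w + period / 2) % period - period / 2  # wrap each coordinate
    return w

# Toy example: gradient of L(w) = sum(1 - cos(w_i)), a smooth loss on the torus.
print(sgd_on_torus(lambda w: np.sin(w), w0=[2.0, -1.5]))
```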

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-20T17:12:18.659Z · LW · GW

Check the Wikipedia section for the stationary distribution of the overdamped Langevin equation.

I should probably clarify that it's difficult to have a rigorous derivation of this claim in the context of SGD in particular, because it's difficult to show absence of heteroskedasticity in SGD residuals. Still, I believe that this is probably negligible in practice, and in principle this is something that can be tested by experiment.
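For reference, the standard result being pointed to, in the homoskedastic case (writing L for the loss and σ for the constant noise scale):

```latex
\text{Overdamped Langevin: } \; dw_t = -\nabla L(w_t)\, dt + \sigma\, dB_t
\quad\Longrightarrow\quad
p_\infty(w) \propto \exp\!\left(-\frac{2 L(w)}{\sigma^2}\right),
\qquad \beta = \frac{2}{\sigma^2}.
```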

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-20T12:39:40.040Z · LW · GW

Sure, I agree that I didn't put this information into the post. However, why do you need to know which is more likely to know anything about e.g. how neural networks generalize?

I understand that SLT has some additional content beyond what is in the post, and I've tried to explain how you could make that fit in this framework. I just don't understand why that additional content is relevant, which is why I left it out.

As an additional note, I wasn't really talking about floating point precision being the important variable here. I'm just saying that if you want -complexity to match the notion of real log canonical threshold, you have to discretize SLT in a way that might not be obvious at first glance, and in a way where some conclusions end up being scale-dependent. This is why if you're interested in studying this question of the relative contribution of singular points to the partition function, SLT is a better setting to be doing it in. At the risk of repeating myself, I just don't know why you would try to do that.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-20T08:52:57.733Z · LW · GW

You need to discretize the function before taking preimages. If you just take preimages in the continuous setting, of course you're not going to see any of the interesting behavior SLT is capturing.

In your case, let's say that we discretize the function space by choosing which one of the functions you're closest to for some . In addition, we also discretize the codomain of by looking at the lattice for some . Now, you'll notice that there's a radius disk around the origin which contains only functions mapping to the zero function, and as our lattice has fundamental area this means the "relative weight" of the singularity at the origin is like .

In contrast, all other points mapping to the zero function only get a relative weight of where is the absolute value of their nonzero coordinate. Cutting off the domain somewhere to make it compact and summing over all to exclude the disk at the origin gives for the total contribution of all the other points in the minimum loss set. So in the limit the singularity at the origin accounts for almost everything in the preimage of . The origin is privileged in my picture just as it is in the SLT picture.

I think your mistake is that you're trying to translate between these two models too literally, when you should be thinking of my model as a discretization of the SLT model. Because it's a discretization at a particular scale, it doesn't capture what happens as the scale is changing. That's the main shortcoming relative to SLT, but it's not clear to me how important capturing this thermodynamic-like limit is to begin with.

Again, maybe I'm misrepresenting the actual content of SLT here, but it's not clear to me what SLT says aside from this, so...

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-19T17:07:45.861Z · LW · GW

I'm not too sure how to respond to this comment because it seems like you're not understanding what I'm trying to say.

I agree there's some terminology mismatch, but this is inevitable because SLT is a continuous model and my model is discrete. If you want to translate between them, you need to imagine discretizing SLT, which means you discretize both the codomain of the neural network and the space of functions you're trying to represent in some suitable way. If you do this, then you'll notice that the worse a singularity is, the lower the -complexity of the corresponding discrete function will turn out to be, because many of the neighbors map to the same function after discretization.

The content that SLT adds on top of this is what happens in the limit where your discretization becomes infinitely fine and your dataset becomes infinitely large, but your model doesn't become infinitely large. In this case, SLT claims that the worst singularities dominate the equilibrium behavior of SGD, which I agree is an accurate claim. However, I'm not sure what this claim is supposed to tell us about how NNs learn. I can't make any novel predictions about NNs with this knowledge that I couldn't before.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-19T12:27:47.689Z · LW · GW

I don't think this representation of the theory in my post is correct. The effective dimension of the singularity near the origin is much higher, e.g. because near every other minimal point of this loss function the Hessian doesn't vanish, while for the singularity at the origin it does vanish. If you discretized this setup by looking at it with a lattice of mesh , say, you would notice that the origin is surrounded by many parameters that give nearly identical loss, while near other parts of the space the number of such parameters is far fewer.

The reason you have to do some kind of "translation" between the two theories is that SLT can see not just exactly optimal points but also nearly optimal points, and bad singularities are surrounded by many more nearly optimal points than better-behaved singularities. You can interpret the discretized picture above as the SLT picture seen at some "resolution" or "scale" , i.e. if you discretized the loss function by evaluating it on a lattice with mesh you get my picture. Of course, this loses the information of what happens as and in some thermodynamic limit, which is what you recover when you do SLT.

I just don't see what this thermodynamic limit tells you about the learning behavior of NNs that we didn't know before. We already know NNs approximate Solomonoff induction if the -complexity is a good approximation to Kolmogorov complexity and so forth. What additional information is gained by knowing what looks like as a smooth function as opposed to a discrete function?

In addition, the strong dependence of SLT on being analytic is bad, because analytic functions are rigid: their value in a small open subset determines their value globally. I can see why you need this assumption because quantifying what happens near a singularity becomes incredibly difficult for general smooth functions, but because of the rigidity of analytic functions the approximation that "we can just pretend NNs are analytic" is more pernicious than e.g. "we can just pretend NNs are smooth". Typical approximation theorems like Stone-Weierstrass also fail to save you because they only work in the sup-norm and that's completely useless for determining behavior at singularities. So I'm yet to be convinced that the additional details in SLT provide a more useful account of NN learning than my simple description above.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-19T10:21:32.324Z · LW · GW

Can you give an example of a loss function which has the mode of singularity you're talking about? I don't think I'm quite following what you're talking about here.

In SLT the loss is assumed analytic, so I don't understand how the Hessian can fail to be well-defined anywhere. It's possible that the Hessian vanishes at some point, suggesting that the singularity there is even worse than quadratic, e.g. at the origin or something like that. But even in this regime essentially the same logic is going to apply - the worse the singularity, the further away you can move from it without changing the value of the loss very much, and accordingly the singularity contributes more to the volume of the near-minimal set as the loss cutoff shrinks.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-18T17:46:40.583Z · LW · GW

Say that you have a loss function L(w) on the parameter space. The minimum loss set is probably not exactly the set where the gradient of L vanishes, but it has something to do with that, so let's pretend that it's exactly that for now.

This is a collection of equations - one vanishing partial derivative per parameter - that are generically independent and so should define a subset of dimension zero, i.e. a collection of points in the parameter space. However, there might be points at which the vanishing partial derivatives don't define independent equations, so there we get something of positive dimension.

In these cases, what happens is that the gradient itself has vanishing derivatives in some directions. In other words, the Hessian matrix fails to be of full rank. Say that this matrix has some rank r at a specific singular point, and consider the set of parameters whose loss is within a small cutoff ε of the minimum. Diagonalizing the Hessian will generically bring the loss locally into a combination of r quadratic terms and higher-order (cubic and beyond) terms in the remaining directions, and locally the volume that a neighborhood of the singular point contributes to this set scales with a smaller power of ε than it would near a full-rank point. The worse the singularity, the smaller the rank and the greater the volume contribution of the singularity to this set.

The worst singularities dominate the behavior at small ε because you can move "much further" along vectors where the loss scales in a cubic fashion than along directions where it scales in a quadratic fashion, so those dimensions are the only ones that "count" in some calculation when you compare singularities. The tangent space intuition doesn't apply directly here but something like that still applies, in the sense that the worse a singularity, the more directions you have to move away from it without changing the value of the loss very much.

Is this intuitive now? I'm not sure what more to do to make the result intuitive.
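As a concrete check of the volume claim above, here is a small Monte Carlo sketch; the two toy losses are illustrative assumptions, chosen so that one minimum is non-degenerate while the other has a Hessian that vanishes at the origin:

```python
import numpy as np

rng = np.random.default_rng(0)

def quadratic(w):
    # Non-degenerate minimum at the origin: full-rank Hessian.
    return w[:, 0] ** 2 + w[:, 1] ** 2

def degenerate(w):
    # Minimum set is the two coordinate axes; at the origin the Hessian vanishes.
    return (w[:, 0] * w[:, 1]) ** 2

w = rng.uniform(-1, 1, size=(2_000_000, 2))
for eps in (1e-2, 1e-3, 1e-4):
    frac_quad = np.mean(quadratic(w) < eps)   # shrinks roughly like eps
    frac_dgen = np.mean(degenerate(w) < eps)  # shrinks roughly like sqrt(eps) * log(1/eps)
    print(f"eps={eps:.0e}  quadratic: {frac_quad:.2e}  degenerate: {frac_dgen:.2e}")
```

As eps shrinks, the near-minimal volume of the degenerate loss falls off much more slowly than that of the quadratic one, which is the sense in which worse singularities "count" for more.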

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-18T16:58:11.268Z · LW · GW

I think this is a very nice way to present the key ideas. However, in practice I think the discretisation is actually harder to reason about than the continuous version. There are deeper problems, but I'd start by wondering how you would ever compute c(f) defined this way, since it seems to depend in an intricate way on the details of e.g. the floating point implementation.

I would say that the discretization is going to be easier for people with a computer science background to grasp, even though formally I agree it's going to be less pleasant to reason about or to do computations with. Still, if properties of NNs that only appear when they are viewed as continuous functions on a real parameter space were essential for their generalization, we might be in trouble as people keep lowering the precision of their floating point numbers. This explanation makes it clear that while assuming NNs are continuous (or even analytic!) might be useful for theoretical purposes, the claims about generalization hold just as well in a more realistic discrete setting.

I'll note that the volume codimension definition of the RLCT is essentially what you have written down here, and you don't need any mathematics beyond calculus to write that down. You only need things like resolutions of singularities if you actually want to compute that value, and the discretisation doesn't seem to offer any advantage there.

Yes, my definition is inspired by the volume codimension definition, though here we don't need to take a limit as some scale parameter goes to zero, because the counting measure makes our life easy. The problem you have in a smooth setting is that descending the Lebesgue measure in a dumb way to subspaces with positive codimension gives trivial results, so more care is necessary to recover and reason about the appropriate notions of volume.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-18T16:46:20.043Z · LW · GW

I don't think this is something that requires explanation, though. If you take an arbitrary geometric object in maths, a good definition of its singular points will be "points where the tangent space has higher dimension than expected". If this is the minimum set of a loss function and the tangent space has higher dimension than expected, that intuitively means that locally there are more directions you can move along without changing the loss function, probably suggesting that there are more directions you can move along without changing the function being implemented at all. So the function being implemented is simple, and the rest of the argument works as I outline it in the post.

I think I understand what you and Jesse are getting at, though: there's a particular behavior that only becomes visible in the smooth or analytic setting, which is that minima of the loss function that are more singular become increasingly dominant in the Boltzmann integral as the inverse temperature grows, as opposed to maintaining just the same constant dominance factor. You don't see this in the discrete case because there's a finite nonzero gap in loss between first-best and second-best fits, and so the second-best fits are exponentially punished in the limit and become irrelevant, while in the singular case any first-best fit has some second-best "space" surrounding it whose volume is more concentrated towards the singularity point.

While I understand that, I'm not too sure what predictions you would make about the behavior of neural networks on the basis of this observation. For instance, if this smooth behavior is really essential to the generalization of NNs, wouldn't we predict that generalization would become worse as people switch to lower precision floating point numbers? I don't think that prediction would have held up very well if someone had made it 5 years ago.
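For reference, the "more dominant as $n \to \infty$" behavior can be written out explicitly (these are standard asymptotics, not text from the original exchange; $\Delta$ is the loss gap to the second-best fit, $\lambda$ the RLCT of the minimum set, $m$ its multiplicity, and the inverse temperature is set to 1). In the discrete case,

$$\sum_{w \in \mathcal{W}} e^{-n L(w)} \;=\; e^{-n L_{\min}}\Big(\big|\{w : L(w) = L_{\min}\}\big| + O\big(e^{-n \Delta}\big)\Big),$$

so each exact minimizer keeps an $n$-independent weight and everything else is exponentially suppressed, while in the continuous case

$$\int e^{-n L(w)}\, dw \;\sim\; C\, e^{-n L_{\min}}\, n^{-\lambda} (\log n)^{m-1},$$

so the minima with the smallest $\lambda$, i.e. the most singular ones, pick up a polynomially growing advantage as $n \to \infty$.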

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-18T16:21:23.905Z · LW · GW

To me that just sounds like you're saying the integral is dominated by the contribution of the simplest functions that are of minimum loss, and the contribution factor scales like $n^{-\lambda}$, where $\lambda$ is the effective dimensionality near the singularity representing this function, equivalently the complexity of said function. That's exactly what I'm saying in my post - where is the added content here?

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-18T16:08:43.373Z · LW · GW

None of this is specific to singular learning theory. The basic idea that the parameter-function map might be degenerate and biased towards simple functions predates SLT (at least this most recent wave of interest in its application to neural nets anyway) and indeed goes back to the 90s, no algebraic geometry required.

Sure, I'm aware that people have expressed these ideas before, but I have trouble understanding what is added by singular learning theory on top of this description. To me, much of singular learning theory looks like trying to do these kinds of calculations in an analytic setting where things become quite a bit more complicated, for example because you no longer have the basic counting function to measure the effective dimensionality of a singularity, forcing you to reach for concepts like "real log canonical threshold" instead.

As far as I can tell, the non-trivial content of SLT is that the averaging over parameters with a given loss is dominated by singular points in the limit because volume clusters there as you take an ever-narrower interval around the minimum set.

I'm not sure why we should expect that beyond the argument I already give in the post. The geometry of the loss landscape is already fully accounted for by the Boltzmann factor; what else does singular learning theory add here?

Maybe this is also what you're confused about when you say "I don't see a mechanism by which SGD is supposed to be attracted to such points".

Comment by Ege Erdil (ege-erdil) on DSLT 0. Distilling Singular Learning Theory · 2023-06-18T15:09:31.738Z · LW · GW

I'm kind of puzzled by the amount of machinery that seems to be going into these arguments, because it seems to me that there is a discrete analog of the same arguments which is probably both more realistic (as neural networks are not actually continuous, especially with people constantly decreasing the precision of the floating point numbers used in implementation) and simpler to understand.

Suppose you represent a neural network architecture as a map $A : \mathcal{W} \to \mathcal{F}$, where $\mathcal{W}$ is a finite set of possible parameter settings and $\mathcal{F}$ is the set of all possible computable functions from the input space to the output space you're considering. In thermodynamic terms, we could identify elements of $\mathcal{W}$ as "microstates" and the corresponding functions in $\mathcal{F}$ that the NN architecture maps them to as "macrostates".

Furthermore, suppose that $\mathcal{F}$ comes together with a loss function $L : \mathcal{F} \to \mathbb{R}$ evaluating how good or bad a particular function is. Assume you optimize this loss using something like stochastic gradient descent on the composite function $L \circ A$ with a particular learning rate.

Then, in general, we have the following results:

  • SGD defines a Markov chain structure on the space $\mathcal{W}$ whose stationary distribution is proportional to $e^{-\beta L(A(w))}$ on parameters $w$, for some positive constant $\beta$. This is just a basic fact about the Langevin dynamics that SGD would induce in such a system.
  • In general $A$ is not injective, and we can define the "$A$-complexity" of any function $f \in \mathcal{F}$ as $c_A(f) = \log_2 |\mathcal{W}| - \log_2 |A^{-1}(f)|$. Then, the probability that we arrive at the macrostate $f$ is going to be proportional to $2^{-c_A(f)}\, e^{-\beta L(f)}$ (see the toy sketch below).
  • When $L$ is some kind of negative log-likelihood, this approximates Solomonoff induction in a tempered Bayes paradigm insofar as the $A$-complexity $c_A(f)$ is a good approximation for the Kolmogorov complexity of the function $f$, which will happen if the function approximator defined by $A$ is sufficiently well-behaved.
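Here is a toy numerical check of that counting claim (entirely my own construction: the bit-string parameter space, the "number of ones" macrostate map, the loss, the temperature, and the Metropolis chain standing in for noisy SGD are all illustrative assumptions). The empirical visit frequencies of macrostates under a chain whose stationary distribution is the Boltzmann distribution on parameters should match $|A^{-1}(f)|\, e^{-\beta L(f)}$ up to normalization:

```python
import numpy as np
from collections import Counter
from math import comb

rng = np.random.default_rng(0)

# Toy "architecture": microstates are 10-bit parameter vectors, and the
# macrostate A(w) is the number of 1-bits, so many parameter settings
# implement the same "function" and A is far from injective.
N_BITS = 10
def A(w):
    return int(w.sum())

# An arbitrary loss on macrostates (minimized at macrostate 3) and an
# inverse temperature beta; both are illustrative choices.
L = np.abs(np.arange(N_BITS + 1) - 3).astype(float)
beta = 1.0

# Metropolis chain on parameter space as a stand-in for noisy SGD: its
# stationary distribution is proportional to exp(-beta * L(A(w))).
w = rng.integers(0, 2, N_BITS)
counts = Counter()
n_steps = 200_000
for _ in range(n_steps):
    i = rng.integers(N_BITS)
    w_new = w.copy()
    w_new[i] ^= 1
    if rng.random() < np.exp(-beta * (L[A(w_new)] - L[A(w)])):
        w = w_new
    counts[A(w)] += 1

# Counting prediction: P(f) proportional to |A^{-1}(f)| * exp(-beta * L(f)),
# with |A^{-1}(f)| = C(N_BITS, f) for this toy map.
pred = np.array([comb(N_BITS, f) * np.exp(-beta * L[f]) for f in range(N_BITS + 1)])
pred /= pred.sum()

for f in range(N_BITS + 1):
    emp = counts[f] / n_steps
    print(f"macrostate {f:2d}: empirical={emp:.3f}  predicted={pred[f]:.3f}")
```

Running this, the empirical and predicted columns roughly agree, which is just the statement that a macrostate's probability picks up a factor counting how many microstates implement it.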

Is there some additional content of singular learning theory that goes beyond the above insights?

Edit: I've converted this comment to a post, which you can find here.

Comment by Ege Erdil (ege-erdil) on Are Bayesian methods guaranteed to overfit? · 2023-06-18T09:04:27.473Z · LW · GW

A tangential question: Does the overfitting issue from Bayesian statistics have an analog in Bayesian epistemology, i.e. when we only deal with propositional subjective degrees of belief, not with random variables and models?

I think the problem is the same in both cases. Roughly speaking, there is some "appropriate amount" of belief updating to try to fit your experiences, and this appropriate amount is described by Bayes' rule under ideal conditions where

  • it's computationally feasible to perform the full Bayesian update, and
  • the correct model is within the class of models you're performing the update over.

If either of these is not true, then in general you don't know which update is good. If your class of models is particularly bad, it can be preferable to stick to an ignorance prior and perform no update at all.

Asymptotically, all update rules within the tempered Bayes paradigm (Bayes, but with likelihoods raised to an exponent that's not in general equal to 1) converge to maximum likelihood estimation in a stationary environment (i.i.d. samples and such): you have guarantees of eventually landing in the part of your model space which has minimal KL divergence with the true data-generating process. However, this is an asymptotic guarantee, so it doesn't necessarily tell us what we should be doing when our sample is finite. Moreover, the guarantee is no longer valid if the data-generating process is not stationary, e.g. if you're drawing one long string of correlated samples from a distribution instead of many independent samples.
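For concreteness, the tempered update being described is (standard form; the exponent symbol $\eta$ is my choice of notation)

$$p(\theta \mid x_{1:n}) \;\propto\; p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)^{\eta},$$

with $\eta = 1$ recovering ordinary Bayes; under the usual regularity conditions and i.i.d. data, for any fixed $\eta > 0$ the posterior concentrates on the models minimizing $D_{\mathrm{KL}}\big(q \,\|\, p(\cdot \mid \theta)\big)$, where $q$ is the true data-generating distribution.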

Using Bayes' rule at least gets the right credence ratios between the different models you're considering, but it's not clear if this is optimal from the point of view of e.g. an agent trying to maximize expected utility in an environment.

I think in practice the way people deal with these problems is to use a "lazily evaluated" version of the Bayesian paradigm. They start with an initial class of models $\mathcal{M}_1$, and perform usual Bayes until they notice that none of the models in $\mathcal{M}_1$ seem to fit the data very well. They then search for an expanded class of models $\mathcal{M}_2$ which can still fit the data well while trying to balance between the increased dimensionality of the models in $\mathcal{M}_2$ and their better fit with data, and if a decent match is found, they keep using $\mathcal{M}_2$ from that point on, etc.
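A minimal sketch of this "lazily evaluated" loop (my own illustrative construction; the polynomial model classes, the residual-variance trigger, and the BIC-style penalty are all assumptions rather than anything prescribed in the comment):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a quadratic trend with noise; the "true" model is
# deliberately outside the initial model class.
x = np.linspace(-1, 1, 80)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 0.2, x.size)

def fit_and_score(degree):
    """Fit a degree-d polynomial by least squares and return a BIC-style
    score trading off fit against dimensionality (lower is better),
    along with the residual variance."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    n, k = x.size, degree + 1
    sigma2 = resid.var()
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * log_lik + k * np.log(n), sigma2

# "Lazily evaluated" loop: stay with the current model class until it
# visibly fails to fit, then expand and re-check.
degree = 1                      # initial class M_1: linear models
score, sigma2 = fit_and_score(degree)
while sigma2 > 0.1:             # crude "none of these fit well" trigger
    candidate = degree + 1      # expanded class M_2, M_3, ...
    cand_score, cand_sigma2 = fit_and_score(candidate)
    if cand_score < score:      # better fit justifies the extra dimensions
        degree, score, sigma2 = candidate, cand_score, cand_sigma2
    else:
        break

print(f"settled on polynomial degree {degree}, residual variance {sigma2:.3f}")
```

The point of the sketch is only the control flow: stay inside the current class until it visibly fails, then pay for extra dimensions only when the improved fit justifies them.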