Memory bandwidth constraints imply economies of scale in AI inference 2023-09-17T14:01:34.701Z
The lost millennium 2023-08-24T03:48:40.035Z
Efficiency and resource use scaling parity 2023-08-21T00:18:01.243Z
Is Chinese total factor productivity lower today than it was in 1956? 2023-08-18T22:33:50.560Z
A short calculation about a Twitter poll 2023-08-14T19:48:53.018Z
Should you announce your bets publicly? 2023-07-04T00:11:11.386Z
When is correlation transitive? 2023-06-23T16:09:56.369Z
My impression of singular learning theory 2023-06-18T15:34:27.249Z
Are Bayesian methods guaranteed to overfit? 2023-06-17T12:52:43.987Z
Power laws in Speedrunning and Machine Learning 2023-04-24T10:06:35.332Z
Revisiting algorithmic progress 2022-12-13T01:39:19.264Z
Brun's theorem and sieve theory 2022-12-02T20:57:39.956Z
Nash equilibria of symmetric zero-sum games 2022-10-27T23:50:23.583Z
A conversation about Katja's counterarguments to AI risk 2022-10-18T18:40:36.543Z
Do anthropic considerations undercut the evolution anchor from the Bio Anchors report? 2022-10-01T20:02:48.464Z
Variational Bayesian methods 2022-08-25T20:49:55.415Z
The Reader's Guide to Optimal Monetary Policy 2022-07-25T15:10:51.010Z
A time-invariant version of Laplace's rule 2022-07-15T19:28:15.877Z
Forecasts are not enough 2022-06-30T22:00:53.861Z
What's up with the font size in the Markdown text editor? 2022-05-14T21:12:20.812Z
Report likelihood ratios 2022-04-23T17:10:22.891Z
Fixed points and free will 2022-04-19T17:18:01.318Z
How path-dependent are human values? 2022-04-15T09:34:23.280Z
Underappreciated content on LessWrong 2022-04-11T17:40:15.487Z
Hyperbolic takeoff 2022-04-09T15:57:16.098Z
Best informative videos on the Internet 2022-04-04T17:28:15.918Z
Optional stopping 2022-04-02T13:58:49.130Z
Sums and products 2022-03-27T21:57:38.410Z
My mistake about the war in Ukraine 2022-03-25T23:04:25.281Z
What are the best elementary math problems you know? 2022-03-20T17:18:28.373Z
Phase transitions and AGI 2022-03-17T17:22:06.518Z
Whence the determinant? 2022-03-13T19:38:25.743Z
Is there a good dataset for the moments of the income distribution throughout history? 2022-03-12T13:26:05.657Z
If your solution doesn't work, make it work 2022-03-11T16:10:51.479Z
Ambiguity causes conflict 2022-02-26T16:53:52.614Z
Computability and Complexity 2022-02-05T14:53:40.398Z
Retrospective forecasting 2022-01-30T16:38:17.723Z
Ege Erdil's Shortform 2022-01-09T11:47:31.016Z
What is a probabilistic physical theory? 2021-12-25T16:30:27.331Z
Laplace's rule of succession 2021-11-23T15:48:47.719Z
Equity premium puzzles 2021-11-16T20:50:16.959Z


Comment by Ege Erdil (ege-erdil) on Inside Views, Impostor Syndrome, and the Great LARP · 2023-09-25T16:48:37.804Z · LW · GW

I assume John was referring to "Unitary Evolution Recurrent Neural Networks", which is cited in the "Orthogonal Deep Neural Nets" paper.

Comment by Ege Erdil (ege-erdil) on Memory bandwidth constraints imply economies of scale in AI inference · 2023-09-17T22:19:13.631Z · LW · GW

It might be right, I don't know. I'm just making a local counterargument without commenting on whether the 2.5 PB figure is right or not, hence the lack of endorsement. I don't think we know enough about the brain to endorse any specific figure, though 2.5 PB could perhaps fall within some plausible range.

Comment by Ege Erdil (ege-erdil) on Memory bandwidth constraints imply economies of scale in AI inference · 2023-09-17T18:25:42.831Z · LW · GW

While I wouldn't endorse the 2.5 PB figure itself, I would caution against this line of argument. It's possible for your brain to contain plenty of information that is not accessible to your memory. Indeed, we know of plenty of such cognitive systems in the brain whose algorithms are both sophisticated and inaccessible to any kind of introspection: locomotion and vision are two obvious examples.

Comment by Ege Erdil (ege-erdil) on The lost millennium · 2023-08-25T03:13:12.035Z · LW · GW

I downvoted this comment for its overconfidence.

First of all, the population numbers are complete garbage. This is completely circular. You are just reading out the beliefs about history used to fabricate them. The numbers are generated by people caring about the fall of Rome. The fall of Rome didn't cause a decline in China. Westerners caring about the fall of Rome caused the apparent decline in China.

I will freely admit that I don't know how population numbers are estimated in every case, but your analysis of the issue is highly simplistic. Estimates for population decline do not just depend on vague impressions of the significance of grand historical events such as the fall of Rome. Archaeological evidence, estimates of crop yields with contemporary technology on available farmland, surviving records from the time, etc. are all used in forming population estimates.

It's far from being reliable, but what we know seems clear enough that I would give something like 80% to 90% chance that the first millennium indeed had slower population growth than the first millennium BC. You can't be certain with such things, but I also don't agree that the numbers are "complete garbage" and contain no useful information.

Second, there was a tremendous scientific and technological regress in Rome. Not caused by the fall of Rome, but the rise of Rome. There was a continual regress in the Mediterranean from 150BC to at least 600AD. Just look at a list of scientists: it has a stark gap 150BC-50AD.

I think you're conflating a lack of progress with regression here. I remark in the post that the slowdown in population growth seems to have begun around 200 BC, which is consistent with what you're saying here if you take it as a statement about growth rates and not about levels. If the pace of new discoveries slows down, that would appear to us as fewer notable scientists as well as slower growth in population, sizes of urban centers, etc.

Aside from that, there are also many alternative explanations of a gap in a list of scientists, e.g. that Rome was comparatively less interested in funding fundamental research compared to the Hellenistic kingdoms. Progress in fundamental sciences doesn't always correlate so well with economic performance; e.g. the USSR was much better at fundamental science than their economic performance would suggest.

It is more controversial to say that the renaissance 50AD-150AD is a pale shadow of the Hellenistic period, but it is. In 145BC Rome fomented a civil war in Egypt, destroying Alexandria, the greatest center of learning. In 133BC, the king of Pergamon tried to avoid this fate by donating the second center of learning. It was peaceful, but science did not survive.

I don't know what you're referring to by "Rome fomented a civil war in Egypt in 145 BC". 145 BC is when Ptolemy VI died; but as far as I know, there was no single "civil war" following his death, Alexandria was not destroyed, and Rome was not involved directly in Egyptian politics for a long time to come. Alexandria remained one of the major urban centers of the Mediterranean until the 3rd century AD - perhaps even the largest one.

Comment by Ege Erdil (ege-erdil) on The lost millennium · 2023-08-25T00:13:32.759Z · LW · GW

Well, that's true, but at some level, what else could it possibly be? What other cause could be behind the long-run expansion in the first place, so many millennia after humans spanned every continent but Antarctica?

Technological progress being responsible for the long-run trend doesn't mean you can attribute local reversals to humans hitting limits to technological progress. Just as a silly example, the emergence of a new strain of plague could have led to the depopulation of urban centers, which lowers R&D efficiency because you lose concentrations of people working together, and thus lowers the rate of technological progress. I'm not saying this is what actually happened, but it seems like a possible story to me.

I'm very skeptical about explanations involving wars and plagues, except insofar as those impact technological development and infrastructure, because a handful of generations is plenty to get back to the Malthusian limit even if a majority of the population dies in some major event (especially regional events where you can then also get migration or invasion from less affected regions).

I agree, but why would you assume wars and plagues can't impact technological development and infrastructure?

Comment by Ege Erdil (ege-erdil) on The lost millennium · 2023-08-24T15:22:00.062Z · LW · GW

McEvedy and Jones actually discuss a regional breakdown in the final section of the book, but they speculate too much for the discussion to be useful, I think. They attribute any substantial slowdown in growth rates to population running up against technological limits, which seems like a just-so story that could explain anything.

They note that the 3rd century AD appears to have been a critical time, as it's when population growth trends reversed in both Europe and China at around the same time: in Europe with the Crisis of the Third Century, and in China with the fall of the reconstituted Han dynasty and the beginning of the Three Kingdoms period. They attribute this to technological constraints, which seems like an unsupported assertion to me.

The other important population center is India, where we have very few records compared to Europe and China. Datasets generally report naively extrapolated smooth curves for the Indian population before the modern period, and that's because there really isn't much else to do due to the scarcity of useful information. This doesn't mean that we actually expect population growth in India to have been smooth, just that in the absence of more information our best guess for each date should probably be a smoothly increasing function of the date. As McEvedy and Jones put it, "happy is the graph that has no history".

I agree that locations isolated from Eurasia would most likely not show the same population trends, but Eurasia was ~75% of the world's population in the first millennium and so events in Eurasia dominate what happens to the global population.

Comment by Ege Erdil (ege-erdil) on The lost millennium · 2023-08-24T09:35:50.621Z · LW · GW

I've actually written about this subject before, and I agree that the first plague pandemic could have been significant: perhaps killing around 8% of the global population in the four years from 541 to 544. However, it's also worth noting that our evidence for this decline is rather scant; we know that the death toll was very high in Constantinople but not much about what happened outside the capital, mostly because nobody was there to write it down. So it's also entirely conceivable that the death toll was much lower than this. The controversy about this continues to this day in the literature, as far as I know.

The hypothesis that the bubonic plague was responsible is interesting, but by itself doesn't explain the more granular data which suggests the slowdown starts around 200 BC and we already see close to no growth in global population from e.g. 200 AD to 500 AD. HYDE doesn't have this, but the McEvedy and Jones dataset does.

It's possible, and perhaps even likely, that the explanation is not monocausal. In this case, the first plague pandemic could have been one of the many factors that dragged population growth down throughout the first millennium.

Comment by Ege Erdil (ege-erdil) on The lost millennium · 2023-08-24T05:50:08.503Z · LW · GW

In the west, I think the fall of the Western Roman Empire was probably a significant hit, and caused a major setback in economic growth in Europe.

Attribution of causality is tricky with this event, but I would agree if you said the fall coincided with a major slowdown in European economic growth.

China had its bloody Three Kingdoms period, and later the An Lushan rebellion.

I think a problem re: China is that a lot of population decline estimates for China are based on the official census, and as far as I know China didn't have a formal census before the Xin dynasty, and certainly not before unification in the 3rd century BC. So the fact that we don't see comparable population declines reported may just be an artifact of that measurement issue. We certainly see plenty of them in the second millennium.

There was the Muslim conquest of the Mediterranean, Persia and Pakistan, though I don't know if that was unusually bloody.

I haven't seen estimates of this that put it anywhere near the Mongol conquests, so I would assume not particularly bloody relative to what was to come later. I would also guess that the Islamic world probably saw significant population growth around that time.

These might be small fluctuations in the grand scheme of things or add up to a period of enough turmoil and strife in the most populous regions of the world to slow growth down.

Yeah, it's possible that this is the explanation, but if so it's rather hard to know because there's no principled way to compare events like these to analogs in other time periods.

Comment by Ege Erdil (ege-erdil) on A short calculation about a Twitter poll · 2023-08-18T16:02:51.728Z · LW · GW

Yeah, that's right. Fixed.

Comment by Ege Erdil (ege-erdil) on A short calculation about a Twitter poll · 2023-08-15T18:08:48.257Z · LW · GW

If people vote as if their individual vote determines the vote of a non-negligible fraction of the voter pool, then you only need (averaged over the whole population, so the value of the entire population is instead of , which seems much more realistic.

So voting blue can make sense for a sufficiently large coalition of "ordinary altruists" with who are able to pre-commit to their vote and think people outside the coalition might vote blue by mistake etc. rather than the "extraordinary altruists" we need in the original situation with . Ditto if you're using a decision theory where it makes sense to suppose such a commitment already exists when making your decision.

Comment by Ege Erdil (ege-erdil) on A short calculation about a Twitter poll · 2023-08-14T23:16:26.959Z · LW · GW

That would be questioning the assumption that your cost function as an altruist should be linear in the number of lives lost. I'm not sure why you would question this assumption, though; it seems rather unnatural to make this a concave function, which is what you would need for your logic to work.

Comment by Ege Erdil (ege-erdil) on When do "brains beat brawn" in Chess? An experiment · 2023-07-06T10:58:42.503Z · LW · GW

I'm surprised by how much this post is getting upvoted. It gives us essentially zero information about any question of importance, for reasons that have already been properly explained by other commenters:

  • Chess is not like the real world in important respects. What the threshold is for material advantage such that a 1200 Elo player could beat Stockfish at chess tells us basically nothing about what the threshold is for humans, either individually or collectively, to beat an AGI in some real-world confrontation. This point is so trivial that I feel somewhat embarrassed to be making it, but I have to think that people are just not getting the message here.

  • Even focusing only on chess, the argument here is remarkably weak because Stockfish is not a system trained to beat weaker opponents with piece odds. There are Go AIs that have been trained for this kind of thing, e.g. KataGo can play reasonably well in positions with a handicap if you tell it that its opponent is much weaker than itself. In my experience, KataGo running on consumer hardware can give the best players in the world 3-4 stones and have an even game.

If someone could try to convince me that this experiment was not pointless and actually worth running for some reason, I would be interested to hear their arguments. Note that I'm more sympathetic to "this kind of experiment could be valuable if run in the right environment", and my skepticism is specifically about running it for chess.

Comment by Ege Erdil (ege-erdil) on What in your opinion is the biggest open problem in AI alignment? · 2023-07-03T23:17:25.124Z · LW · GW

Are neural networks trained using reinforcement learning from human feedback in a sufficiently complex environment biased towards learning the human simulator or the direct translator, in the sense of the ELK report?

I think there are arguments in both directions and it's not obvious which solution a neural network would prefer if trained in a sufficiently complex environment. I also think the question is central to how difficult we should expect aligning powerful systems trained in the current paradigm to be.

Comment by Ege Erdil (ege-erdil) on Automatic Rate Limiting on LessWrong · 2023-06-23T20:33:16.086Z · LW · GW

I'm curious if these rate limits were introduced as a consequence of some recent developments. Has the website been having more problems with spam and low-quality content lately, or has the marginal benefit of making these changes gone up in some other way?

It could also be that you had this idea only recently and in retrospect it had been a good idea for a long time, of course.

Comment by Ege Erdil (ege-erdil) on When is correlation transitive? · 2023-06-23T19:13:16.924Z · LW · GW

Yes, in practice having a model of what is actually driving the correlations can help you do better than these estimates. A causal model would be helpful for that.

The product estimate for the expected correlation is only useful in a setting where nothing else is known about the relationship between the three variables than the two correlations, but in practice you often have some beliefs about what drives the correlations you observe, and if you're a good Bayesian you should of course also condition on all of that.
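To make the "nothing else is known" baseline concrete, here is a minimal sketch of the constraint that two correlations place on the third. It relies only on the fact that a 3x3 correlation matrix must be positive semidefinite, which gives (r_xz - r_xy * r_yz)^2 <= (1 - r_xy^2)(1 - r_yz^2). The function name `correlation_bounds` is hypothetical, chosen for illustration. The midpoint of the allowed interval coincides with the product of the two known correlations.

```python
import math


def correlation_bounds(r_xy, r_yz):
    """Return (lower, midpoint, upper) for corr(X, Z) given corr(X, Y) and corr(Y, Z).

    Derived from positive semidefiniteness of the 3x3 correlation matrix.
    The midpoint equals the product r_xy * r_yz.
    """
    midpoint = r_xy * r_yz
    # Half-width of the allowed interval; it shrinks to zero as either |r| -> 1.
    slack = math.sqrt((1.0 - r_xy ** 2) * (1.0 - r_yz ** 2))
    return midpoint - slack, midpoint, midpoint + slack


if __name__ == "__main__":
    for r in (0.5, 0.9, 0.99):
        lower, midpoint, upper = correlation_bounds(r, r)
        print(f"r = {r}: corr(X, Z) in [{lower:.3f}, {upper:.3f}], midpoint {midpoint:.3f}")
```

With two correlations of 0.5 the interval still includes negative values, so the sign of the third correlation is not pinned down; it only becomes guaranteed positive once the two known correlations are strong enough that the product exceeds the slack.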

Comment by Ege Erdil (ege-erdil) on When is correlation transitive? · 2023-06-23T19:10:32.853Z · LW · GW

That's a reasonable picture to have in expectation, yeah.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-21T18:17:09.135Z · LW · GW

As an aside, I've tried to work out what the optimal learning rate for a large language model should be based on the theory in the post, and if I'm doing the calculations correctly (which is a pretty big if) it doesn't match actual practice very well, suggesting there is actually something important missing from this picture.

Essentially, the coefficient should be where is the variance of the per-parameter noise in SGD. If you have a learning rate , you scale the objective you're optimizing by a factor and the noise variance by a factor . Likewise, a bigger batch size lowers the noise variance by a linear factor. So the equilibrium distribution ends up proportional to

where is the per-token average loss and should be equal to the mean square of the partial derivative of the per-token loss function with respect to one of the neural network parameters. If the network is using some decent batch or layer normalization this should probably be where is the model size.

We want what's inside the exponential to just be , because we want the learning to be equivalent to doing a Bayesian update over the whole data. This suggests we should pick

which is a pretty bad prediction. So there's probably something important that's being left out of this model. I'm guessing that a smaller learning rate just means you end up conditioning on minimum loss and that's all you need in practice, and larger learning rates cause problems with convergence.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-21T09:43:25.011Z · LW · GW

That's useful to know, thanks. Is anything else known about the properties of the noise covariance beyond "it's not constant"?

Some comments on the paper itself: if the problem is that SGD with homoskedastic Gaussian noise fails to converge to a stationary distribution, why don't they define SGD over a torus instead? Seems like it would fix the problem they are talking about, and if it doesn't change the behavior it means their explanation of what's going on is incorrect.

If the only problem is that with homoskedastic Gaussian noise convergence to a stationary distribution is slow (when a stationary distribution does exist), I could believe that. Similar algorithms such as Metropolis-Hastings also have pretty abysmal convergence rates in practice when applied to any kind of complicated problem. It's possible that SGD with batch noise has better regularization properties and therefore converges faster, but I don't think that changes the basic qualitative picture I present in the post.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-20T17:12:18.659Z · LW · GW

Check the Wikipedia section for the stationary distribution of the overdamped Langevin equation.

I should probably clarify that it's difficult to have a rigorous derivation of this claim in the context of SGD in particular, because it's difficult to show absence of heteroskedasticity in SGD residuals. Still, I believe that this is probably negligible in practice, and in principle this is something that can be tested by experiment.
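The stationary-distribution claim can at least be checked numerically in the idealized homoskedastic case. The sketch below is an assumption-laden toy, not a model of SGD: it runs Euler-Maruyama on the one-dimensional overdamped Langevin equation dx = -U'(x) dt + sqrt(2 T) dW with the hypothetical potential U(x) = x^2 / 2, for which the stationary density proportional to exp(-U(x) / T) is a Gaussian with variance T. The names `simulate_langevin` and `sample_variance` are illustrative.

```python
import math
import random


def simulate_langevin(temperature, n_chains=2000, n_steps=500, dt=0.01, seed=0):
    """Euler-Maruyama for dx = -x dt + sqrt(2 T) dW, i.e. U(x) = x^2 / 2.

    Returns the final position of each independent chain.
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * temperature * dt)
    xs = [0.0] * n_chains
    for _ in range(n_steps):
        # Gradient step on U plus constant-variance (homoskedastic) Gaussian noise.
        xs = [x - x * dt + noise_scale * rng.gauss(0.0, 1.0) for x in xs]
    return xs


def sample_variance(xs):
    """Unbiased sample variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


if __name__ == "__main__":
    for temperature in (0.5, 2.0):
        variance = sample_variance(simulate_langevin(temperature))
        print(f"T = {temperature}: empirical variance {variance:.3f}")
```

The empirical variance should land close to T, up to sampling error and a small discretization bias from the finite step size. Whether SGD noise is close enough to homoskedastic for this picture to carry over is exactly the open empirical question.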

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-20T12:39:40.040Z · LW · GW

Sure, I agree that I didn't put this information into the post. However, why do you need to know which is more likely to know anything about e.g. how neural networks generalize?

I understand that SLT has some additional content beyond what is in the post, and I've tried to explain how you could make that fit in this framework. I just don't understand why that additional content is relevant, which is why I left it out.

As an additional note, I wasn't really talking about floating point precision being the important variable here. I'm just saying that if you want -complexity to match the notion of real log canonical threshold, you have to discretize SLT in a way that might not be obvious at first glance, and in a way where some conclusions end up being scale-dependent. This is why if you're interested in studying this question of the relative contribution of singular points to the partition function, SLT is a better setting to be doing it in. At the risk of repeating myself, I just don't know why you would try to do that.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-20T08:52:57.733Z · LW · GW

You need to discretize the function before taking preimages. If you just take preimages in the continuous setting, of course you're not going to see any of the interesting behavior SLT is capturing.

In your case, let's say that we discretize the function space by choosing which one of the functions you're closest to for some . In addition, we also discretize the codomain of by looking at the lattice for some . Now, you'll notice that there's a radius disk around the origin which contains only functions mapping to the zero function, and as our lattice has fundamental area this means the "relative weight" of the singularity at the origin is like .

In contrast, all other points mapping to the zero function only get a relative weight of where is the absolute value of their nonzero coordinate. Cutting off the domain somewhere to make it compact and summing over all to exclude the disk at the origin gives for the total contribution of all the other points in the minimum loss set. So in the limit the singularity at the origin accounts for almost everything in the preimage of . The origin is privileged in my picture just as it is in the SLT picture.

I think your mistake is that you're trying to translate between these two models too literally, when you should be thinking of my model as a discretization of the SLT model. Because it's a discretization at a particular scale, it doesn't capture what happens as the scale is changing. That's the main shortcoming relative to SLT, but it's not clear to me how important capturing this thermodynamic-like limit is to begin with.

Again, maybe I'm misrepresenting the actual content of SLT here, but it's not clear to me what SLT says aside from this, so...

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-19T17:07:45.861Z · LW · GW

I'm not too sure how to respond to this comment because it seems like you're not understanding what I'm trying to say.

I agree there's some terminology mismatch, but this is inevitable because SLT is a continuous model and my model is discrete. If you want to translate between them, you need to imagine discretizing SLT, which means you discretize both the codomain of the neural network and the space of functions you're trying to represent in some suitable way. If you do this, then you'll notice that the worse a singularity is, the lower the -complexity of the corresponding discrete function will turn out to be, because many of the neighbors map to the same function after discretization.

The content that SLT adds on top of this is what happens in the limit where your discretization becomes infinitely fine and your dataset becomes infinitely large, but your model doesn't become infinitely large. In this case, SLT claims that the worst singularities dominate the equilibrium behavior of SGD, which I agree is an accurate claim. However, I'm not sure what this claim is supposed to tell us about how NNs learn. I can't make any novel predictions about NNs with this knowledge that I couldn't before.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-19T12:27:47.689Z · LW · GW

I don't think this representation of the theory in my post is correct. The effective dimension of the singularity near the origin is much higher, e.g. because near every other minimal point of this loss function the Hessian doesn't vanish, while for the singularity at the origin it does vanish. If you discretized this setup by looking at it with a lattice of mesh , say, you would notice that the origin is surrounded by many parameters that give nearly identical loss, while near other parts of the space there are far fewer such parameters.

The reason you have to do some kind of "translation" between the two theories is that SLT can see not just exactly optimal points but also nearly optimal points, and bad singularities are surrounded by many more nearly optimal points than better-behaved singularities. You can interpret the discretized picture above as the SLT picture seen at some "resolution" or "scale" , i.e. if you discretized the loss function by evaluating it on a lattice with mesh you get my picture. Of course, this loses the information of what happens as and in some thermodynamic limit, which is what you recover when you do SLT.

I just don't see what this thermodynamic limit tells you about the learning behavior of NNs that we didn't know before. We already know NNs approximate Solomonoff induction if the -complexity is a good approximation to Kolmogorov complexity and so forth. What additional information is gained by knowing what looks like as a smooth function as opposed to a discrete function?

In addition, the strong dependence of SLT on being analytic is bad, because analytic functions are rigid: their value in a small open subset determines their value globally. I can see why you need this assumption because quantifying what happens near a singularity becomes incredibly difficult for general smooth functions, but because of the rigidity of analytic functions the approximation that "we can just pretend NNs are analytic" is more pernicious than e.g. "we can just pretend NNs are smooth". Typical approximation theorems like Stone-Weierstrass also fail to save you because they only work in the sup-norm and that's completely useless for determining behavior at singularities. So I'm yet to be convinced that the additional details in SLT provide a more useful account of NN learning than my simple description above.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-19T10:21:32.324Z · LW · GW

Can you give an example of which has the mode of singularity you're talking about? I don't think I'm quite following what you're talking about here.

In SLT is assumed analytic, so I don't understand how the Hessian can fail to be well-defined anywhere. It's possible that the Hessian vanishes at some point, suggesting that the singularity there is even worse than quadratic, e.g. at the origin or something like that. But even in this regime essentially the same logic is going to apply - the worse the singularity, the further away you can move from it without changing the value of very much, and accordingly the singularity contributes more to the volume of the set as .

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-18T17:46:40.583Z · LW · GW

Say that you have a loss function . The minimum loss set is probably not exactly , but it has something to do with that, so let's pretend that it's exactly that for now.

This is a collection of equations that are generically independent and so should define a subset of dimension zero, i.e. a collection of points in . However, there might be points at which the vanishing of the partial derivatives doesn't define independent equations, so we get something of positive dimension.

In these cases, what happens is that the gradient itself has vanishing derivatives in some directions. In other words, the Hessian matrix fails to be of full rank. Say that this matrix has rank at a specific singular point and consider the set . Diagonalizing will generically bring into a form where it's the linear combination of quadratic terms and higher-order cubic terms, and locally the volume contribution to this set around will be something of order . The worse the singularity, the smaller the rank and the greater the volume contribution of the singularity to the set .

The worst singularities dominate the behavior at small because you can move "much further" along vectors where scales in a cubic fashion than along directions where it scales in a quadratic fashion, so those dimensions are the only ones that "count" in some calculation when you compare singularities. The tangent space intuition doesn't apply directly here but something like that still applies, in the sense that the worse a singularity, the more directions you have in which to move away from it without changing the value of the loss very much.
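As a rough numerical sketch of the volume argument, take two toy two-parameter losses (hypothetical, chosen only for illustration): x^2 + y^2, whose Hessian at the origin has full rank, and x^2 + |y|^3, where one direction is cubic. Assuming each quadratic direction contributes a factor of eps^(1/2) and each cubic direction a factor of eps^(1/3), the volume of the set {loss < eps} should scale like eps^1 in the first case and eps^(5/6) in the second. The code counts lattice points, in the spirit of the discretized picture, and estimates the exponent from two values of eps.

```python
import math


def lattice_volume(loss, eps, n=400):
    """Approximate the area of {loss < eps} in [-1, 1]^2 by counting lattice points."""
    h = 2.0 / n  # lattice mesh
    coords = [-1.0 + (i + 0.5) * h for i in range(n)]
    count = sum(1 for x in coords for y in coords if loss(x, y) < eps)
    return count * h * h


def scaling_exponent(loss, eps_hi=0.1, eps_lo=0.01):
    """Estimate k in volume ~ eps^k from the log-ratio at two values of eps."""
    v_hi = lattice_volume(loss, eps_hi)
    v_lo = lattice_volume(loss, eps_lo)
    return math.log(v_hi / v_lo) / math.log(eps_hi / eps_lo)


def quadratic_loss(x, y):
    # Full-rank Hessian at the origin: both directions are quadratic.
    return x * x + y * y


def cubic_loss(x, y):
    # Rank-one Hessian at the origin: the y direction is cubic.
    return x * x + abs(y) ** 3


if __name__ == "__main__":
    print("quadratic exponent:", scaling_exponent(quadratic_loss))
    print("cubic exponent:", scaling_exponent(cubic_loss))
```

The smaller exponent for the degenerate loss means its near-minimum set shrinks more slowly as eps goes to zero, which is the sense in which the worse singularity dominates at small eps.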

Is this intuitive now? I'm not sure what more to do to make the result intuitive.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-18T16:58:11.268Z · LW · GW

I think this is a very nice way to present the key ideas. However, in practice I think the discretisation is actually harder to reason about than the continuous version. There are deeper problems, but I'd start by wondering how you would ever compute c(f) defined this way, since it seems to depend in an intricate way on the details of e.g. the floating point implementation.

I would say that the discretization is going to be easier for people with a computer science background to grasp, even though formally I agree it's going to be less pleasant to reason about or to do computations with. Still, if properties of NNs that only appeared when they are continuous functions on were essential for their generalization, we might be in trouble as people keep lowering the precision of their floating point numbers. This explanation makes it clear that while assuming NNs are continuous (or even analytic!) might be useful for theoretical purposes, the claims about generalization hold just as well in a more realistic discrete setting.

I'll note that the volume codimension definition of the RLCT is essentially what you have written down here, and you don't need any mathematics beyond calculus to write that down. You only need things like resolutions of singularities if you actually want to compute that value, and the discretisation doesn't seem to offer any advantage there.

Yes, my definition is inspired by the volume codimension definition, though here we don't need to take a limit as some $\varepsilon \to 0$ because the counting measure makes our life easy. The problem you have in a smooth setting is that descending the Lebesgue measure in a dumb way to subspaces with positive codimension gives trivial results, so more care is necessary to recover and reason about the appropriate notions of volume.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-18T16:46:20.043Z · LW · GW

I don't think this is something that requires explanation, though. If you take an arbitrary geometric object in maths, a good definition of its singular points will be "points where the tangent space has higher dimension than expected". If this is the minimum set of a loss function and the tangent space has higher dimension than expected, that intuitively means that locally there are more directions you can move along without changing the loss function, probably suggesting that there are more directions you can move along without changing the function being implemented at all. So the function being implemented is simple, and the rest of the argument works as I outline it in the post.

I think I understand what you and Jesse are getting at, though: there's a particular behavior that only becomes visible in the smooth or analytic setting, which is that minima of the loss function that are more singular become more dominant as $\beta \to \infty$ in the Boltzmann integral, as opposed to maintaining just the same dominance factor of $e^{-c(f)}$. You don't see this in the discrete case because there's a finite nonzero gap in loss between first-best and second-best fits, and so the second-best fits are exponentially punished in the limit and become irrelevant, while in the singular case any first-best fit has some second-best "space" surrounding it whose volume is more concentrated towards the singularity point.

While I understand that, I'm not too sure what predictions you would make about the behavior of neural networks on the basis of this observation. For instance, if this smooth behavior is really essential to the generalization of NNs, wouldn't we predict that generalization would become worse as people switch to lower precision floating point numbers? I don't think that prediction would have held up very well if someone had made it 5 years ago.

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-18T16:21:23.905Z · LW · GW

To me that just sounds like you're saying the integral is dominated by the contribution of the simplest functions that are of minimum loss, and the contribution factor scales like where is the effective dimensionality near the singularity representing this function, equivalently the complexity of said function. That's exactly what I'm saying in my post - where is the added content here?

Comment by Ege Erdil (ege-erdil) on My impression of singular learning theory · 2023-06-18T16:08:43.373Z · LW · GW

None of this is specific to singular learning theory. The basic idea that the parameter-function map might be degenerate and biased towards simple functions predates SLT (at least this most recent wave of interest in its application to neural nets anyway) and indeed goes back to the 90s, no algebraic geometry required.

Sure, I'm aware that people have expressed these ideas before, but I have trouble understanding what is added by singular learning theory on top of this description. To me, much of singular learning theory looks like trying to do these kinds of calculations in an analytic setting where things become quite a bit more complicated, for example because you no longer have the basic counting function to measure the effective dimensionality of a singularity, forcing you to reach for concepts like "real log canonical threshold" instead.

As far as I can tell, the non-trivial content of SLT is that the averaging over parameters with a given loss is dominated by singular points in the limit because volume clusters there as you take an ever-narrower interval around the minimum set.

I'm not sure why we should expect that beyond the argument I already give in the post. The geometry of the loss landscape is already fully accounted for by the Boltzmann factor; what else does singular learning theory add here?

Maybe this is also what you're confused about when you say "I don't see a mechanism by which SGD is supposed to be attracted to such points".

Comment by Ege Erdil (ege-erdil) on DSLT 0. Distilling Singular Learning Theory · 2023-06-18T15:09:31.738Z · LW · GW

I'm kind of puzzled by the amount of machinery that seems to be going into these arguments, because it seems to me that there is a discrete analog of the same arguments which is probably both more realistic (as neural networks are not actually continuous, especially with people constantly decreasing the precision of the floating point numbers used in implementation) and simpler to understand.

Suppose you represent a neural network architecture as a map $\mathcal{A} : 2^N \to \mathcal{F}$ where $2 = \{0, 1\}$ and $\mathcal{F}$ is the set of all possible computable functions from the input and output space you're considering. In thermodynamic terms, we could identify elements of $2^N$ as "microstates" and the corresponding functions that the NN architecture $\mathcal{A}$ maps them to as "macrostates".

Furthermore, suppose that $\mathcal{F}$ comes together with a loss function $L : \mathcal{F} \to \mathbb{R}$ evaluating how good or bad a particular function is. Assume you optimize $L$ using something like stochastic gradient descent on the function $L \circ \mathcal{A}$ with a particular learning rate.

Then, in general, we have the following results:

  • SGD defines a Markov chain structure on the space $2^N$ whose stationary distribution is proportional to $e^{-\beta L(\mathcal{A}(\theta))}$ on parameters $\theta$ for some positive constant $\beta$. This is just a basic fact about the Langevin dynamics that SGD would induce in such a system.
  • In general $\mathcal{A}$ is not injective, and we can define the "$\mathcal{A}$-complexity" of any function $f \in \mathrm{Im}(\mathcal{A}) \subseteq \mathcal{F}$ as $c(f) = N \log 2 - \log |\mathcal{A}^{-1}(f)|$. Then, the probability that we arrive at the macrostate $f$ is going to be proportional to $e^{-c(f) - \beta L(f)}$.
  • When $L$ is some kind of negative log likelihood, this approximates Solomonoff induction in a tempered Bayes paradigm insofar as the $\mathcal{A}$-complexity $c(f)$ is a good approximation for the Kolmogorov complexity of the function $f$, which will happen if the function approximator defined by $\mathcal{A}$ is sufficiently well-behaved.
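
To make the counting argument concrete, here is a minimal Python sketch on a toy example. Everything specific in it is an assumption made for illustration: the map `toy_architecture` (it sends a bit string to its number of ones), the loss `toy_loss`, and the values of `N` and `BETA` are hypothetical, and the notation c(f) = N log 2 - log |A^{-1}(f)| for the complexity and beta for the positive constant is assumed as well. The sketch sums Boltzmann weights over microstates and checks that the resulting macrostate probabilities agree with weights proportional to exp(-c(f) - beta L(f)).

```python
import math
from collections import Counter
from itertools import product

N = 10       # number of parameter bits (hypothetical)
BETA = 2.0   # positive constant in the Boltzmann factor (hypothetical)


def toy_architecture(theta):
    """Hypothetical non-injective map from microstates to macrostates."""
    return sum(theta)


def toy_loss(f):
    """Hypothetical loss on macrostates, minimized at f = 3."""
    return abs(f - 3)


def normalize(weights):
    """Rescale a dict of non-negative weights so the values sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


microstates = list(product((0, 1), repeat=N))

# Number of microstates mapped to each macrostate, i.e. |A^{-1}(f)|.
multiplicity = Counter(toy_architecture(theta) for theta in microstates)

# Complexity c(f) = N log 2 - log |A^{-1}(f)|.
complexity = {f: N * math.log(2) - math.log(m) for f, m in multiplicity.items()}

# Route 1: add up the Boltzmann weight of every microstate, grouped by macrostate.
micro_weights = Counter()
for theta in microstates:
    f = toy_architecture(theta)
    micro_weights[f] += math.exp(-BETA * toy_loss(f))
p_from_microstates = normalize(micro_weights)

# Route 2: use the macrostate formula exp(-c(f) - beta * L(f)) directly.
p_from_complexity = normalize(
    {f: math.exp(-complexity[f] - BETA * toy_loss(f)) for f in multiplicity}
)

for f in sorted(multiplicity):
    print(f, round(complexity[f], 3),
          round(p_from_microstates[f], 4), round(p_from_complexity[f], 4))
```

In this toy setup, macrostates with many preimages start with a low complexity penalty, and the loss term has to overcome that head start for a rarer macrostate to dominate.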

Is there some additional content of singular learning theory that goes beyond the above insights?

Edit: I've converted this comment to a post, which you can find here.

Comment by Ege Erdil (ege-erdil) on Are Bayesian methods guaranteed to overfit? · 2023-06-18T09:04:27.473Z · LW · GW

A tangential question: Does the overfitting issue from Bayesian statistics have an analog in Bayesian epistemology, i.e. when we only deal with propositional subjective degrees of belief, not with random variables and models?

I think the problem is the same in both cases. Roughly speaking, there is some "appropriate amount" of belief updating to try to fit your experiences, and this appropriate amount is described by Bayes' rule under ideal conditions where

  • it's computationally feasible to perform the full Bayesian update, and
  • the correct model is within the class of models you're performing the update over.

If either of these is not true, then in general you don't know which update is good. If your class of models is particularly bad, it can be preferable to stick to an ignorance prior and perform no update at all.

Asymptotically, all update rules within the tempered Bayes paradigm (Bayes but likelihoods are raised to an exponent that's not in general equal to 1) in a stationary environment (i.i.d. samples and such) converge to MLE, where you have guarantees of eventually landing in a part of your model space which has minimal KL divergence with the true data generating process. However, this is an asymptotic guarantee, so it doesn't necessarily tell us what we should be doing when our sample is finite. Moreover, this guarantee is no longer valid if the data-generating process is not stationary, e.g. if you're drawing one long string of correlated samples from a distribution instead of many independent samples.
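
As a small illustration of the asymptotic claim, here is a sketch for the simplest case I can think of: a coin with unknown bias, a uniform prior, and a likelihood raised to an exponent `temperature`. These modelling choices and the function names are assumptions made for the example. Under them the tempered posterior is Beta(1 + t·h, 1 + t·(n - h)) for h heads in n flips, so its mean can be written in closed form and compared with the MLE h/n.

```python
def tempered_posterior_mean(heads, flips, temperature):
    """Mean of the tempered posterior Beta(1 + t*h, 1 + t*(n - h)) under a uniform prior."""
    alpha = 1.0 + temperature * heads
    beta = 1.0 + temperature * (flips - heads)
    return alpha / (alpha + beta)


def mle(heads, flips):
    """Maximum likelihood estimate of the coin's bias."""
    return heads / flips


for flips in (10, 1_000, 100_000):
    heads = round(0.7 * flips)
    for temperature in (0.25, 1.0, 4.0):
        gap = abs(tempered_posterior_mean(heads, flips, temperature) - mle(heads, flips))
        print(f"n={flips:>6} t={temperature:<4} |posterior mean - MLE| = {gap:.2e}")
```

The exponent changes how fast the posterior mean approaches the MLE, not where it ends up, which is the sense in which the guarantee is only asymptotic.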

Using Bayes' rule at least gets the right credence ratios between the different models you're considering, but it's not clear if this is optimal from the point of view of e.g. an agent trying to maximize expected utility in an environment.

I think in practice the way people deal with these problems is to use a "lazily evaluated" version of the Bayesian paradigm. They start with an initial class of models $M$, and perform usual Bayes until they notice that none of the models in $M$ seem to fit the data very well. They then search for an expanded class of models $M'$ which can still fit the data well while trying to balance between the increased dimensionality of the models in $M'$ and their better fit with data, and if a decent match is found, they keep using $M'$ from that point on, etc.

Comment by Ege Erdil (ege-erdil) on Transformative AGI by 2043 is <1% likely · 2023-06-14T07:07:15.184Z · LW · GW

Probably true, and this could mean the brain has some substantial advantage over today's hardware (like 1 OOM, say) but at the same time the internal mechanisms that biology uses to establish electrical potential energy gradients and so forth seem so inefficient. Quoting Eliezer:

I'm confused at how somebody ends up calculating that a brain - where each synaptic spike is transmitted by ~10,000 neurotransmitter molecules (according to a quick online check), which then get pumped back out of the membrane and taken back up by the synapse; and the impulse is then shepherded along cellular channels via thousands of ions flooding through a membrane to depolarize it and then getting pumped back out using ATP, all of which are thermodynamically irreversible operations individually - could possibly be within three orders of magnitude of max thermodynamic efficiency at 300 Kelvin. I have skimmed "Brain Efficiency" though not checked any numbers, and not seen anything inside it which seems to address this sanity check.

Comment by Ege Erdil (ege-erdil) on Transformative AGI by 2043 is <1% likely · 2023-06-14T07:00:00.928Z · LW · GW

It's my assumption because our brains are AGI for ~20 W.

I think that's probably the crux. I think the evidence that the brain is not performing that much computation is reasonably good, so I attribute the difference to algorithmic advantages the brain has, particularly ones that make the brain more data efficient relative to today's neural networks.

The brain being more data efficient I think is hard to dispute, but of course you can argue that this is simply because the brain is doing a lot more computation internally to process the limited amount of data it does see. I'm more ready to believe that the brain has some software advantage over neural networks than to believe that it has an enormous hardware advantage.

Comment by Ege Erdil (ege-erdil) on Transformative AGI by 2043 is <1% likely · 2023-06-14T06:53:30.245Z · LW · GW

Huh, I wonder why I read 7e2 W as 70 W. Strange mistake.

Comment by Ege Erdil (ege-erdil) on Transformative AGI by 2043 is <1% likely · 2023-06-14T06:52:27.955Z · LW · GW

I'm posting this as a separate comment because it's a different line of argument, but I think we should also keep it in mind when making estimates of how much computation the brain could actually be using.

If the brain is operating at a frequency of (say) 10 Hz and is doing 1e20 FLOP/s, that suggests the brain has something like 1e19 floating point parameters, or maybe specifying the "internal state" of the brain takes something like 1e20 bits. If you want to properly train a neural network of this size, you need to update on a comparable amount of useful entropy from the outside world. This means you have to believe that humans are receiving on the order of 1e11 bits or 10 GB of useful information about the world to update on every second if the brain is to be "fully trained" by the age of 30, say.

An estimate of 1e15 FLOP/s brings this down to a more realistic 100 KB or so, which still seems like a lot but is somewhat more believable if you consider the potential information content of visual and auditory stimuli. I think even this is an overestimate and that the brain has some algorithmic insights which make it somewhat more data efficient than contemporary neural networks, but I think the gap implied by 1e20 FLOP/s is rather too large for me to believe it.
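
The arithmetic behind these two figures can be sketched as follows. The function name and the default of 10 bits per parameter are assumptions; the latter is simply what the figures above imply when 1e19 parameters correspond to roughly 1e20 bits of internal state.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def required_bits_per_second(flop_per_second, frequency_hz=10.0,
                             bits_per_parameter=10.0, training_years=30.0):
    """Useful bits per second needed to fill the brain's state within training_years."""
    parameters = flop_per_second / frequency_hz
    state_bits = parameters * bits_per_parameter
    return state_bits / (training_years * SECONDS_PER_YEAR)


for flop_per_second in (1e20, 1e15):
    bits = required_bits_per_second(flop_per_second)
    print(f"{flop_per_second:.0e} FLOP/s -> {bits:.2e} bits/s ({bits / 8:.2e} bytes/s)")
```

With these inputs the results come out at roughly 1e11 bits per second for 1e20 FLOP/s and roughly 1e6 bits per second for 1e15 FLOP/s, matching the orders of magnitude quoted above.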

Comment by Ege Erdil (ege-erdil) on Transformative AGI by 2043 is <1% likely · 2023-06-14T06:36:10.998Z · LW · GW

2e6 eV are spent per FP16 operation... This is 1e8 times higher than the Landauer limit of 2e-2 eV per bit erasure at 70 C (and the ratio of bit erasures per FP16 operation is unclear to me; let's pretend it's O(1))

2e-2 eV for the Landauer limit is right, but 2e6 eV per FP16 operation is off by one order of magnitude. (70 W)/(2e15 FLOP/s) = 0.218 MeV. So the gap is 7 orders of magnitude assuming one bit erasure per FLOP.

This is wrong, the power consumption is 700 W so the gap is indeed 8 orders of magnitude.

An H100 SXM has 8e10 transistors, 2e9 Hz boost frequency, 700 W of max power consumption...

8e10 * 2e9 = 1.6e20 transistor switches per second. This happens with a power consumption of 700 W, suggesting that each switch dissipates on the order of 30 eV of energy, which is only 3 OOM or so from the Landauer limit. So this device is actually not that inefficient if you look only at how efficiently it's able to perform switches. My position is that you should not expect the brain to be much more efficient than this, though perhaps gaining one or two orders of magnitude is possible with complex error correction methods.

Of course, the transistors supporting per FLOP and the switching frequency gap have to add up to the 8 OOM overall efficiency gap we've calculated. However, it's important that most of the inefficiency comes from the former and not the latter. I'll elaborate on this later in the comment.
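
Here is a short sketch of the arithmetic in this comment. It uses the figures quoted above (8e10 transistors, 2e9 Hz, 700 W, 2e15 FLOP/s, 70 C) and assumes, as the calculation above does, that every transistor switches once per clock cycle; the variable names are mine.

```python
import math

K_B_EV_PER_K = 8.617333262e-5       # Boltzmann constant in eV/K
EV_PER_JOULE = 1 / 1.602176634e-19  # electronvolts per joule

TRANSISTORS = 8e10
CLOCK_HZ = 2e9
POWER_W = 700.0
FLOP_PER_S = 2e15
TEMPERATURE_K = 273.15 + 70.0


def landauer_limit_ev(temperature_k):
    """Landauer limit k_B * T * ln(2) per bit erasure, in eV."""
    return K_B_EV_PER_K * temperature_k * math.log(2)


# Assumption: every transistor switches once per clock cycle.
switches_per_s = TRANSISTORS * CLOCK_HZ
ev_per_switch = POWER_W * EV_PER_JOULE / switches_per_s
ev_per_flop = POWER_W * EV_PER_JOULE / FLOP_PER_S
switches_per_flop = switches_per_s / FLOP_PER_S

limit = landauer_limit_ev(TEMPERATURE_K)
gap_per_switch = math.log10(ev_per_switch / limit)  # switching inefficiency, in OOM
gap_switch_count = math.log10(switches_per_flop)    # switches spent per FLOP, in OOM
gap_total = math.log10(ev_per_flop / limit)         # overall gap per FLOP, in OOM

print(f"Landauer limit: {limit:.4f} eV")
print(f"Energy per switch: {ev_per_switch:.1f} eV ({gap_per_switch:.1f} OOM above Landauer)")
print(f"Switches per FLOP: {switches_per_flop:.1e} ({gap_switch_count:.1f} OOM)")
print(f"Energy per FLOP: {ev_per_flop:.2e} eV ({gap_total:.1f} OOM above Landauer)")
```

The two partial gaps add up to the total by construction, and under these assumptions most of the total comes from the number of switches per FLOP rather than from the energy per switch.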

This seems pretty inefficient to me!

I agree an H100 SXM is not a very efficient computational device. I never said modern GPUs represent the pinnacle of energy efficiency in computation or anything like that, though similar claims have previously been made by others on the forum.

Positing that brains are ~6 orders of magnitude more energy efficient than today's transistor circuits doesn't seem at all crazy to me. ~6 orders of improvement on 2e6 is ~2 eV per operation, still two orders of magnitude above the 0.02 eV per bit erasure Landauer limit.

Here we're talking about the brain possibly doing 1e20 FLOP/s, which I've previously said is maybe within one order of magnitude of the Landauer limit or so, and not the more extravagant figure of 1e25 FLOP/s. The disagreement here is not about math; we both agree that this performance requires the brain to be 1 or 2 OOM from the bitwise Landauer limit depending on exactly how many bit erasures you think are involved in a single 16-bit FLOP.

The disagreement is more about how close you think the brain can come to this limit. Most of the energy losses in modern GPUs come from the enormous amounts of noise that you need to deal with in interconnects that are closely packed together. To get anywhere close to the bitwise Landauer limit, you need to get rid of all of these losses. This is what would be needed to lower the amount of transistors supporting per FLOP without also simultaneously increasing the power consumption of the device.

I just don't see how the brain could possibly pull that off. The design constraints are pretty similar in both cases, and the brain is not using some unique kind of material or architecture which could eliminate dissipative or radiative energy losses in the system. Just as information needs to get carried around inside a GPU, information also needs to move inside the brain, and moving information around in a noisy environment is costly. So I would expect by default that the brain is many orders of magnitude from the Landauer limit, though I can see estimates as high as 1e17 FLOP/s being plausible if the brain is highly efficient. I just think you'll always be losing many orders of magnitude relative to Landauer as long as your system is not ideal, and the brain is far from an ideal system.

I'll note too that cells synthesize informative sequences from nucleic acids using less than 1 eV of free energy per bit. That clearly doesn't violate Landauer or any laws of physics, because we know it happens.

I don't think you'll lose as much relative to Landauer when you're doing that, because you don't have to move a lot of information around constantly. Transcribing a DNA sequence and other similar operations are local. The reason I think realistic devices will fall far short of Landauer is because of the problem of interconnect: computations cannot be localized effectively, so different parts of your hardware need to talk to each other, and that's where you lose most of the energy. In terms of pure switching efficiency of transistors, we're already pretty close to this kind of biological process, as I've calculated above.

Comment by Ege Erdil (ege-erdil) on Transformative AGI by 2043 is <1% likely · 2023-06-14T04:59:33.487Z · LW · GW

I don't think transistors have too much to do with neurons beyond the abstract observation that neurons most likely store information by establishing gradients of potential energy. When the stored information needs to be updated, that means some gradients have to get moved around, and if I had to imagine how this works inside a cell it would probably involve some kind of proton pump operating across a membrane or something like that. That's going to be functionally pretty similar to a capacitor, and discharging & recharging it probably carries similar free energy costs.

I think what I don't understand is why you're defaulting to the assumption that the brain has a way to store and update information that's much more efficient than what we're able to do. That doesn't sound like a state of ignorance to me; it seems like you wouldn't hold this belief if you didn't think there was a good reason to do so.

Comment by Ege Erdil (ege-erdil) on Transformative AGI by 2043 is <1% likely · 2023-06-14T03:27:57.492Z · LW · GW

Why does switching barriers imply that electrical potential energy is probably being converted to heat? I don't see how that follows at all.

Where else is the energy going to go? Again, in an adiabatic device where you have a lot of time to discharge capacitors and such, you might be able to do everything in a way that conserves free energy. I just don't see how that's going to work when you're (for example) switching transistors on and off at a high frequency. It seems to me that the only place to get rid of the electrical potential energy that quickly is to convert it into heat or radiation.

I think what I'm saying is standard in how people analyze power costs of switching in transistors, see e.g. this post. If you have a proposal for how you think the brain could actually be working to be much more energy efficient than this, I would like to see some details of it, because I've certainly not come across anything like that before.

To what extent do information storage requirements weigh on FLOPS requirements? It's not obvious to me that requirements on energy barriers for long-term storage in thermodynamic equilibrium necessarily bear on transient representations of information in the midst of computations, either because the system is out of thermodynamic equilibrium or because storage times are very short

The Boltzmann factor roughly gives you the steady-state distribution of the associated two-state Markov chain, so if time delays are short it's possible this would be irrelevant. However, I think that in realistic devices the Markov chain reaches equilibrium far too quickly for you to get around the thermodynamic argument by appealing to the system being out of equilibrium.

My reasoning here is that the Boltzmann factor also gives you the odds of an electron having enough kinetic energy to cross the potential barrier upon colliding with it, so e.g. if you imagine an electron stuck in a potential well that's O(k_B T) deep, the electron will only need to collide with one of the barriers O(1) times to escape. So the rate of convergence to equilibrium comes down to the length of the well divided by the thermal speed of the electron, which is going to be quite rapid as electrons at the Fermi level in a typical wire move at speeds comparable to 1000 km/s.

I can try to calculate exactly what you should expect the convergence time here to be for some configuration you have in mind, but I'm reasonably confident when the energies involved are comparable to the Landauer bit energy this convergence happens quite rapidly for any kind of realistic device.
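
As a rough illustration of the kind of estimate I have in mind, here is a sketch of the crude model described above: the electron hits a barrier once every (well length)/(speed) seconds, and each collision lets it escape with probability equal to the Boltzmann factor. The 10 nm well length and the 300 K temperature are hypothetical values chosen for the example, the 1000 km/s speed is the figure quoted above, and the function name is mine.

```python
import math

K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def escape_time_seconds(barrier_ev, temperature_k, well_length_m, speed_m_per_s):
    """Expected time to escape a well when each collision succeeds with the Boltzmann factor."""
    collisions_per_second = speed_m_per_s / well_length_m
    escape_probability = math.exp(-barrier_ev / (K_B_EV_PER_K * temperature_k))
    return 1.0 / (collisions_per_second * escape_probability)


TEMPERATURE_K = 300.0   # hypothetical operating temperature
WELL_LENGTH_M = 10e-9   # hypothetical 10 nm well
SPEED_M_PER_S = 1e6     # ~1000 km/s, as quoted above

landauer_barrier_ev = K_B_EV_PER_K * TEMPERATURE_K * math.log(2)

for multiple in (1, 10, 40):
    t = escape_time_seconds(multiple * landauer_barrier_ev, TEMPERATURE_K,
                            WELL_LENGTH_M, SPEED_M_PER_S)
    print(f"barrier = {multiple:>2} x k_B T log(2): escape time ~ {t:.1e} s")
```

In this model a barrier at the Landauer bit energy is crossed within a few collisions, and the escape time doubles with every additional k_B T log(2) of barrier height.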

Comment by Ege Erdil (ege-erdil) on Transformative AGI by 2043 is <1% likely · 2023-06-13T09:22:07.631Z · LW · GW

First, I'm confused by your linkage between floating point operations and information erasure. For example, if we have two 8-bit registers (A, B) and multiply to get (A, B*A), we've done an 8-bit floating point operation without 8 bits of erasure. It seems quite plausible to me that the brain does 1e20 FLOPS but with a much smaller rate of bit erasures.

  • As a minor nitpick, if A and B are 8-bit floating point numbers then the multiplication map x -> B*x is almost never injective. This means even in your idealized setup, the operation (A, B) -> (A, B*A) is going to lose some information, though I agree that this information loss will be << 8 bits, probably more like 1 bit amortized or so.

  • The bigger problem is that logical reversibility doesn't imply physical reversibility. I can think of ways in which we could set up sophisticated classical computation devices which are logically reversible, and perhaps could be made approximately physically reversible when operating in a near-adiabatic regime at low frequencies, but the brain is not operating in this regime (especially if it's performing 1e20 FLOP/s). At high frequencies, I just don't see which architecture you have in mind to perform lots of 8-bit floating point multiplications without raising the entropy of the environment by on the order of 8 bits.

    Again using your setup, if you actually tried to implement (A, B) -> (A, A*B) on a physical device, you would need to take the register that is storing B and replace the stored value with A*B instead. To store 1 bit of information you need a potential energy barrier that's at least as high as k_B T log(2), so you need to switch ~ 8 such barriers, which means in any kind of realistic device you'll lose ~ 8 k_B T log(2) of electrical potential energy to heat, either through resistance or through radiation. It doesn't have to be like this, and some idealized device could do better, but GPUs are not idealized devices and neither are brains.
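
The non-injectivity point in the first bullet above can be checked by brute force. The sketch below is an illustration under a stated substitution: Python's standard library has no 8-bit float type, so it uses IEEE 754 half precision (16 bits, `struct` format `e`) as a stand-in, and the helper names are mine. It multiplies every finite half-precision value by a fixed multiplier, rounds back to half precision, and counts how many distinct bit patterns come out.

```python
import math
import struct


def half_from_bits(bits):
    """Decode a 16-bit pattern as an IEEE 754 half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def bits_from_float(x):
    """Round x to half precision and return its bit pattern; overflow maps to infinity."""
    try:
        return struct.unpack("<H", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values too large for half precision.
        return 0x7C00 if x > 0 else 0xFC00


# All finite half-precision values, including both signed zeros and subnormals.
FINITE_INPUTS = [
    x for x in (half_from_bits(b) for b in range(1 << 16)) if math.isfinite(x)
]


def distinct_outputs(multiplier):
    """Number of distinct half-precision results of multiplier * x over all finite x."""
    return len({bits_from_float(multiplier * x) for x in FINITE_INPUTS})


for multiplier in (1.0, 0.5, 3.0):
    distinct = distinct_outputs(multiplier)
    # Crude measure of information loss: log2(inputs / distinct outputs).
    lost_bits = math.log2(len(FINITE_INPUTS) / distinct)
    print(f"multiplier={multiplier}: {distinct} distinct outputs, ~{lost_bits:.2f} bits lost")
```

Multiplying by 1.0 is injective, while other multipliers produce collisions through rounding, underflow into subnormals, or overflow to infinity. The exact amount lost depends on the format and the multiplier, so this does not pin down the amortized figure for 8-bit floats.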

Ajeya Cotra estimates training could take anything from 1e24 to 1e54 floating point operations, or even more. Her narrower lifetime anchor ranges from 1e24 to 1e38ish.

Two points about that:

  1. This is a measure that takes into account the uncertainty over how much less efficient our software is compared to the human brain. I agree that human lifetime learning compute being around 1e25 FLOP is not strong evidence that the first TAI system we train will use 1e25 FLOP of compute; I expect it to take significantly more than that.

  2. Moreover, this is an estimate of effective FLOP, meaning that Cotra takes into account the possibility that software efficiency progress can reduce the physical computational cost of training a TAI system in the future. It was also in units of 2020 FLOP, and we're already in 2023, so just on that basis alone, these numbers should get adjusted downwards now.

Do you think Cotra's estimates are not just poor, but crazy as well?

No, because Cotra doesn't claim that the human brain performs 1e25 FLOP/s - her claim is quite different.

The claim that "the first AI system to match the performance of the human brain might require 1e25 FLOP/s to run" is not necessarily crazy, though it needs to be supported by evidence of the relative inefficiency of our algorithms compared to the human brain and by estimates of how much software progress we should expect to be made in the future.

Comment by Ege Erdil (ege-erdil) on The Dictatorship Problem · 2023-06-13T06:41:36.832Z · LW · GW

The unpopularity of the war in early 1917 is rather overstated. In fact, even after the fall of the Tsarist government, the war was so popular that before Lenin returned to Russia, Stalin felt it necessary to change the Bolshevik party line by endorsing Russia's continued participation in the war.

I agree that the chaotic conditions in 1917 Russia were essential for a minority to seize power, but similarly chaotic conditions could come to exist in many Western countries as well, perhaps as the result of a world war or economic transformation driven by AI.

Comment by Ege Erdil (ege-erdil) on The Dictatorship Problem · 2023-06-13T06:31:13.134Z · LW · GW

I don't have a source for this claim off the top of my head, but I've previously read that Germany was actually a net beneficiary of international financial transactions in the 1920s. Essentially, the flow of funds went like this:

  • Germany paid war reparations to the UK and France.
  • The UK and France paid off their war debts to the United States.
  • The United States made loans to Germany, and Germany defaulted on a good fraction of them.

It would be nice if someone could check whether this is true or not, but the impression I got from reading the history here is that the role of war reparations in causing fiscal problems for Germany was inflated by propaganda, especially by German politicians who tried to blackmail the Allies into lowering the amount of reparations to be paid by raising the specter of economic collapse in Germany.

Comment by Ege Erdil (ege-erdil) on The Dictatorship Problem · 2023-06-13T06:23:07.065Z · LW · GW

Nitpick: Erdogan's party won the 2002 elections, and Erdogan became Prime Minister in 2003. I'm not sure where you got the year 2004 from, but it's not correct.

Comment by Ege Erdil (ege-erdil) on The Dictatorship Problem · 2023-06-13T06:21:37.492Z · LW · GW

In Turkey's case, there have been many elections, but Erdogan always wins through a combination of mass arrests, media censorship, and sending his most popular opponent to prison for "insulting public officials"

You do know that Ekrem Imamoglu was not actually sent to jail, right? He was one of the vice-presidential candidates in the May 2023 election.

Your claims here also ignore the fact that before the May 2023 elections, betting markets expected Erdogan to lose. On Betfair, for example, Erdogan winning the presidential elections was trading at 30c to 35c. Saying that "of course Erdogan would win, he censors his critics and puts them in jail" is a good example of 20/20 hindsight. Can you imagine betting markets giving Putin a 30% chance to win a presidential election in Russia?

It's also not true that Erdogan always wins elections in Turkey. Erdogan's party used to have a majority of seats in the parliament, and over time their share of the vote diminished to the extent that now they don't anymore. To remain in power, Erdogan was compelled to ally with a Turkish nationalist party that had previously been one of his political enemies, and it's only this alliance that has a majority of seats in the parliament now. This also led to noticeable policy shifts in Erdogan's government, most notably when it comes to their attitude towards the Kurds.

It seems to me that you're getting your information from biased sources and your knowledge of the political situation in Turkey is only superficial.

Comment by Ege Erdil (ege-erdil) on The Dictatorship Problem · 2023-06-13T06:06:01.934Z · LW · GW

The problem with the "woke" movement (which doesn't want to be called that but also refuses to give itself any other name in a desperate and generally failed bid to market itself as the default) is instead exactly that it's this unpopular. It elicits fanatical loyalty in a minority and leaves everyone else baffled or repulsed...

I agree that a "woke dictatorship" (whatever that means) is unlikely to be established in most Western countries over the next ten years, but the unpopularity of the idea among the broader population is not very strong evidence that a dictatorship cannot be established on said principles. Just to provide one example, Bolshevik ideas were quite unpopular in Russia in February 1917, and yet Russia had become a dictatorship under the Bolsheviks by 1920.

Comment by Ege Erdil (ege-erdil) on The Dictatorship Problem · 2023-06-11T11:55:28.113Z · LW · GW

I guess my objection is that if you want to know what a modern fascist dictatorship might look like, "something like Putin's Russia" is probably a better answer than, say, 1930s Germany.

I agree that if most Western countries looked like today's Russia in ten years in some vague sense, I would count that as Alyssa Vance's prediction coming true. I'm not too sure which part of my comment this is meant to be an objection to.

About quantitative predictions, I'm not sure how to formalise the notion of electorate disempowerment, but more than "no elections" I would expect "a long string of elections won always by the same party with progressive disenfranchisement of those who would vote otherwise".

That's why I didn't list "no elections" as part of my list, but elections that are won by the same person every time with more than 70% of the vote for 20 years is already something I don't expect to happen in most Western countries in the medium-term future. I'm just not sure if this is what Alyssa Vance is actually referring to when she talks about a "fascist dictatorship".

Comment by Ege Erdil (ege-erdil) on The Dictatorship Problem · 2023-06-11T09:52:59.962Z · LW · GW

Sorry, but I don't think this comment addresses anything I've said. I don't even know how to respond to it.

Comment by Ege Erdil (ege-erdil) on The Dictatorship Problem · 2023-06-11T05:47:30.920Z · LW · GW

Yes, that is why I would want the post to be more precise. What does it mean to say "over the next decade, it is quite likely that most democratic Western countries will become fascist dictatorships"? If we had to write a Metaculus question or a prediction market on this claim, what would it look like?

I understand that it's not necessarily easy to do this, but especially when dealing with politics it's important to exercise this kind of cognitive discipline. For example, we could say Russia is closer to being a fascist dictatorship than the US for many different reasons:

  • The same leader has been in power, in one form or another, for over twenty years.
  • There is no serious organized political opposition to the leader. Putin won the last presidential election in 2018 with 77% of the popular vote. (Contrast this with Erdogan, who won this year with a slim margin of 52% against 48% in the runoff elections.)
  • People are routinely prosecuted and fined or put in prison for political speech that the government does not approve of. (By this metric, even a country like Germany fails the test more often than we might like to admit, but the US is particularly good on this dimension.)

Feel free to expand this list with more items.

Essentially, I want the post to take the vague concept of "fascist dictatorship" and turn it into some more easily falsifiable properties of a government. For instance, I'm happy to bet that over the next 30 years, no politician in the US will be elected President for more than two terms and that all elections in the US will be reasonably close in the popular vote. Those are more objective facts about the government, while whether it's a dictatorship or not is much more subjective.

I would be much more inclined to believe a moderate claim such as "it's moderately likely that most countries in the Western world will be relatively more fascist and relatively more dictatorial by some measurement." However, going from the present state of most Western countries to what I would consider a "fascist dictatorship" requires a huge effect size that I find extremely implausible over a ten-year period.

Comment by Ege Erdil (ege-erdil) on The Dictatorship Problem · 2023-06-11T05:05:56.565Z · LW · GW

I find it ironic that the author of a post warning of the risks of dictatorships becoming more widespread throughout the world has a moderation policy of "deleting anything they judge to be counterproductive".

Comment by Ege Erdil (ege-erdil) on The Dictatorship Problem · 2023-06-11T05:02:10.723Z · LW · GW

This post would greatly benefit from quantitative forecasts on precise claims that are at least in principle falsifiable.

I also strongly disagree with characterizing Erdogan as a dictator under the definition of "a strong leader who does not have to bother with parliament and elections". Perhaps under some "softer" definition you could classify him as a dictator, but that makes what I said above all the more important. What is a "dictator", and how do we know if we're in a world where "dictatorship" is becoming more widespread?

Comment by Ege Erdil (ege-erdil) on Transformative AGI by 2043 is <1% likely · 2023-06-10T14:28:28.935Z · LW · GW

I think you're just reading the essay wrong. In the "executive summary" section, they explicitly state that

"Our best anchor for how much compute an AGI needs is the human brain, which we estimate to perform 1e20–1e21 FLOPS."

"In addition, we estimate that today’s computer hardware is ~5 orders of magnitude less cost efficient and energy efficient than brains."

I don't know how you read those claims and arrived at your interpretation, and indeed I don't know how the evidence they provide could support the interpretation you're talking about. It would also be a strange omission to not mention the "effective" part of "effective FLOP" explicitly if that's actually what you're talking about.