There are two arguments frequently offered for a free market economy over a centrally planned economy: an argument based around knowledge, sometimes called the socialist calculation problem; and another argument based on incentives. The arguments can be briefly summarized like so:
- A central planning authority would not have enough knowledge to efficiently direct economic activity.
- A central planning authority would not have the right incentives to ensure that their direction was efficient.
A point I've not seen anyone else make is that the argument from knowledge is really itself an argument from incentives, in the following sense: the sensory and computational capabilities of human civilization are naturally distributed among individual humans who have a high degree of agency over their own actions. An efficient planner ought to leverage this whole base of data and compute when making decisions, but this requires giving each individual human an incentive to participate in this distributed computing process.
The limited bandwidth of human communication (on the order of bytes per second) compared to human computational power (on the order of 1e15 ops per second for the brain) means that setting up such a distributed computing scheme requires most decisions to be made locally. This leaves many opportunities for individual participants to shirk the duties that would be assigned to them by an economic planner, not only through the work-effort channel (where shirking is more obvious in many industries and can be cracked down on using coercion) but also by falsifying the results of local computations.
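To put rough numbers on that gap (these are just the order-of-magnitude figures from the paragraph above, not measurements):

```python
# Rough orders of magnitude used in the argument above (assumptions, not measurements).
speech_bandwidth_bytes_per_s = 10    # human communication: on the order of bytes per second
brain_compute_ops_per_s = 1e15       # rough estimate of the brain's computational throughput

# Operations performed locally for every byte that could be communicated outward.
ops_per_communicated_byte = brain_compute_ops_per_s / speech_bandwidth_bytes_per_s
print(f"~{ops_per_communicated_byte:.0e} local operations per byte of outbound bandwidth")
# At ~1e14 operations per communicated byte, almost all computation has to stay local:
# a central planner only ever sees a minuscule, lossy summary of what was computed.
```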
So the knowledge problem for the central planner can also be understood as an incentive problem for the participants in the centrally planned economy. The free market gets around this problem by enabling each person or group of people to profit from inefficiencies they find in the system, thereby incentivizing them to contribute to the aggregate economic optimization task. The fact that individual optimizations can be made locally without the need for approval from a central authority means less pressure is put on the scarce communication bandwidth available to the economy, which is reserved for the transmission of important information. While the price mechanism plays a significant role here as would be argued by e.g. Hayekians, compressed information about what drives changes in prices can be just as important.
This brings up another important point, which is that a lot of externalities are impossible to calculate, and therefore such approaches end up fixating on the part that seems calculable without accounting for (or even noticing) the incalculable part. If the calculable externalities happen to be opposed to larger incalculable externalities, then you can end up worse off than if you had never tried.
I think this is correct as a conditional statement, but I don't think one can deduce the unconditional implication that attempting to price some externalities in domains where many externalities are difficult to price is generally bad.
As applied to the gun externality question, you could theoretically offer a huge payday to the gun shop that sold the firearm used to stop a spree shooting in progress, but you still need a body to count before paying out.
The nice feature of positive payments by the government (instead of fines, i.e. negative payments by the government) is that the judgment-proof defendant problem goes away, so there's no reason to actually make these payments to the gun shop at all: you can just directly pay the person who stops the shooting, which probably provides much better incentives to be a Good Samaritan without the shop trying to pass along this incentive to gun buyers.
I think this applies well to AI, because absent a scenario where gray goo rearranges everyone into paperclips (in which case everyone pays with their life anyway), a lot of the benefits and harms are likely to be illegible. If AI chatbots end up swaying the next election, what is the dollar value we need to stick on someone? How do we know if it's even positive or negative, or if it even happened? If we latch onto the one measurable thing, that might not help.
I don't agree that most of the benefits of AI are likely to be illegible. I expect plenty of them to take the form of new consumer products that were not available before, for example. "A lot of the benefits" is a weaker phrasing and I don't quite know how to interpret it, but I thought it's worth flagging my disagreement with the adjacent phrasing I used.
In general, I don't agree with arguments of the form "it's difficult to quantify the externalities, so we shouldn't quantify anything and should just ignore all external effects", modulo concerns about public choice ("what if the policy pursued is not what you would recommend but some worse alternative?"), which are real and serious, though out of the scope of my argument. There's no reason a priori to suppose that the positive or negative effects not currently priced will be of the same order of magnitude as the ones we can price.
If you think there are benefits to having a population where most people own guns that are not going to be captured by the incentives of individuals who purchase guns for their own purposes, it's better to try to estimate what that effect size is and then provide appropriate incentives to people who want to purchase guns. The US government pursues such policies in other domains: for example, one of the motivations that led to the Jones Act was the belief that the market would not assign sufficient value to the US maintaining a large domestic shipbuilding industry in peacetime.
In addition, I would dispute that some of these are in fact external effects by necessity. You can imagine some of them being internalized, e.g. by governments offering rewards to citizens who prevent crime (which gives an extra incentive to such people to purchase guns as it would make their interventions more effective). Even the crime prevention benefit could be internalized to a great extent by guns being sold together with a kind of proof-of-ownership that is hard to counterfeit, similar to the effect that open carry policies have in states which have them.
There's a more general public choice argument against this kind of policy, which is that governments lack the incentives to actually discover the correct magnitude of the externalities and then intervene in the appropriate way to maximize efficiency or welfare. I think that's true in general, and in the specific case of guns it might be a reason to not want the government to do anything at all, but in my opinion that argument becomes less compelling when the potential harms of a technology are large enough.
If the risk is sufficiently high, then the shops would simply not sell guns to anyone who seemed like they might let their guns be stolen, for example. Note that the shops would still be held liable for any harm that occurs as a result of any gun they have sold, irrespective of whether the buyer was also the perpetrator of the harm.
In practice, the risk of a gun sold to a person with a safe background being used in such an act is probably not that large, so such a measure doesn't need to be taken: the shop can just sell the guns at a somewhat inflated price to compensate for the risk of the gun being misused in some way, and this is efficient. If you were selling e.g. nuclear bombs instead of guns, then you would demand any prospective buyer meet a very high standard of safety before selling them anything, as the expected value of the damages in this case would be much higher.
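As a toy illustration of that pricing logic, with entirely hypothetical numbers for the misuse probability and the damages:

```python
# Hypothetical numbers purely to illustrate the expected-liability pricing logic above.
p_misuse = 1e-4           # assumed probability that a given sale leads to harm the shop is liable for
expected_damages = 5e6    # assumed damages awarded in that case, in dollars

# A risk-neutral shop under strict liability just folds the expected cost into the price.
markup = p_misuse * expected_damages
print(f"Expected-liability markup per sale: ${markup:,.0f}")

# For a vastly more dangerous product, the same formula makes the markup prohibitive,
# which pushes the seller toward screening buyers instead of pricing the risk.
print(f"Markup in the 'nuclear bomb' case: ${p_misuse * 1e12:,.0f}")
```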
The police arresting people who steal guns does nothing to fix the problem of shootings if the gun is used shortly after it is stolen, and police are not very good at tracking down stolen items to begin with, so I don't understand the point of your example.
Open source might be viable if it's possible for the producers to add safeguards into the model that cannot be trivially undone by cheap fine-tuning, but yeah, I would agree with that given the current lack of techniques for doing this successfully.
The shop has the ability to invest more in security if they will be held liable for subsequent harm. They can also buy insurance themselves and pass on the cost to people who do purchase guns legally as an additional operating expense.
It is not a tautology.
Can you explain to me the empirical content of the claim, then? I don't understand what it's supposed to mean.
About the rest of your comment, I'm confused about why you're discussing what happens when both chess engines and humans have a lot of time to do something. For example, what's the point of this statement?
My understanding is that it is not true that if you ran computers for a long time that they would beat the human also running for a long time, and that historically, it's been quite the opposite...
I don't understand how this statement is relevant to any claim I made in my comment. Humans beating computers at equal time control is perfectly consistent with the computers being slower than humans. If you took a human and slowed them down by a factor of 10, that's the same pattern you would see.
Are you instead trying to find examples of tasks where computers were beaten by humans when given a short time to do the task but could beat the humans when given a long time to do the task? That's a very different claim from "in every case where we've successfully gotten AI to do a task at all, AI has done that task far far faster than humans".
Yes, that's what I'm trying to say, though I think in actual practice the numbers you need would have been much smaller for the Go AIs I'm talking about than they would be for the naive tree search approach.
Sure, but in that case I would not say the AI thinks faster than humans, I would say the AI is faster than humans at a specific range of tasks where the AI can do those tasks in a "reasonable" amount of time.
As I've said elsewhere, there is a quality or breadth vs serial speed tradeoff in ML systems: a system that only does one narrow and simple task can do that task at a high serial speed, but as you make systems more general and get them to handle more complex tasks, serial speed tends to fall. The same logic that people are using to claim GPT-4 thinks faster than humans should also lead them to think a calculator thinks faster than GPT-4, which is an unproductive way to use the one-dimensional abstraction of "thinking faster vs. slower".
You might ask "Well, why use that abstraction at all? Why not talk about how fast the AIs can do specific tasks instead of trying to come up with some general notion of if their thinking is faster or slower?" I think a big reason is that people typically claim the faster "cognitive speed" of AIs can have impacts such as "accelerating the pace of history", and I'm trying to argue that the case for such an effect is not as trivial to make as some people seem to think.
True, but isn't this almost exactly analogously true for neuron firing speeds? The corresponding period for neurons (10 ms - 1 s) does not generally correspond to the timescale of any useful cognitive work or computation done by the brain.
Yes, which is why you should not be using that metric in the first place.
But even the top-line number is (at least theoretically) a very concrete measure of something that you can actually get out of the system. In contrast, when used in "computational equivalence" estimates of the brain, FLOP/s are (somewhat dubiously, IMO) repurposed as a measure of what the system is doing internally.
Will you still be saying this if future neural networks are running on specialized hardware that, much like the brain, can only execute forward or backward passes of a particular network architecture? I think talking about FLOP/s in this setting makes a lot of sense, because we know the capabilities of neural networks are closely linked to how much training and inference compute they use, but maybe you see some problem with this also?
So even if the 1e15 "computational equivalence" number is right, AND all of that computation is irreducibly a part of the high-level cognitive algorithm that the brain is carrying out, all that means is that it necessarily takes at least 1e15 FLOP/s to run or simulate a brain at neuron-level fidelity. It doesn't mean that you can't get the same high-level outputs of that brain through some other much more computationally efficient process.
I agree, but even if we think future software progress will enable us to get a GPT-4 level model with 10x smaller inference compute, it still makes sense to care about what inference with GPT-4 costs today. The same is true of the brain.
Separately, I think your sequential tokens per second calculation actually does show that LLMs are already "thinking" (in some sense) several OOM faster than humans? 50 tokens/sec is about 5 lines of code per second, or 18,000 lines of code per hour. Setting aside quality, that's easily 100x more than the average human developer can usually write (unassisted) in an hour, unless they're writing something very boilerplate or greenfield.
Yes, but they are not thinking 7 OOM faster. My claim is not that AIs can't think faster than humans; indeed, I think they can. However, current AIs are not thinking faster than humans when you take into account the "quality" of the thinking as well as the rate at which it happens, which is why I think FLOP/s is a more useful measure here than token latency. GPT-4 has higher token latency than GPT-3.5, but I think it's fair to say that GPT-4 is the model that "thinks faster" when asked to accomplish some nontrivial cognitive task.
The main issue with current LLMs (which somewhat invalidates this whole comparison) is that they can pretty much only generate boilerplate or greenfield stuff. Generating large volumes of mostly-useless / probably-nonsense boilerplate quickly doesn't necessarily correspond to "thinking faster" than humans, but that's mostly because current LLMs are only barely doing anything that can rightfully be called thinking in the first place.
Exactly, and the empirical trend is that there is a quality-token latency tradeoff: if you want to generate tokens at random, it's very easy to do that at extremely high speed. As you increase your demands on the quality you want these tokens to have, you must take more time per token to generate them. So it's not fair to compare a model like GPT-4 to the human brain on grounds of "token latency": I maintain that throughput comparisons (training compute and inference compute) are going to be more informative in general, though software differences between ML models and the brain can still make it not straightforward to interpret those comparisons.
Sure, but from the point of view of per token latency that's going to be a similar effect, no?
I think you might have accidentally linked to your comment instead of the LessWrong post you intended to link to.
Don't global clock speeds have to go down as die area goes up due to the speed of light constraint?
For instance, if you made a die with 1e15 MAC units and the area scaled linearly, you would be looking at a die that's ~ 2e9 times larger than H100's die size, which is about 1000 mm^2. The physical dimensions of such a die would be around 2 km^2, so the speed of light would limit global clock frequencies to something on the order of c/(1 km) ~= 300 kHz, which is not 1 million times faster than the 1 kHz you attribute to the human brain. If you need multiple round trips for a single clock, the frequencies will get even lower.
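Here is the same back-of-the-envelope calculation spelled out, using the rough figures above (the ~1000 mm^2 H100 die, the 2e9 scale factor, and a single one-way light crossing per global clock):

```python
import math

h100_die_area_mm2 = 1000     # approximate H100 die area used above
scale_factor = 2e9           # assumed ratio of 1e15 MAC units to an H100's MAC count
c = 3e8                      # speed of light, m/s

die_area_km2 = h100_die_area_mm2 * scale_factor * 1e-12   # mm^2 -> km^2
side_length_km = math.sqrt(die_area_km2)

# One-way light travel across the die bounds any global clock frequency.
max_global_clock_hz = c / (side_length_km * 1e3)

print(f"die area ~ {die_area_km2:.1f} km^2, side length ~ {side_length_km:.1f} km")
print(f"speed-of-light bound on a global clock ~ {max_global_clock_hz / 1e3:.0f} kHz")
print(f"that is ~{max_global_clock_hz / 1e3:.0f}x a 1 kHz 'brain clock', far short of 1e6x")
```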
Maybe when the clock frequencies get this low, you're dissipating so little heat that you can go 3D without worrying too much about heating issues and that buys you something. Still, your argument here doesn't seem that obvious to me, especially if you consider the fact that one round trip for one clock is extremely optimistic if you're trying to do all MACs at once. Remember that GPT-3 is a sequential model; you can't perform all the ops in one clock because later layers need to know what the earlier layers have computed.
Overall I think your comment here is quite speculative. It may or may not be true, I think we'll see, but people shouldn't treat it as if this is obviously something that's feasible to do.
I think counterexamples are easy to find. For example, chess engines in 1997 could play at the level of top human chess players on consumer hardware, but only if they were given orders of magnitude more time to think than the top humans had available. Around 1997 Deep Blue was of a similar strength to Kasparov, but it had to run on a supercomputer; on commercial hardware chess engines were still only 2400-2500 elo. If you ran them for long enough, though, they would obviously be stronger than even Deep Blue was.
I think the claim that "in every case where we've successfully gotten AI to do a task at all, AI has done that task far far faster than humans" is a tautology because we only say we've successfully gotten AI to do a task when AI can beat the top humans at that task. Nobody said "we got AI to play Go" when AI Go engines were only amateur dan strength, even though they could have equally well said "we got AI to play Go at a superhuman level but it's just very slow".
A non-tautological version might say that the decrease over time in the compute multiplier the AIs need to compete with the top humans is steep, so it takes a short time for the AIs to transition from "much slower than humans" to "much faster than humans" when they are crossing the "human threshold". I think there's some truth to this version of the claim but it's not really due to any advanced serial speed on the part of the AIs.
If there are people who say “current AIs think many orders of magnitude faster than humans”, then I agree that those people are saying something kinda confused and incoherent, and I am happy that you are correcting them.
Eliezer himself has said (e.g. in his 2010 debate with Robin Hanson) that one of the big reasons he thinks CPUs can beat brains is because CPUs run at 1 GHz while brains run at 1-100 Hz, and the only barrier is that the CPUs are currently running "spreadsheet algorithms" and not the algorithm used by the human brain. I can find the exact timestamp from the video of the debate if you're interested, but I'm surprised you've never heard this argument from anyone before.
There’s a different claim, “we will sooner or later have AIs that can think and act at least 1-2 orders of magnitude faster than a human”. I see that claim as probably true, although I obviously can’t prove it.
I think this claim is too ill-defined to be true, unfortunately, but insofar as it has the shape of something I think will be true it will be because of throughput or software progress and not because of latency.
I agree that the calculation “1 GHz clock speed / 100 Hz neuron firing rate = 1e7” is not the right calculation (although it’s not entirely irrelevant). But I am pretty confident about the weaker claim of 1-2 OOM, given some time to optimize the (future) algorithms.
If the claim here is that "for any task, there will be some AI system using some unspecified amount of inference compute that does the task 1-2 OOM faster than humans", I would probably agree with that claim. My point is that if this is true, it won't be because of the calculation “1 GHz clock speed / 100 Hz neuron firing rate = 1e7”, which as far as I can tell you seem to agree with.
As far as I know, in every case where we've successfully gotten AI to do a task at all, AI has done that task far far faster than humans. When we had computers that could do arithmetic but nothing else, they were still much faster at arithmetic than humans. Whatever your view on the quality of recent AI-generated text or art, it's clear that AI is producing it much much faster than human writers or artists can produce text/art.
"Far far faster" is an exaggeration that conflates vastly different orders of magnitude with each other. When compared against humans, computers are many orders of magnitude faster at doing arithmetic than they are at generating text: a human can write perhaps one word per second when typing quickly, while an LLM's serial speed of 50 tokens/sec maybe corresponds to 20 words/sec or so. That's just a ~ 1.3 OOM difference, to be contrasted with 10 OOMs or more at the task of multiplying 32-bit integers, for instance. Are you not bothered at all by how wide the chasm between these two quantities seems to be, and whether it might be a problem for your model of this situation?
In addition, we know that this could be faster if we were willing to accept lower quality outputs, for example by having fewer layers in an LLM. There is a quality-serial speed tradeoff, and so ignoring quality and just looking at the speed at which text is generated is not a good thing to be doing. There's a reason GPT-3.5 has smaller per token latency than GPT-4.
Yes, this summary seems accurate.
I thought cryonics was unlikely to work because a bunch of information might be lost even at the temperatures that bodies are usually preserved in. I now think this effect is most likely not serious and cryonics can work in principle at the temperatures we use, but present-day cryonics is still unlikely to work because of how much tissue damage the initial process of freezing can do.
As I said, I think it's not just that the language is poetic. There is an implicit inference that goes like
1. People who would not voluntarily undergo surgery without long-term adverse effects on their health to improve the life of a stranger are evil.
2. Most researchers who would be in a position to know the state of the evidence on the long-term adverse health effects of kidney donation don't personally donate one of their kidneys.
3. Most researchers are unlikely to be evil.
4. So it's unlikely that most researchers believe kidney donation has no long-term adverse health effects.
I'm saying that there is no definition of the word "evil" that makes statements (1) and (3) simultaneously true. Either you adopt a narrow definition, in which case (3) is true but (1) is false; or you adopt a broad definition, in which case (1) is true but (3) is false.
This is not a point about stylistic choices, it's undermining one of the key arguments the post offers for its position. The post is significantly stronger if it can persuade us that even established experts in the field agree with the author and the hypothesis being advanced is in some sense "mainstream", even if it's implicitly held.
I don't think it's a matter of poetic license. You're making an empirical claim that if specialists actually believed kidney donation had no long-term side effects, they would be lining up to donate their kidneys and we would see a much higher rate of kidney donations in the US. I think this claim is wrong because the inconvenience of surgery is substantial enough to block people from donating their kidneys even in the absence of long-term side effects.
The use of the word "evil" sneaks in an assumption that most people would be happy to make this tradeoff to improve a stranger's life at the cost of some inconvenience to themselves, but I think this claim is actually false. So the fact that this doesn't happen gives very little evidence that specialists don't take claims about the small long-term health effects of kidney donation seriously.
Have most of the researchers looking at kidney donation donated a kidney? Have most nephrology researchers donated a kidney? Most surgeons doing kidney transplants? Obviously not, otherwise we'd have more than 200 donations to strangers each year in the US. There are 10,000 board-certified nephrologists, and a few hundred more are added each year; if they took this data seriously they'd all donate.
Heck, on top of those you can add nephrology researchers, the medical statisticians who happen to focus on kidney disease, transplant surgeons, and all well-informed nurses in the nephrology units… thousands of these specialists are created each year. If most of them believed donation to be essentially safe, the shortage of kidneys would be half-solved.
Maybe they are all evil people? They will not take even a marginal risk to save a life.
We usually don't call people "evil" for not inconveniencing themselves by going through surgery, so you seem to be using this word in a fairly non-standard way here.
Just to elaborate: if I had a condition that could be cured either by having an operation equivalent to a kidney donation or paying $3k, I would almost certainly pay $3k. However, I could likely save a statistical life by donating this $3k to an effective charity. So my not donating my kidney to a stranger provides no more evidence of my evil nature than my not donating $3k to save the life of a random stranger, because I price the inconvenience of the surgery at more than $3k even if the surgery has no long-run consequences for my health.
There is more data, and better data, e.g. data gathered in double-blinded RCTs, that shows things like:
- Homeopathy works very well for a variety of conditions, sometimes better than real drugs used to treat them.
- Increasing the healthcare budget and the amount of healthcare people receive, both in rich countries (e.g. the USA) and poor ones (e.g. India), has no effect on mortality.
I can make both of these claims based on many individual RCTs, as well as based on the aggregation of all existing RCTs.
I’m not saying that these claims make sense, they don’t, there are critical lenses through which we analyze research. But if you claim to “just follow the data”, and ignore the issue of data quality, selection bias, and fraud… without applying a critical lens, you are lost.
It seems to me like claim (2) could easily make sense if you interpret it more charitably as "the mortality effects are too small for the studies to detect". I don't have a particularly strong prior that marginal healthcare spending is all that useful for increasing life expectancy - diminishing returns can mean that the average dollar spent on healthcare does much more than the marginal dollar.
Can you justify your claim that (2) does not make sense?
I don't think those ratings are comparable. On the other hand, my estimate of 3d was apparently lowballing it based on some older policy networks, and newer ones are perhaps as strong as 4d to 6d, which on the upper end is still weaker than professional players but not by much.
However, there is a big gap between weak professional players and "grandmaster level", and I don't think the raw policy network of AlphaGo could play competitively against a grandmaster level Go player.
This is not quite true. Raw policy networks of AlphaGo-like models are often at a level around 3 dan in amateur rankings, which would qualify as a good amateur player but nowhere near the equivalent of grandmaster level. If you match percentiles in the rating distributions, 3d in Go is perhaps about as strong as an 1800 elo player in chess, while "master level" is at least 2200 elo and "grandmaster level" starts at 2500 elo.
Edit: Seems like policy networks have improved since I last checked these rankings, and the biggest networks currently available for public use can achieve a strength of possibly as high as 6d without MCTS. That would be somewhat weaker than a professional player, but not by much. Still far off from "grandmaster level" though.
I think you're ignoring the qualifier "literally portrayed" in Matthew's sentence, and neglecting the prior context that he's talking about AI development being something mainly driven forward by hobbyists with no outsized impacts.
He's talking about more than just the time in which AI goes from e.g. doubling the AI software R&D output of humans to some kind of singularity. The specific details Eliezer has given about this scenario have not been borne out: for example, in his 2010 debate with Robin Hanson, he emphasized a scenario in which a few people working in a basement and keeping all of their insights secret hit upon some key software innovation that enables their piece of consumer hardware to outcompete the rest of the world.
It's worth noting that Robin Hanson also said that "takeoff" is most likely to take months. He just said it for ems, and in his world, that rate of growth was being driven by the entire world economy working as a whole rather than one local part of the world having such better software that it could outcompete everyone else with vastly less material resources. I find you saying this is a "mild win" for Eliezer's prediction incomprehensible given that we live in a world where individual AI labs are being valued at ~ $100B and raising tens of billions of dollars in capital.
I assume John was referring to Unitary Evolution Recurrent Neural Networks which is cited in the "Orthogonal Deep Neural Nets" paper.
It might be right, I don't know. I'm just making a local counterargument without commenting on whether the 2.5 PB figure is right or not, hence the lack of endorsement. I don't think we know enough about the brain to endorse any specific figure, though 2.5 PB could perhaps fall within some plausible range.
While I wouldn't endorse the 2.5 PB figure itself, I would caution against this line of argument. It's possible for your brain to contain plenty of information that is not accessible to your memory. Indeed, we know of plenty of such cognitive systems in the brain whose algorithms are both sophisticated and inaccessible to any kind of introspection: locomotion and vision are two obvious examples.
I downvoted this comment for its overconfidence.
First of all, the population numbers are complete garbage. This is completely circular. You are just reading out the beliefs about history used to fabricate them. The numbers are generated by people caring about the fall of Rome. The fall of Rome didn't cause a decline in China. Westerners caring about the fall of Rome caused the apparent decline in China.
I will freely admit that I don't know how population numbers are estimated in every case, but your analysis of the issue is highly simplistic. Estimates for population decline do not just depend on vague impressions of the significance of grand historical events such as the fall of Rome. Archaeological evidence, estimates of crop yields with contemporary technology on available farmland, surviving records from the time, etc. are all used in forming population estimates.
It's far from being reliable, but what we know seems clear enough that I would give something like 80% to 90% chance that the first millennium indeed had slower population growth than the first millennium BC. You can't be certain with such things, but I also don't agree that the numbers are "complete garbage" and contain no useful information.
Second, there was a tremendous scientific and technological regress in Rome. Not caused by the fall of Rome, but the rise of Rome. There was a continual regress in the Mediterranean from 150BC to at least 600AD. Just look at a list of scientists: it has a stark gap 150BC-50AD.
I think you're conflating a lack of progress with regression here. I remark in the post that the slowdown in population growth seems to have begun around 200 BC, which is consistent with what you're saying here if you take it as a statement about growth rates and not about levels. If the pace of new discoveries slows down, that would appear to us as fewer notable scientists as well as slower growth in population, sizes of urban centers, etc.
Aside from that, there are also many alternative explanations of a gap in a list of scientists, e.g. that Rome was comparatively less interested in funding fundamental research compared to the Hellenistic kingdoms. Progress in fundamental sciences doesn't always correlate so well with economic performance; e.g. the USSR was much better at fundamental science than their economic performance would suggest.
It is more controversial to say that the renaissance 50AD-150AD is a pale shadow of the Hellenistic period, but it is. In 145BC Rome fomented a civil war in Egypt, destroying Alexandria, the greatest center of learning. In 133BC, the king of Pergamon tried to avoid this fate by donating the second center of learning. It was peaceful, but science did not survive.
I don't know what you're referring to by "Rome fomented a civil war in Egypt in 145 BC". 145 BC is when Ptolemy VI died; but as far as I know, there was no single "civil war" following his death, Alexandria was not destroyed, and Rome was not involved directly in Egyptian politics for a long time to come. Alexandria remained one of the major urban centers of the Mediterranean until the 3rd century AD - perhaps even the largest one.
Well, that's true, but at some level, what else could it possibly be? What other cause could be behind the long-run expansion in the first place, so many millennia after humans spanned every continent but Antarctica?
Technological progress being responsible for the long-run trend doesn't mean you can attribute local reversals to humans hitting limits to technological progress. Just as a silly example, the emergence of a new strain of plague could have led to the depopulation of urban centers, which lowers R&D efficiency because you lose concentrations of people working together, and thus lowers the rate of technological progress. I'm not saying this is what actually happened, but it seems like a possible story to me.
I'm very skeptical about explanations involving wars and plagues, except insofar as those impact technological development and infrastructure, because a handful of generations is plenty to get back to the Malthusian limit even if a majority of the population dies in some major event (especially regional events where you can then also get migration or invasion from less affected regions).
I agree, but why would you assume wars and plagues can't impact technological development and infrastructure?
McEvedy and Jones actually discuss a regional breakdown in the final section of the book, but they speculate too much for the discussion to be useful, I think. They attribute any substantial slowdown in growth rates to population running up against technological limits, which seems like a just-so story that could explain anything.
They note that the 3rd century AD appears to have been a critical time, as it's when population growth trends reversed in both Europe and China at around the same time: in Europe with the Crisis of the Third Century, and in China with the fall of the reconstituted Han dynasty and the beginning of the Three Kingdoms period. They attribute this to technological constraints, which seems like an unsupported assertion to me.
The other important population center is India, where we have very few records compared to Europe and China. Datasets generally report naively extrapolated smooth curves for the Indian population before the modern period, and that's because there really isn't much else to do due to the scarcity of useful information. This doesn't mean that we actually expect population growth in India to have been smooth, just that in the absence of more information our best guess for each date should probably be a smoothly increasing function of the date. As McEvedy and Jones put it, "happy is the graph that has no history".
I agree that locations isolated from Eurasia would most likely not show the same population trends, but Eurasia was ~ 75% of the world's population in the first millennium and so events in Eurasia dominate what happens to the global population.
I've actually written about this subject before, and I agree that the first plague pandemic could have been significant: perhaps killing around 8% of the global population in the four years from 541 to 544. However, it's also worth noting that our evidence for this decline is rather scant; we know that the death toll was very high in Constantinople but not much about what happened outside the capital, mostly because nobody was there to write it down. So it's also entirely conceivable that the death toll was much lower than this. The controversy about this continues to this day in the literature, as far as I know.
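For scale, with a mid-6th-century world population of very roughly 200 million (an assumed round figure), that 8% would mean:

```python
# Rough implied death toll if the 541-544 pandemic killed ~8% of the world population.
# The population figure is an assumed round number for illustration.
global_population_541 = 200e6
death_fraction = 0.08
print(f"Implied deaths, 541-544: ~{global_population_541 * death_fraction / 1e6:.0f} million")
```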
The hypothesis that the bubonic plague was responsible is interesting, but by itself doesn't explain the more granular data which suggests the slowdown starts around 200 BC and we already see close to no growth in global population from e.g. 200 AD to 500 AD. HYDE doesn't have this, but the McEvedy and Jones dataset does.
It's possible, and perhaps even likely, that the explanation is not monocausal. In this case, the first plague pandemic could have been one of the many factors that dragged population growth down throughout the first millennium.
In the west, I think the fall of the Western Roman Empire was probably a significant hit, and caused a major setback in economic growth in Europe.
Attribution of causality is tricky with this event, but I would agree if you said the fall coincided with a major slowdown in European economic growth.
China had its bloody Three Kingdom period, and later the An Lushan rebellion.
I think a problem re: China is that a lot of population decline estimates for China are based on the official census, and as far as I know China didn't have a formal census before the Xin dynasty, and certainly not before unification in the 3rd century BC. So the fact that we don't see comparable population declines reported may just be an artifact of that measurement issue. We certainly see plenty of them in the second millennium.
There was the Muslim conquest of the Mediterranean, Persia and Pakistan, though I don't know if that was unusually bloody.
I haven't seen estimates of this that put it anywhere near the Mongol conquests, so I would assume not particularly bloody relative to what was to come later. I would also guess that the Islamic world probably saw significant population growth around that time.
These might be small fluctuations in the grand scheme of things or add up to a period of enough turmoil and strife in the most populous regions of the world to slow growth down.
Yeah, it's possible that this is the explanation, but if so it's rather hard to know because there's no principled way to compare events like these to analogs in other time periods.
Yeah, that's right. Fixed.
If people vote as if their individual vote determines the vote of a non-negligible fraction of the voter pool, then you only need (averaged over the whole population, so the value of the entire population is instead of , which seems much more realistic.
So voting blue can make sense for a sufficiently large coalition of "ordinary altruists" with who are able to pre-commit to their vote and think people outside the coalition might vote blue by mistake etc. rather than the "extraordinary altruists" we need in the original situation with . Ditto if you're using a decision theory where it makes sense to suppose such a commitment already exists when making your decision.
That would be questioning the assumption that your cost function as an altruist should be linear in the number of lives lost. I'm not sure why you would question this assumption, though; it seems rather unnatural to make this a concave function, which is what you would need for your logic to work.
I'm surprised by how much this post is getting upvoted. It gives us essentially zero information about any question of importance, for reasons that have already been properly explained by other commenters:
- Chess is not like the real world in important respects. What the threshold is for material advantage such that a 1200 elo player could beat Stockfish at chess tells us basically nothing about what the threshold is for humans, either individually or collectively, to beat an AGI in some real-world confrontation. This point is so trivial that I feel somewhat embarrassed to be making it, but I have to think that people are just not getting the message here.
- Even focusing only on chess, the argument here is remarkably weak because Stockfish is not a system trained to beat weaker opponents with piece odds. There are Go AIs that have been trained for this kind of thing, e.g. KataGo can play reasonably well in positions with a handicap if you tell it that its opponent is much weaker than itself. In my experience, KataGo running on consumer hardware can give the best players in the world 3-4 stones and have an even game.
If someone could try to convince me that this experiment was not pointless and actually worth running for some reason, I would be interested to hear their arguments. Note that I'm more sympathetic to "this kind of experiment could be valuable if run in the right environment", and my skepticism is specifically about running it for chess.
Are neural networks trained using reinforcement learning from human feedback in a sufficiently complex environment biased towards learning the human simulator or the direct translator, in the sense of the ELK report?
I think there are arguments in both directions and it's not obvious which solution a neural network would prefer if trained in a sufficiently complex environment. I also think the question is central to how difficult we should expect aligning powerful systems trained in the current paradigm to be.
I'm curious if these rate limits were introduced as a consequence of some recent developments. Has the website been having more problems with spam and low-quality content lately, or has the marginal benefit of making these changes gone up in some other way?
It could also be that you had this idea only recently and in retrospect it had been a good idea for a long time, of course.
Yes, in practice having a model of what is actually driving the correlations can help you do better than these estimates. A causal model would be helpful for that.
The product estimate for the expected correlation is only useful in a setting where nothing else is known about the relationship between the three variables than the two correlations, but in practice you often have some beliefs about what drives the correlations you observe, and if you're a good Bayesian you should of course also condition on all of that.
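As a quick sanity check of the product estimate in the special case where the first and third variables are related only through the middle one (a jointly Gaussian toy model of my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
r_ab, r_bc = 0.6, 0.5    # target correlations corr(A, B) and corr(B, C)

# Toy model in which A and C are related only through B (jointly Gaussian).
b = rng.standard_normal(n)
a = r_ab * b + np.sqrt(1 - r_ab**2) * rng.standard_normal(n)
c = r_bc * b + np.sqrt(1 - r_bc**2) * rng.standard_normal(n)

print("corr(A, C) observed:", round(np.corrcoef(a, c)[0, 1], 3))
print("product estimate:   ", r_ab * r_bc)
```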
That's a reasonable picture to have in expectation, yeah.
As an aside, I've tried to work out what the optimal learning rate for a large language model should be based on the theory in the post, and if I'm doing the calculations correctly (which is a pretty big if) it doesn't match actual practice very well, suggesting there is actually something important missing from this picture.
Essentially, the coefficient multiplying the loss inside the exponential should be $2B/(\eta \sigma^2)$, where $\sigma^2$ is the variance of the per-parameter noise in SGD, $\eta$ is the learning rate, and $B$ is the batch size. If you have a learning rate $\eta$, you scale the objective you're optimizing by a factor $\eta$ and the noise variance by a factor $\eta^2$. Likewise, a bigger batch size lowers the noise variance by a linear factor. So the equilibrium distribution ends up proportional to

$$ \exp\left( - \frac{2B \, L(\theta)}{\eta \sigma^2} \right) $$

where $L$ is the per-token average loss and $\sigma^2$ should be equal to the mean square of the partial derivative of the per-token loss function with respect to one of the neural network parameters. If the network is using some decent batch or layer normalization this should probably be on the order of $L/n$, where $n$ is the model size.

We want what's inside the exponential to just be $-D \, L(\theta)$ for a dataset of $D$ tokens, because we want the learning to be equivalent to doing a Bayesian update over the whole data. This suggests we should pick

$$ \eta \approx \frac{2B}{D \sigma^2} \approx \frac{2Bn}{DL}, $$

which is a pretty bad prediction. So there's probably something important that's being left out of this model. I'm guessing that a smaller learning rate just means you end up conditioning on minimum loss and that's all you need to in practice, and larger learning rates cause problems with convergence.
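For what it's worth, the equilibrium formula itself does check out in a toy setting. The sketch below (my own setup, separate from the LLM calculation) runs SGD on a 1D quadratic loss with Gaussian gradient noise and compares the empirical stationary variance with the $\eta \sigma^2 / 2$ that the $\exp(-2L/(\eta \sigma^2))$ distribution predicts:

```python
import numpy as np

# Toy check of the equilibrium distribution exp(-2 L / (eta * sigma^2)) for SGD on the
# 1D quadratic loss L(x) = x^2 / 2 with additive Gaussian gradient noise (batch size 1).
rng = np.random.default_rng(0)
eta, sigma, steps = 0.01, 1.0, 500_000
noise = sigma * rng.standard_normal(steps)

x = 0.0
xs = np.empty(steps)
for t in range(steps):
    x -= eta * (x + noise[t])    # true gradient of x^2 / 2 is x, plus per-step noise
    xs[t] = x

empirical_var = xs[steps // 10:].var()     # discard a burn-in period
predicted_var = eta * sigma**2 / 2         # variance of a density prop. to exp(-x^2 / (eta * sigma^2))
print(f"empirical Var(x): {empirical_var:.5f}")
print(f"predicted Var(x): {predicted_var:.5f}")
```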
That's useful to know, thanks. Is anything else known about the properties of the noise covariance beyond "it's not constant"?
Some comments on the paper itself: if the problem is that SGD with homoskedastic Gaussian noise fails to converge to a stationary distribution, why don't they define SGD over a torus instead? Seems like it would fix the problem they are talking about, and if it doesn't change the behavior it means their explanation of what's going on is incorrect.
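Here's the kind of thing I have in mind, as a 1D toy sketch (my own example, not the paper's setup): wrap the parameter onto $[0, 2\pi)$ so the chain lives on a compact space and a stationary distribution trivially exists, even with homoskedastic Gaussian noise:

```python
import numpy as np

# Sketch of "SGD on a torus": the parameter lives on [0, 2*pi) and updates wrap around,
# so the chain has a compact state space even with homoskedastic Gaussian noise.
# Toy 1D example with loss L(theta) = 1 - cos(theta), minimized at theta = 0.
rng = np.random.default_rng(0)
eta, sigma, steps = 0.05, 0.5, 500_000
noise = sigma * rng.standard_normal(steps)

theta = np.pi
trace = np.empty(steps)
for t in range(steps):
    theta -= eta * (np.sin(theta) + noise[t])   # gradient of 1 - cos(theta) is sin(theta)
    theta %= 2 * np.pi                          # wrap around: this is the torus structure
    trace[t] = theta

# Mass concentrates near the minimum at theta = 0 (equivalently 2*pi), roughly as a
# stationary density proportional to exp(-2 * (1 - cos(theta)) / (eta * sigma^2)) predicts.
hist, _ = np.histogram(trace[steps // 10:], bins=8, range=(0, 2 * np.pi), density=True)
print(np.round(hist, 3))
```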
If the only problem is that with homoskedastic Gaussian noise convergence to a stationary distribution is slow (when a stationary distribution does exist), I could believe that. Similar algorithms such as Metropolis-Hastings also have pretty abysmal convergence rates in practice when applied to any kind of complicated problem. It's possible that SGD with batch noise has better regularization properties and therefore converges faster, but I don't think that changes the basic qualitative picture I present in the post.
Check the Wikipedia section for the stationary distribution of the overdamped Langevin equation.
I should probably clarify that it's difficult to have a rigorous derivation of this claim in the context of SGD in particular, because it's difficult to show absence of heteroskedasticity in SGD residuals. Still, I believe that this is probably negligible in practice, and in principle this is something that can be tested by experiment.
Sure, I agree that I didn't put this information into the post. However, why do you need to know which is more likely to know anything about e.g. how neural networks generalize?
I understand that SLT has some additional content beyond what is in the post, and I've tried to explain how you could make that fit in this framework. I just don't understand why that additional content is relevant, which is why I left it out.
As an additional note, I wasn't really talking about floating point precision being the important variable here. I'm just saying that if you want the complexity measure from my post to match the notion of real log canonical threshold, you have to discretize SLT in a way that might not be obvious at first glance, and in a way where some conclusions end up being scale-dependent. This is why if you're interested in studying this question of the relative contribution of singular points to the partition function, SLT is a better setting to be doing it in. At the risk of repeating myself, I just don't know why you would try to do that.
You need to discretize the function before taking preimages. If you just take preimages in the continuous setting, of course you're not going to see any of the interesting behavior SLT is capturing.
In your case, let's say that we discretize the function space by choosing which one of the functions you're closest to for some . In addition, we also discretize the codomain of by looking at the lattice for some . Now, you'll notice that there's a radius disk around the origin which contains only functions mapping to the zero function, and as our lattice has fundamental area this means the "relative weight" of the singularity at the origin is like .
In contrast, all other points mapping to the zero function only get a relative weight of where is the absolute value of their nonzero coordinate. Cutting off the domain somewhere to make it compact and summing over all to exclude the disk at the origin gives for the total contribution of all the other points in the minimum loss set. So in the limit the singularity at the origin accounts for almost everything in the preimage of . The origin is privileged in my picture just as it is in the SLT picture.
I think your mistake is that you're trying to translate between these two models too literally, when you should be thinking of my model as a discretization of the SLT model. Because it's a discretization at a particular scale, it doesn't capture what happens as the scale is changing. That's the main shortcoming relative to SLT, but it's not clear to me how important capturing this thermodynamic-like limit is to begin with.
Again, maybe I'm misrepresenting the actual content of SLT here, but it's not clear to me what SLT says aside from this, so...
I'm not too sure how to respond to this comment because it seems like you're not understanding what I'm trying to say.
I agree there's some terminology mismatch, but this is inevitable because SLT is a continuous model and my model is discrete. If you want to translate between them, you need to imagine discretizing SLT, which means you discretize both the codomain of the neural network and the space of functions you're trying to represent in some suitable way. If you do this, then you'll notice that the worse a singularity is, the lower the complexity (in my sense) of the corresponding discrete function will turn out to be, because many of its neighbors map to the same function after discretization.
The content that SLT adds on top of this is what happens in the limit where your discretization becomes infinitely fine and your dataset becomes infinitely large, but your model doesn't become infinitely large. In this case, SLT claims that the worst singularities dominate the equilibrium behavior of SGD, which I agree is an accurate claim. However, I'm not sure what this claim is supposed to tell us about how NNs learn. I can't make any novel predictions about NNs with this knowledge that I couldn't before.
I don't think this representation of the theory in my post is correct. The effective dimension of the singularity near the origin is much higher, e.g. because near every other minimal point of this loss function the Hessian doesn't vanish, while for the singularity at the origin it does vanish. If you discretized this setup by looking at it with a lattice of mesh $\varepsilon$, say, you would notice that the origin is surrounded by many parameters that give nearly identical loss, while near other parts of the space the number of such parameters is far fewer.
The reason you have to do some kind of "translation" between the two theories is that SLT can see not just exactly optimal points but also nearly optimal points, and bad singularities are surrounded by many more nearly optimal points than better-behaved singularities. You can interpret the discretized picture above as the SLT picture seen at some "resolution" or "scale" $\varepsilon$, i.e. if you discretize the loss function by evaluating it on a lattice with mesh $\varepsilon$ you get my picture. Of course, this loses the information of what happens as $\varepsilon \to 0$ and the dataset size goes to infinity in some thermodynamic limit, which is what you recover when you do SLT.
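As a concrete (if toy) version of this, take $L(x, y) = x^2 y^2$ as a stand-in example of my own choosing, and count mesh points that achieve nearly minimal loss near the degenerate singularity at the origin versus near an ordinary point of the minimum loss set:

```python
import numpy as np

# Count lattice points (mesh eps) near two minima of L(x, y) = (x * y)^2 that achieve
# nearly minimal loss. The origin is a degenerate singularity (the Hessian vanishes there),
# while at a point like (1, 0) the Hessian has rank 1.
def L(x, y):
    return (x * y) ** 2

eps = 1e-3       # lattice mesh
cutoff = 1e-6    # "nearly minimal" loss threshold
radius = 0.1     # look inside a box of this half-width around each minimum

def count_near(x0, y0):
    grid = np.arange(-radius, radius + eps, eps)
    X, Y = np.meshgrid(x0 + grid, y0 + grid)
    return int(np.sum(L(X, Y) < cutoff))

print("near the origin (degenerate):   ", count_near(0.0, 0.0))
print("near (1, 0) (non-degenerate):   ", count_near(1.0, 0.0))
```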
I just don't see what this thermodynamic limit tells you about the learning behavior of NNs that we didn't know before. We already know NNs approximate Solomonoff induction if the complexity measure from my post is a good approximation to Kolmogorov complexity and so forth. What additional information is gained by knowing what $L$ looks like as a smooth function as opposed to a discrete function?
In addition, the strong dependence of SLT on being analytic is bad, because analytic functions are rigid: their value in a small open subset determines their value globally. I can see why you need this assumption because quantifying what happens near a singularity becomes incredibly difficult for general smooth functions, but because of the rigidity of analytic functions the approximation that "we can just pretend NNs are analytic" is more pernicious than e.g. "we can just pretend NNs are smooth". Typical approximation theorems like Stone-Weierstrass also fail to save you because they only work in the sup-norm and that's completely useless for determining behavior at singularities. So I'm yet to be convinced that the additional details in SLT provide a more useful account of NN learning than my simple description above.
Can you give an example of an $L$ which has the mode of singularity you're talking about? I don't think I'm quite following what you're talking about here.
In SLT, $L$ is assumed analytic, so I don't understand how the Hessian can fail to be well-defined anywhere. It's possible that the Hessian vanishes at some point, suggesting that the singularity there is even worse than quadratic, e.g. at the origin or something like that. But even in this regime essentially the same logic is going to apply: the worse the singularity, the further away you can move from it without changing the value of $L$ very much, and accordingly the singularity contributes more to the volume of the set $\{ L \leq \varepsilon \}$ as $\varepsilon \to 0$.
Say that you have a loss function $L : \mathbb{R}^d \to \mathbb{R}$. The minimum loss set is probably not exactly the critical set $\{ \nabla L = 0 \}$, but it has something to do with that, so let's pretend that it's exactly that for now.
This is a collection of $d$ equations that are generically independent and so should define a subset of dimension zero, i.e. a collection of points in $\mathbb{R}^d$. However, there might be points at which the vanishing of the partial derivatives does not give $d$ independent equations, so we get something of positive dimension instead.
In these cases, what happens is that the gradient itself has vanishing derivatives in some directions. In other words, the Hessian matrix fails to be of full rank. Say that this matrix has rank $r$ at a specific singular point $\theta^*$, and consider the set $\{ \theta : L(\theta) \leq \varepsilon \}$. Diagonalizing the Hessian will generically bring $L$ into a local form where it's a linear combination of $r$ quadratic terms and higher-order (cubic and beyond) terms in the remaining $d - r$ directions, and locally the volume contribution to this set around $\theta^*$ will be something of order $\varepsilon^{r/2} \cdot \varepsilon^{(d-r)/3}$ or larger, compared to $\varepsilon^{d/2}$ for a non-degenerate critical point. The worse the singularity, the smaller the rank $r$ and the greater the volume contribution of the singularity to the set $\{ L \leq \varepsilon \}$.
The worst singularities dominate the behavior at small $\varepsilon$ because you can move "much further" along directions where $L$ scales in a cubic (or higher-order) fashion than along directions where it scales in a quadratic fashion, so those dimensions are the only ones that "count" in this calculation when you compare singularities. The tangent space intuition doesn't apply directly here, but something like it still applies, in the sense that the worse a singularity is, the more directions there are along which you can move away from it without changing the value of the loss very much.
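If it helps, here is a toy numerical check of the volume scaling (the loss functions are examples of my own choosing): a Monte Carlo estimate of the volume of $\{ L < \varepsilon \}$ near a non-degenerate minimum versus a degenerate one, as $\varepsilon$ shrinks:

```python
import numpy as np

# Monte Carlo estimate of Vol({L < eps}) inside the square [-1, 1]^2 around the origin,
# for a non-degenerate minimum (full-rank Hessian) and a degenerate one (vanishing Hessian).
rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(4_000_000, 2))
x, y = pts[:, 0], pts[:, 1]

losses = {
    "L = x^2 + y^2 (rank-2 Hessian)": x**2 + y**2,
    "L = x^2 * y^2 (degenerate)    ": (x * y) ** 2,
}

for eps in (1e-2, 1e-3, 1e-4):
    for name, L in losses.items():
        vol = 4.0 * np.mean(L < eps)    # the square has area 4
        print(f"eps = {eps:.0e}   {name}   Vol ~ {vol:.4f}")
    print()

# The non-degenerate volume shrinks like eps (i.e. eps^(d/2) with d = 2), while the
# degenerate one shrinks much more slowly (like sqrt(eps) * log(1/eps)), which is why
# the worst singularities dominate as eps -> 0.
```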
Is this intuitive now? I'm not sure what more to do to make the result intuitive.