LessWrong 2.0 Reader
Two things lead me to think human content online will soon become way more valuable.
The implication: make tons of digital stuff. Write, draw, voice-record, etc.
leogao on leogao's Shortform: when i was new to research, i wouldn't feel motivated to run any experiment that wouldn't make it into the paper. surely it's much more efficient to only run the experiments that people want to see in the paper, right?
now that i'm more experienced, i mostly think of experiments as something i do to convince myself that a claim is correct. once i get to that point, actually getting the final figures for the paper is the easy part. the hard part is finding something unobvious but true. with this mental frame, it feels very reasonable to run 20 experiments for every experiment that makes it into the paper.
minusgix on evhub's Shortform: 2b/2c. I think I would say that we should want a tyranny of the present to the extent that is in our values upon reflection. If, for example, Rome still existed and took over the world, their CEV should depend on their ethics and population. I think it would still be a very good utopia, but it may also have things we dislike.
Other considerations, like nearby Everett branches... well they don't exist in this branch? I would endorse game theoretical cooperation with them, but I'm skeptical of any more automatic cooperation than what we already have. That is, this sort of fairness is a part of our values, and CEV (if not adversarially hacked) should represent those already?
I don't think this would end up as a tyranny in anything like the usual sense of the word if we're actually implementing CEV. We have values for people being able to change and adjust over time, and so those are in the CEV.
There may very well be limits to how far we want humanity to change in general, but that's perfectly allowed to be in our values. Like, as a specific example, some have said that they think global status games will be vastly important in the far future, and thus that status will be a zero-sum resource. I find it decently likely that an AGI implementing CEV would discourage this, because humans wouldn't endorse it on reflection, even if it is a plausible default outcome.
Like, essentially my view is: optimize our branch's humanity's values as hard as possible; this contains desires for other people's values to be satisfied, and thus they're represented. Other forms of fairness, toward things we aren't completely fans of, can be bargained for (locally, or acausally between branches/whatever).
So that's my argument against the tyranny and Everett branches part. I'm less skeptical of considering whether to include the recently dead, but I also don't have a great theory of how to weight them. Those about to be born wouldn't have a notable effect on CEV, I'd believe.
The option you suggest in #3 is nice, though I think it runs some risks of being dominated or notably influenced by "humans in other very odd branches", and so we're outweighed by them despite them not locally existing. I think it is less that you want a human predicate, and more of a "human who has values compatible with this local branch". This is part of why I advocate just bargaining between branches: if the humans in an AGI-made New Rome want us to instantiate their constructed friendly/restricted AGI Gods locally to proselytize, they can trade for it rather than that faction being automatically divvied out a star by our AGI's CEV.
"Human who has values compatible with this local branch" feels weak as a definition, arbitrary, but I'm not sure we can do better than that. I imagine we'd even have weightings, because we likely legitimately value baby's in special ways that don't entail maxing out reward centers or boosting them to megaminds soon after birth, we have preferences about that. Then of course there's minds that are sortof humanish, which is why you'd have a weighting.
(This is kinda rambly, but I do think a lot of this can be avoided with just plain CEV because I think most people on reflection would end up with "reevaluate whether the deal was fair with reflection and then adjust the deal and reference class based on that".)
martin-randall on The Case Against AI Control Research: I like Wentworth's toy model, but I want it to have more numbers, so I made some up. This leads me to the opposite conclusion to Wentworth.
I think (2-20%) is pretty sensible for successful intentional scheming of early AGI.
Assume the Phase One Risk is 10%.
Superintelligence is extremely dangerous by (strong) default. It will kill us or at least permanently disempower us, with high probability, unless we solve some technical alignment problems before building it.
Assume the Phase Two Risk is 99%. Also:
The justification for these numbers is that each billion dollars buys us a "dignity point" aka +1 log-odds of survival. This assumes that both research fields are similarly neglected and tractable.
Therefore:
We should therefore spend billions of dollars on both AI control and AI alignment; both are very cost-efficient. This conclusion is robust to many different assumptions, provided that overall P(Doom) < 100%. So this model is not really a "case against AI control research".
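To make the arithmetic concrete, here is a minimal sketch of that toy model in Python. The 10%/99% phase risks and the "+1 log-odds of survival per billion dollars" assumption are taken from above; the log-odds base (natural log) and the specific spending levels are illustrative assumptions of mine, not part of the comment.

```python
# Minimal sketch of the toy model above. Each billion dollars spent on a
# phase is assumed to add +1 log-odds (natural log) of surviving that phase.
import math

def survival_after_spending(base_risk, billions):
    """Survival probability for one phase after spending `billions` dollars."""
    odds = (1 - base_risk) / base_risk       # odds of surviving the phase
    new_odds = odds * math.exp(billions)     # +1 log-odds per billion
    return new_odds / (1 + new_odds)

phase_one_risk = 0.10   # early-AGI scheming, addressed by control research
phase_two_risk = 0.99   # superintelligence, addressed by alignment research

for spend in (0, 1, 2, 5):
    p1 = survival_after_spending(phase_one_risk, spend)   # spend on control
    p2 = survival_after_spending(phase_two_risk, spend)   # spend on alignment
    print(f"${spend}B on each: P(survive phase 1)={p1:.3f}, "
          f"P(survive phase 2)={p2:.3f}, P(survive both)={p1 * p2:.3f}")
```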
nathan-helm-burger on Mikhail Samin's Shortform: It has been pretty clearly announced to the world by various tech leaders that they are explicitly spending billions of dollars to produce "new minds vastly smarter than any person, which pose double-digit risk of killing everyone on Earth". This pronouncement has not yet incited riots. I feel like discussing whether Anthropic should be on the riot-target-list is a conversation that should happen after the OpenAI/Microsoft, DeepMind/Google, and Chinese datacenters have been burnt to the ground.
Once those datacenters have been reduced to rubble, and the chip fabs also, then you can ask things like, "Now, with the pressure to race gone, will Anthropic proceed in a sufficiently safe way? Should we allow them to continue to exist?" I think that, at this point, one might very well decide that the company should continue to exist with some minimal amount of compute, while the majority of the compute is destroyed. I'm not sure it makes sense to have this conversation while OpenAI and DeepMind remain operational.
raemon on Raemon's Shortform: Someone just noted that the Review Voting widget might imply that the "Jan 5" end time is meant to be inclusive of a full 24 hours from now, which wasn't the intent. But given that people may have been expecting that, and that the consequences for Not Extending aren't particularly bad, I'm going to give people another 24 hours.
Meanwhile, I said it elsewhere but will say it again here: if there were any posts you were blocked from commenting on that you want to write a review of... I forgot to fix that in the code and it'll take a little while to fix, but meanwhile you can write a shortform comment and DM me (and I will manually set it to be a review of that post), and/or write a top level post and tag it with 2023 Longform Reviews [? · GW] (which will cause it to appear in the Frontpage Review widget).
nathan-helm-burger on Chris_Leong's Shortform: People have said that to get a good prompt it's better to have a discussion with a model like o3-mini, o1, or Claude first, and clarify various details about what you are imagining, then give the whole conversation as a prompt to OA Deep Research.
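A rough sketch of that workflow in Python, using the OpenAI chat API for the clarifying conversation. The model name, the canned answers, and the final hand-off step (flattening the transcript and pasting it into Deep Research) are illustrative assumptions rather than anything specified in the comment.

```python
# Sketch: iterate with a chat model to pin down details, then flatten the
# whole conversation into one prompt for Deep Research. Model name and the
# example answers below are assumptions; adapt to whatever you have access to.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
messages = [
    {"role": "user", "content": (
        "I want a deep-research report on topic X. "
        "Ask me clarifying questions, one at a time, before writing anything."
    )},
]

# A few rounds of answering the model's clarifying questions (hypothetical answers).
for answer in ["Audience: technical readers", "Time frame: last five years", "Focus: empirical results"]:
    reply = client.chat.completions.create(model="o3-mini", messages=messages)
    messages.append({"role": "assistant", "content": reply.choices[0].message.content})
    messages.append({"role": "user", "content": answer})

# Flatten the exchange into a single prompt to paste into Deep Research.
full_prompt = "\n\n".join(f"{m['role'].upper()}: {m['content']}" for m in messages)
print(full_prompt)
```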
knight-lee on Mikhail Samin's Shortform: Even if building intelligence requires solving many many problems, preventing that intelligence from killing you may just require solving a single very hard problem. We may go from having no idea to having a very good idea.
I don't know. My view is that we can't be sure of these things.
jimrandomh on Nick Land: Orthogonality: Downvotes don't (necessarily) mean you broke the rules, per se, just that people think the post is low quality. I skimmed this, and it seemed like... a mix of edgy dark politics with poetic obscurantism?
cstinesublime on Learn to Develop Your Advantage: The second half of this post was rather disappointing. You certainly changed my mind on the seemingly orderly progression of learning from simple to harder with your example about chess. This reminds me of an explanation Ruby on Rails creator David Heinemeier Hansson gave about intentionally putting himself into a class of motor racing above his (then) abilities[1].
However, there was little detail or actionable advice about how to develop advantages, such as where to identify situations that are good for learning, least of all from perceived losses or weaknesses. For example:
...where we genuinely have all the necessary resources (including internal ones). At the very least, it’s useful to develop the skill of finishing tasks quickly and decisively when nothing is actually preventing us from doing so.
I would be hard-pressed to list any situations where I do have the necessary resources, internal or external, to finish the task but just not the inclination to do so promptly. Clean my bedroom, maybe? Certainly, if I gave you a list of things found on my bughunt [LW · GW], none of the high-value bugs would fit this criterion.
I also find the "Maximizing the Effective Use of Resources" section feels very much like "How to draw an owl: draw a circle, now draw the rest of the owl". I am aware that often the first idea we have isn't the best.
Except for me... it often is the best. I know because I have a tendency to commit quota filling. What I mean is, the first idea isn't great, but it's the best I have. All the subsequent ideas, even when I use creativity techniques like "saying no-nos" or removing all internal censors and not allowing myself to feel any embarrassment or shame for posing alternatives - none of them are demonstrably better than the first. In fact they devolve into assemblages of words, like a word salad, that seem to exist only for the purpose of ticking the box of "didn't come up with just one idea and use that, thought of other ideas."
Similarly, role-playing often doesn't work for me, because if I ask myself something like
“What resources and strategies would Harry James Potter-Evans-Verres / Professor Quirrell use in this situation?” If the answer is obvious, why not apply it?
There is never an obvious answer which is applicable to me. For example, I might well ask myself when on a music video set "How would Stanley Kubrick shoot this?" - and then remember that he had 6 days at his disposal to shoot a single lateral dolly track with an 18mm lens, and could do 50 takes if he wanted. I have 6 hours to shoot the rest of the entire video, only portrait-length lenses (55mm and 77mm), and don't have enough track to run a dolly move long enough to shoot it like Kubrick.
I suspect though that this needs to go further upstream - okay, how would Stanley Kubrick get resources to have the luxury of that shot? How would he get the backing of a major studio? Or perhaps more appropriately how would a contemporary music video director like Dave Myers or Hannah Lux Davis get their commissions?
But if I knew that, I'd be doing it. I don't know how they do it. That would involve drawing the rest of the owl.
With this in mind, how can I, like Heinemeier Hansson or your hypothetical chess student, push myself into higher classes and learn strategies to win?
And if his 2013 Le Mans results are anything to go by, it worked: his car came 8th overall and 1st in its class. He beat many ex-Formula One drivers, including race winner Giancarlo Fisichella (21st), podium placer and future WEC champion Kamui Kobayashi (20th), Karun Chandhok and Brendan Hartley (12th), and even Indy 500 winner Alessandro Rossi (23rd).