Thoughts on Francois Chollet's belief that LLMs are far away from AGI?
post by O O (o-o) · 2024-06-14T06:32:48.170Z · LW · GW · 3 comments
This is a question post.
Contents
Answers 8 Seth Herd 5 RogerDearnaley None 3 comments
Dwarkesh recently had a podcast with Francois Chollet (creator of Keras).
He seems fairly skeptical that we are anywhere near AGI with LLMs. He bases this mostly on the observation that LLMs fail on out-of-distribution (OOD) tasks and don't seem to be good at the simple abstract reasoning problems in what he calls the ARC challenge. He seems to think system 2 thinking will be a much harder unlock than people expect and that scaling LLMs will go nowhere. In fact, he goes so far as to say the scaling maximalists have set back AGI progress by 5-10 years. To him, current LLMs are simply information retrieval databases.
He, along with the CEO of Zapier, has launched a 1 million dollar prize for beating the ARC benchmarks, which are apparently hard for LLMs. I didn't believe it at first, given how easy the tasks seem, but barely any progress has been made on the ARC benchmarks in the last 4 years. In retrospect, it's odd that so many existing benchmarks rely heavily on memorized knowledge, and the ARC results are consistent with LLMs being bad at playing sudoku (so maybe they're not that surprising).
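For readers who haven't seen the puzzles, here is a rough Python sketch of what an ARC-style task looks like. As far as I understand the public format, each task is a handful of input/output grid pairs (small 2D arrays of integers standing for colors) plus a test input. The task below is a toy one invented for illustration, not taken from the benchmark, and the candidate-rule search is a deliberately naive stand-in, not anyone's actual solver.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

# Toy task invented for illustration (NOT from the real ARC dataset).
# The hidden rule is "mirror each row left-to-right".
toy_task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0], [0, 0, 4]], "output": [[0, 3, 3], [4, 0, 0]]},
    ],
    "test": [{"input": [[5, 0, 0], [0, 0, 6]]}],
}


def flip_vertical(grid: Grid) -> Grid:
    """Reverse the order of the rows."""
    return [row[:] for row in grid[::-1]]


def flip_horizontal(grid: Grid) -> Grid:
    """Reverse each row."""
    return [row[::-1] for row in grid]


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


# Hypothetical, hand-written list of rules the "solver" already knows.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_vertical": flip_vertical,
    "flip_horizontal": flip_horizontal,
    "transpose": transpose,
}


def solve(task: dict) -> Optional[Tuple[str, List[Grid]]]:
    """Return the first known rule that reproduces every training pair,
    together with its predictions for the test inputs."""
    for name, rule in CANDIDATES.items():
        if all(rule(pair["input"]) == pair["output"] for pair in task["train"]):
            return name, [rule(t["input"]) for t in task["test"]]
    return None  # no memorized rule fits


print(solve(toy_task))
```

This only works because the rule happens to be on a pre-written list. The difficulty described above is that each real task uses a rule the solver has not seen before, so it has to be inferred from two or three examples.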
This seems to be in contradiction with what people on this site generally think. Is the disagreement mainly that system 2 thinking will be a relatively fast unlock (this is my take at least[1]) whereas Francois thinks it will take a long time?
Or does it go deeper?
[1] Personally, my intuition is that LLMs are world modelers and that system 2 thinking will be a relatively simple unlock as they get better at modeling the world.
Answers
I think he's totally right that there's a missing ability in LLMs. He doesn't claim this will be a big blocker. I think we'd be fools to assume this gives us much more time.
My previous comment [LW(p) · GW(p)] on this podcast and pretty much this question says more.
Briefly: There might be a variety of fairly easy ways to add more System 2 thinking and problem-solving and reasoning in genuinely new domains. Here's one for system 2 thinking, and here's one [LW · GW] for better reasoning and knowledge discovery. There might easily be six more that people are busy implementing, half of which will work pretty quickly.
This could be a bottleneck that gives us extra years, but assuming that seems like a very bad idea. We should step lively on the whole alignment project in case this intelligence stuff happens to be a lot easier than we've been thinking prior to getting enough compute and deep nets that really work.
WRT the consensus you mention: there's no consensus, here or elsewhere. Nobody knows. Taking an average would be a bad idea. The distribution among people who've got the right expertise (or as close as we get now) and spend time on prediction is still very broad. This says pretty clearly that nobody knows. That includes this question as well as all other timeline questions. We can't be sure until it's built and working. I've got lots of reasoning behind my guess that this won't take long to solve, but I wouldn't place heavy odds on being right.
That broad distribution is why the smart bet is to have an alignment solution ready for the shorter projected timelines.
↑ comment by eggsyntax · 2024-06-16T11:26:07.286Z · LW(p) · GW(p)
There might be a variety of fairly easy ways to add more System 2 thinking and problem-solving and reasoning in genuinely new domains. Here's one [LW · GW] for system 2 thinking
This approach seemed more plausible to me a year ago than it does now. It seemed feasible enough that I sketched out takeover scenarios along those lines (eg here). But I think we should update on the fact that there doesn't seem to have been much progress in this direction since then, despite eg Auto-GPT getting $12M in funding in November, and lots of other startups racing for commercially useful scaffolded agentic systems. Maybe there are significant successes in that area that I'm unaware of?
Of course a year isn't that long. But I still think it warrants an update if I'm not missing something. And 'LLMs are currently incapable of dealing with novel situations' is the best explanation I see for why that hasn't happened.
There are other interesting places where LLMs fail badly at reasoning, eg planning problems like block-world or scheduling meetings between people with availability constraints; see eg this paper & other work from Kambhampati.
I've been considering putting some time into this as a research direction; the ML community has a literature on the topic but it doesn't seem to have been discussed much in AIS, although the ARC prize could change that. I have an initial sketch of such a direction here, combining lit review & experimentation. Feedback welcomed!
↑ comment by eggsyntax · 2024-06-16T11:42:43.043Z · LW(p) · GW(p)
Note: one possible reason that scaffolded agents haven't succeeded better yet is the argument that Sholto Douglas & Trenton Bricken make on a recent Dwarkesh Patel podcast: that you just need another couple of orders of magnitude of reliability before you can string together substantial chains of subgoals and get good results.
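To make the reliability argument concrete, here is a back-of-envelope sketch in Python. It assumes, purely for illustration, that every subgoal succeeds independently with the same probability. The chain length and per-step probabilities are made-up numbers, not figures from the podcast.

```python
def chain_success(per_step_reliability: float, num_steps: int) -> float:
    """Probability that every step in a chain succeeds,
    assuming steps are independent and equally reliable."""
    return per_step_reliability ** num_steps


# Hypothetical numbers: a 100-step task at increasing per-step reliability.
for p in (0.9, 0.99, 0.999, 0.9999):
    print(f"per-step reliability {p}: 100-step success = {chain_success(p, 100):.3f}")
```

Under these assumptions, 99% per-step reliability gives only about a 37% chance of finishing a 100-step chain, while cutting the error rate by two orders of magnitude (to 99.99%) gives about 99%. Real agent steps are not independent, so this is only a rough picture of why long chains are so sensitive to per-step reliability.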
↑ comment by Seth Herd · 2024-06-17T19:18:01.588Z · LW(p) · GW(p)
This is a good point. I keep expecting to see useful agents released, or to hear the open-source community get excited about their success with non-commercial projects. Neither has happened. So we should update at least a bit. This is harder or less useful than it first seemed.
But I think there's still a good chance that this is the fastest and most obvious route to AGI. In the article I linked, about a year old now, I didn't predict that GPT4 could be turned into AGI, just that LLMs could - and I noted that it would more likely be GPT5 or GPT6 combined with scaffolding that becomes very useful and very dangerous, relatively easily. The dumber the model, the better and more elaborate the scaffolding needs to be to get it past the point of autonomous reasoning and usefulness.
There are two factors you don't mention. One is that the biggest blocker to commercial usefulness wasn't reasoning ability, it was ability to correctly interpret a webpage or other software. Multimodal models largely solved that. So most of the commercial dev effort probably went there until the availability of natively multimodal LLM APIs. That was about six months ago, still a long time. And that doesn't account for less-commercial efforts.
The second is the possibility that GPT4 and the current gen just aren't quite smart enough to have scaffolded System 2 work. The article Large Language Models Cannot Self-Correct Reasoning Yet, from DeepMind and academic authors in October 2023, draws this conclusion. (The many reports of useful self-correction were based on terrible methodology that miscalculated base rates when you allow multiple guesses. Computer scientists are even worse at methodology than social scientists, apparently.)
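The base-rate issue can be illustrated with a small calculation. This is a sketch under stated assumptions, not a reconstruction of any paper's evaluation: it assumes guesses are independent, each is correct with the same probability, and the answer counts as right if any guess is right. The 30% single-guess accuracy is an invented number.

```python
def at_least_one_correct(p_single: float, num_guesses: int) -> float:
    """Chance that at least one of several independent guesses is correct."""
    return 1.0 - (1.0 - p_single) ** num_guesses


# Hypothetical model that is right 30% of the time on a single attempt.
for k in (1, 2, 3):
    print(f"{k} guess(es): {at_least_one_correct(0.3, k):.3f}")
```

With three guesses the apparent accuracy rises from 30% to roughly 66% without any self-correction at all, so a fair comparison has to give the baseline the same number of attempts.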
The "Yet" in their title is important. They think a little more native reasoning ability would get them over the hump to doing useful self-correction. That's one huge application of System 2 reasoning, but not all of it.
My current fear is not that this system 2 scaffolding won't work, it's that it won't work fast enough and easily enough to be the dominant approach. If we bake in the planning and reasoning abilities using RL, a lot of the advantages of language model agents disappear, and several big reasons to think we'll get alignment wrong come back into play.
So I'm thinking that alignment people should actually help make scaffolded system 2 reasoning work, which is a pretty radical proposal relative to most alignment thought.
↑ comment by eggsyntax · 2024-06-18T11:18:36.543Z · LW(p) · GW(p)
But I think there's still a good chance that this is the fastest and most obvious route to AGI.
Agreed that it's quite plausible that LLMs with scaffolding basically scale to AGI. Mostly I'm just arguing that it's an open question with important implications for safety & in particular timelines.
One is that the biggest blocker to commercial usefulness wasn't reasoning ability, it was ability to correctly interpret a webpage or other software.
I'm very skeptical of this with respect to web pages. Some pages include images (eg charts) that are necessary to understand the page content, but for many or most pages, the important content is text in an HTML file, and we know LLMs handle HTML just fine (since they can easily create it on demand).
The second is the possibility that GPT4 and the current gen just aren't quite smart enough to have scaffolded System 2 work.
Agreed, this seems like a totally live possibility.
So I'm thinking that alignment people should actually help make scaffolded system 2 reasoning work, which is a pretty radical proposal relative to most alignment thought.
Personally I'd have to be a lot more confident that alignment of such systems just works to favor alignment researchers advancing capabilities; to me having additional time before AGI seems much more clearly valuable.
↑ comment by Seth Herd · 2024-06-18T22:41:10.749Z · LW(p) · GW(p)
I was also surprised that interpreting webpages was a major blocker. They're in text and HTML, as you say.
I don't remember who said this, but I remember believing them since they'd actually tried to make useful agents. They said that actual modern webpages are such a flaming mess of complex HTML that the LLMs get confused easily.
Your last point, whether steering toward easier-to-align AGI or having more time to work on alignment is preferable, is a very complex issue. I don't have a strong opinion since I haven't worked through it all. But I think there are very strong reasons to think LLM-based AGI is far easier to align than other forms, particularly if the successful approach doesn't heavily rely on RL. So I think your opinion is in the majority, but nobody has worked it through carefully enough to have a really good guess. That's a project I'd like to embark on by writing a post making the controversial suggestion that maybe we should be actively building LMA AGI as the safest of a bad set of options.
↑ comment by eggsyntax · 2024-06-19T08:11:30.859Z · LW(p) · GW(p)
I think that'd be a really valuable post!
I also think we'll get substantial info about the feasibility of LMA in the next six months. Progress on ARC-AGI will tell us a lot about LLMs as general reasoners, I think (and Redwood's excellent new work [LW · GW] on ARC-AGI has already updated me somewhat toward this not being a fundamental blocker). And I think GPT-5 will tell us a lot. 'GPT-4 comes just short of being capable and reliable enough to work well for agentic scaffolding' is a pretty plausible view. If that's true, then we should see such scaffolding working a lot better with GPT-5; if it's false, then we should see continued failures to make it really work.
↑ comment by Seth Herd · 2024-06-18T22:48:54.626Z · LW(p) · GW(p)
I realized I didn't really reply to your first point, and that it's a really important one.
We're in agreement that scaffolded LLMs are a possible first route to AGI, but not a guaranteed one.
If that's the path, timelines are relatively short.
If that's a possibility, we'd better have alignment solutions for that possible path, ASAP.
That's why I continue to focus on aligning LMAs.
If other paths to AGI turn out to be the first routes, timelines are probably a little longer, so we've got a little longer to work on alignment for those types of systems. And there are more people working on RL-based alignment schemes (I think?)
CoT prompting and agentic behavior are basically supplying System 2 thinking. Currently LLMs tend to use and benefit from them for a little while, then sooner or later go off the rails/get caught in a loop/get confused, and are seldom able to get unstuck when they do. What we need is for them to carry out, much more reliably, abilities that they have already demonstrated, which is bread-and-butter for scaling. So I don't see System 2 thinking as a blocker, just work-in-progress. It might take a few years.
As for the ARC challenge, it clearly requires a visual LLM, so systems capable of attempting it have only really existed for about 18 months. My guess is that it will fall soon: progress on math and programming benchmarks has been rapid, so visual logic puzzles don't seem like they would be that hard. I'd guess the main problem is the shortage of visual puzzle training material for tasks like this in most training sets.
↑ comment by O O (o-o) · 2024-06-14T15:30:20.752Z · LW(p) · GW(p)
My guess is that it will fall soon: progress on math and programming benchmarks has been rapid, so visual logic puzzles don't seem like they would be that hard.
His argument is that with millions of examples of these puzzles, you can train an LLM to be good at this particular task, but that doesn't amount to reasoning if it fails at a similar task it hasn't seen. He thinks you should be able to train an LLM to do this without ever training on tasks like these.
I can buy this argument, but still have some doubts. It may be that this reasoning can be derived just from visual training data plus reliably spending more time per token, or he may be right that LLMs are fundamentally terrible at abstract reasoning. I think it would be nice to know the youngest age at which a human can still solve these. That might give us a sense of the "training data" a human needs to get there.
Some caveats: humans can only get 85% on the public test set, I believe. This is to say nothing about the difficulty of the private test set. Maybe it's harder, though I doubt it, since that would go against what he claims is the spirit of the benchmark.
3 comments
comment by lc · 2024-06-18T22:55:40.802Z · LW(p) · GW(p)
Francois seems almost to assume that just because an algorithm takes millions or billions of datapoints to train, that means its output is just "memorization". In fact it seems to me that the learning algorithms just work pretty slowly, and that the thing that's learned after those millions or billions of tries is the actual generative concepts.
comment by eggsyntax · 2024-06-19T08:14:42.856Z · LW(p) · GW(p)
Redwood's new work on ARC-AGI [LW · GW] provides some useful evidence on this question IMO.
comment by Archimedes · 2024-06-16T19:22:13.929Z · LW(p) · GW(p)
My hypothesis is that poor performance on ARC is largely due to lack of training data. If there were billions of diverse input/output examples to train on, I would guess standard techniques would work.
Efficiently learning from just a few examples is something that humans are still relatively good at, especially in simple cases where system 1 and system 2 synergize well. I'm not aware of many cases where AI approaches human level without orders of magnitude more training data than a human ever sees in a lifetime.
I think the ARC challenge can be solved within a year or two, but doing so won’t be super interesting to me unless it breaks new ground in sample efficiency (not trained on billions of synthetic examples) or generalization (e.g. solved using existing LLMs rather than a specialized net).