Is "Weakly General AI" Already Possible?

post by Tomás B. (Bjartur Tómas) · 2023-05-24T15:08:14.201Z · LW · GW · 4 comments

There is a famous Metaculus question on "weakly general" AI. Its resolution criteria are as follows:

Able to reliably pass a Turing test of the type that would win the Loebner Silver Prize.

Able to score 90% or more on a robust version of the Winograd Schema Challenge, e.g. the "Winogrande" challenge or comparable data set for which human performance is at 90+%

Be able to score 75th percentile (as compared to the corresponding year's human students; this was a score of 600 in 2016) on all the full mathematics section of a circa-2015-2020 standard SAT exam, using just images of the exam pages and having less than ten SAT exams as part of the training data. (Training on other corpuses of math problems is fair game as long as they are arguably distinct from SAT exams.)

Be able to learn the classic Atari game "Montezuma's revenge" (based on just visual inputs and standard controls) and explore all 24 rooms based on the equivalent of less than 100 hours of real-time play (see closely-related question.)

By "unified" we mean that the system is integrated enough that it can, for example, explain its reasoning on an SAT problem or Winograd schema question, or verbally report its progress and identify objects during videogame play.

I feel like this would be doable today if OpenAI were to spend a few million on it.

Advanced prompting strategies like those used by Smart-GPT promise to reduce error significantly - it seems plausible that 90%+ on WinoGrande is achievable now with such strategies.

So all that remains is Montezuma's Revenge - and this could be fixed any time a new model is trained with a Gato-like strategy.

Also, I briefly had access to the leaked GPT-4 multi-modal endpoint. It was shockingly good. It would surprise me, but not overly, if it could solve Montezuma's Revenge - just very slowly and very, very expensively. If it were combined with a weaker system that could move in the game and query GPT-4 for help getting unstuck, perhaps that would work - and GPT-4 would be able to explain its actions, provided its context window contains some description of the actions in text or photos.

At the very least, I'm pretty confident next-generation LLMs (GPT-5 and Gemini) will be able to pass all the criteria, so I think 2024 is a good estimate - ignoring the fact that these systems will be trained to reveal they are LLMs. 


Comments sorted by top scores.

comment by Christopher King (christopher-king) · 2023-05-24T18:21:41.556Z · LW(p) · GW(p)

I think a Loebner Silver Prize is still out of reach of current tech; GPT-4 sucks at most board games (which a judge could test over text).

I won't make any bets about GPT-5 though!

comment by Gordon Seidoh Worley (gworley) · 2023-05-24T17:41:06.711Z · LW(p) · GW(p)

If it is the case that OpenAI is already capable of building a weakly general AI by this process, then I guess most of the remaining uncertainty lies in determining when it's worthwhile for them or someone like them to do it.

comment by hold_my_fish · 2023-05-25T06:35:56.614Z · LW(p) · GW(p)

I believe you're underrating the difficulty of Loebner-silver. See my post on the topic. [LW · GW] The other criteria are relatively easy, although it would be amusing if a text-based system failed on the technicality of not playing Montezuma's Revenge.

comment by nickwelp · 2023-05-25T06:07:41.697Z · LW(p) · GW(p)

I think even basic LLMs that fall short of general AI can be powerfully helpful as the center point in a series of nodes constituting a mind-like thing for less intelligent robots that are still very helpful, like home assistants.