comment by gwern ·
2020-05-29T18:28:04.400Z
The obvious thing to do here is to plug it into a DRL agent. Something like learning from instructions or from game manuals: NetHack was recently packaged up as an RL environment, so imagine finetuning GPT-3 on the NetHack wiki and then providing text embeddings from GPT-3 to MuZero or Agent57 etc. https://www.technologyreview.com/2020/02/17/844721/ai-openai-moonshot-elon-musk-sam-altman-greg-brockman-messy-secretive-reality/
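To make the idea concrete, here is a minimal sketch of that wiring: a frozen text encoder producing an embedding of some manual/wiki text, concatenated with the game observation and fed to a policy head. Everything here is a hypothetical stand-in: `text_embedding` fakes a finetuned GPT-3 encoder with a hash-seeded random vector, the linear `policy_logits` stands in for a MuZero/Agent57 network, and all dimensions are illustrative.

```python
import hashlib
import numpy as np

EMB_DIM, OBS_DIM, N_ACTIONS = 64, 128, 8  # illustrative sizes, not real model dims
rng = np.random.default_rng(0)

def text_embedding(manual_text: str) -> np.ndarray:
    """Stand-in for a frozen language-model encoder (e.g. GPT-3 finetuned on
    the NetHack wiki). Here: a deterministic hash-seeded random vector; a real
    system would run the model and pool a hidden state."""
    seed = int(hashlib.sha256(manual_text.encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).standard_normal(EMB_DIM)

def policy_logits(obs: np.ndarray, manual: str, w: np.ndarray) -> np.ndarray:
    """Concatenate observation features with the manual embedding and apply a
    linear policy head; the actual DRL agent (MuZero/Agent57) would sit here."""
    x = np.concatenate([obs, text_embedding(manual)])
    return x @ w

w = rng.standard_normal((OBS_DIM + EMB_DIM, N_ACTIONS))
obs = rng.standard_normal(OBS_DIM)
logits = policy_logits(obs, "A wand of digging tunnels through rock.", w)
print(logits.shape)  # (8,)
```

The point is just the interface: the language model is conditioning context for the agent, not the agent itself, so the embeddings can be precomputed and cached per wiki page.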
> One of the biggest secrets is the project OpenAI is working on next. Sources described it to me as the culmination of its previous four years of research: an AI system trained on images, text, and other data using massive computational resources. A small team has been assigned to the initial effort, with an expectation that other teams, along with their work, will eventually fold in. On the day it was announced at an all-company meeting, interns weren’t allowed to attend. People familiar with the plan offer an explanation: the leadership thinks this is the most promising way to reach AGI.
As Schmidhuber put it: "one model to rule them all". Cross-modal learning ought to be much more efficient and yield even more human-like reasoning, e.g. https://arxiv.org/abs/1912.02315. GPT-3 is a text-only self-supervised world-model; being unimodal (so no visual transfer from SimCLR or other recent highly successful image self-supervision) and not benefiting from any RL loops, it has a lot of weaknesses, but it's a start.
Between the continued scaling, the way scaling/pretraining produces ever more human-like systems in terms of performance/adversarial examples, cross-modal learning, transfer learning working in RL, self-supervised learning suddenly crushing it, and the potential of brain imitation learning, the next decade is very exciting indeed (contra predictions that DL will surely top out any time now, real soon, just you wait and see). One can easily imagine a multi-headed architecture where a multimodal GPT-3-like module, trained by self-supervised learning on large text, image, and video datasets (like VideoBERT), feeds into a trunk with modules for ALE, DMLab, the Dactyl robot hand, etc., each doing per-task MuZero-style policy-learning + planning and collecting new experience which is fed back into the self-supervised model, enabling it to do zero-shot tasks in games or robotics or text generation from video or text inputs, learning extremely sample-efficiently (and the more so the more tasks it trains on)...
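The multi-headed architecture above can be sketched in a few lines: one shared encoder (standing in for the multimodal self-supervised module) whose features feed several task-specific heads. Everything is a toy stand-in: the shared encoder is a single tanh layer rather than a GPT-3/VideoBERT-scale model, the heads are linear rather than MuZero-style planners, and the dimensions and task names are illustrative.

```python
import numpy as np

D = 32  # shared representation width (illustrative)
rng = np.random.default_rng(0)

# Hypothetical shared "world-model" encoder: stands in for a multimodal
# GPT-3/VideoBERT-style module trained by self-supervised learning.
W_shared = rng.standard_normal((D, D))

def shared_encode(x: np.ndarray) -> np.ndarray:
    return np.tanh(x @ W_shared)

# Per-task heads (ALE, DMLab, robotics, ...): in the full proposal each would
# do its own MuZero-style policy-learning + planning on the shared features,
# and its collected experience would flow back into the shared model.
heads = {
    "ale": rng.standard_normal((D, 18)),    # e.g. an Atari action set
    "dmlab": rng.standard_normal((D, 15)),  # illustrative action count
    "robot": rng.standard_normal((D, 7)),   # e.g. joint commands
}

def act(task: str, obs: np.ndarray) -> np.ndarray:
    z = shared_encode(obs)   # shared multimodal representation
    return z @ heads[task]   # task-specific policy output

for task in heads:
    print(task, act(task, rng.standard_normal(D)).shape)
```

The design choice doing the work is that the trunk is shared and the heads are cheap, so adding a new task means adding a head, not retraining the world-model from scratch.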
We are increasingly limited mostly by researchers' ability to actually write and tweak and integrate these darn things.