Comments

Comment by Ankesh Anand (ankesh-anand) on "A Generalist Agent": New DeepMind Publication · 2022-05-12T18:54:57.135Z · LW · GW

My main takeaway from Gato: If we can build specialized AI agents for 100s/1000s of tasks, it's now pretty straightforward to make a general agent that can do them all in a single model. Just tokenize the data from all the tasks and feed it into a transformer.
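Roughly, the recipe looks like this (an illustrative sketch, not Gato's actual architecture or tokenizers; the vocabulary size, model dimensions, and the `tokenize_episode` scheme are made up for the example):

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 32_000   # shared vocabulary covering text, discretized pixels, actions (illustrative)
D_MODEL = 512

class TinyDecoder(nn.Module):
    """Minimal decoder-only transformer trained on one mixed token stream."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):                      # tokens: (batch, seq)
        x = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.blocks(x, mask=mask)               # causal self-attention
        return self.head(x)                         # next-token logits

def tokenize_episode(task_id, observations, actions):
    """Toy serialization: interleave discretized observation and action tokens,
    prefixed by a task token. Real systems use per-modality tokenizers."""
    stream = [task_id]
    for obs, act in zip(observations, actions):
        stream.extend(obs)      # e.g. discretized image patches / proprioception bins
        stream.extend(act)      # e.g. discretized continuous actions
    return torch.tensor(stream)
```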

Comment by Ankesh Anand (ankesh-anand) on How do new models from OpenAI, DeepMind and Anthropic perform on TruthfulQA? · 2022-02-27T10:40:35.379Z · LW · GW

Any plans on evaluating RETRO (the retrieval augmented transformer from DeepMind) on TruthfulQA? I'm guessing it should perform similarly to WebGPT but would be nice to get a concrete number. 

Comment by Ankesh Anand (ankesh-anand) on EfficientZero: How It Works · 2021-11-26T18:47:56.392Z · LW · GW

Great post! I think you might want to emphasize just how crucial ReAnalyse is for data-efficiency (the default MuZero is quite sample-inefficient), and how the reanalyse ratio can be tuned easily for any data budget using a log-linear scaling law.  You can also interpret the off-policy correction as running ReAnalyse twice, so my TL;DR of EfficientZero would be "MuZero ReAnalyse + SPR".
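To make the "tuned for any data budget" point concrete, here is a toy version of such a schedule (the log-linear form follows the paper's framing, but the anchor budgets and ratios below are invented for illustration, not the paper's fit):

```python
import math

def reanalyse_ratio(env_frames: float,
                    low=(100e3, 0.99), high=(200e6, 0.30)) -> float:
    """Toy log-linear schedule for the reanalyse ratio.

    Assumption, not the paper's exact fit: the fraction of training samples
    drawn from reanalysed old trajectories decreases linearly in
    log(environment frames) between the two anchor points `low` and `high`
    (both anchors here are made up for illustration).
    """
    (f_lo, r_lo), (f_hi, r_hi) = low, high
    t = (math.log(env_frames) - math.log(f_lo)) / (math.log(f_hi) - math.log(f_lo))
    t = min(max(t, 0.0), 1.0)          # clamp outside the fitted range
    return r_lo + t * (r_hi - r_lo)

# e.g. a 100k-frame Atari budget gives a ratio near 0.99 (mostly reanalyse),
# while much larger budgets rely mostly on fresh data.
```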

Regarding contrastive vs SPR, I don't think you would find a performance boost using a contrastive loss compared to SPR, on Atari at least.  We did an ablation for this in the SPR paper (Table 6, appendix). I suspect the reason contrastive works (slightly) better on Procgen is that the procedural diversity there makes negative examples much more informative.
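For reference, the two objectives in rough form (simplified sketches, not the exact losses from either paper):

```python
import torch
import torch.nn.functional as F

def spr_loss(pred_latents, target_latents):
    """SPR-style self-predictive loss: negative cosine similarity between
    predicted and (stop-gradient) target latents. No negative examples."""
    p = F.normalize(pred_latents, dim=-1)
    t = F.normalize(target_latents.detach(), dim=-1)
    return -(p * t).sum(dim=-1).mean()

def contrastive_loss(pred_latents, target_latents, temperature=0.1):
    """InfoNCE-style contrastive loss: every other item in the batch acts as
    a negative, so the objective benefits from diverse observations."""
    p = F.normalize(pred_latents, dim=-1)
    t = F.normalize(target_latents.detach(), dim=-1)
    logits = p @ t.T / temperature                  # (batch, batch) similarities
    labels = torch.arange(p.size(0), device=p.device)
    return F.cross_entropy(logits, labels)
```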

Definitely agree about moving to multi-task test beds as the next frontier in RL. I also suspect we will see more non-tabula-rasa RL methods, ones that start off with general-purpose pre-trained models or representations and then do only a tiny amount of fine-tuning on the actual RL task.

Comment by Ankesh Anand (ankesh-anand) on EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised · 2021-11-14T21:26:39.424Z · LW · GW

Thanks, glad you liked it, I really like the recent RL directions from OpenAI too! It would be interesting to see the use of model-based RL for the "RL as fine-tuning paradigm": making large pre-trained models more aligned/goal-directed efficiently by simply searching over a reward function learned from humans. 
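One very rough sketch of that idea, as best-of-N search against a learned reward model (the `policy_model`, `tokenizer`, and `reward_model` objects are placeholders, and a genuinely model-based version would plan with a learned dynamics model rather than sampling full completions):

```python
import torch

def search_over_reward(prompt, policy_model, reward_model, tokenizer,
                       num_candidates=16, max_new_tokens=64):
    """Toy best-of-N 'search' against a reward model learned from human
    preferences. All model objects here are hypothetical placeholders."""
    inputs = tokenizer(prompt, return_tensors="pt")
    candidates = policy_model.generate(
        **inputs,
        do_sample=True,
        num_return_sequences=num_candidates,
        max_new_tokens=max_new_tokens,
    )
    texts = [tokenizer.decode(c, skip_special_tokens=True) for c in candidates]
    with torch.no_grad():
        scores = [reward_model(t) for t in texts]   # scalar human-preference scores (placeholder API)
    best = max(range(len(texts)), key=lambda i: scores[i])
    return texts[best]
```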

Comment by Ankesh Anand (ankesh-anand) on EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised · 2021-11-14T04:10:42.119Z · LW · GW

I was eyeballing Figure 2 in the PPG paper and comparing it to our results on the full distribution (Table A.3). 

PPO: ~0.25
PPG: ~0.52
MuZero: 0.68
MuZero+Reconstruction: 0.93

Comment by Ankesh Anand (ankesh-anand) on EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised · 2021-11-14T03:45:35.827Z · LW · GW

The Q-Learning baseline is a model-free control for MuZero: it shares MuZero's implementation details (network architecture, replay ratio, training details, etc.) while removing its model-based components (details in Sec. A.2). Some key differences you'd find vs. a typical Q-learning implementation:

  • Larger network architectures: a 10-block ResNet compared to a few conv layers in typical implementations.
  • Higher sample reuse: with a reanalyse ratio of 0.95, both MuZero and Q-Learning use each replay-buffer sample an average of 20 times. The target network is updated every 100 training steps.
  • Batch size of 1024, and some smaller details like using categorical reward and value predictions, similar to MuZero.
  • We also have a small model-based component that predicts the reward at the next time step, which lets us decompose Q(s,a) into reward and value predictions just like MuZero (a rough sketch of this decomposition follows this list).
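A rough sketch of that decomposition (notation, shapes, and the example support are illustrative, not our exact code):

```python
import torch

def categorical_to_scalar(logits, support):
    """Expected value of a categorical head over a fixed scalar support,
    as in MuZero-style value/reward predictions (the invertible value
    transform MuZero also applies is omitted here)."""
    probs = torch.softmax(logits, dim=-1)
    return (probs * support).sum(dim=-1)

def q_value(reward_logits, next_value_logits, support, discount=0.997):
    """Q(s, a) = r_hat(s, a) + discount * v_hat(s'), with both terms decoded
    from categorical heads."""
    r = categorical_to_scalar(reward_logits, support)
    v = categorical_to_scalar(next_value_logits, support)
    return r + discount * v

# e.g. support = torch.linspace(-300., 300., 601) for Atari-scale returns (illustrative)
```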

I would guess larger networks and higher sample reuse account for the biggest difference from standard Q-learning implementations.

The ProcGen competition might also have used the easy difficulty mode, rather than the hard difficulty mode used in our paper.

Comment by Ankesh Anand (ankesh-anand) on EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised · 2021-11-07T20:10:26.055Z · LW · GW

They do seem to cover SPR (an earlier version of SPR was called MPR). @flodorner If you do decide to update the plot, maybe you could update the label as well? 

Comment by Ankesh Anand (ankesh-anand) on EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised · 2021-11-07T20:06:08.111Z · LW · GW

We do actually train/evaluate on the full distribution (see Figure 5, rightmost panel). The MuZero+SSL versions (especially reconstruction) continue to be a lot more sample-efficient even on the full distribution, and MuZero itself seems to be quite a bit more sample-efficient than PPO/PPG.

Comment by Ankesh Anand (ankesh-anand) on Are we in an AI overhang? · 2020-07-28T07:23:20.914Z · LW · GW

Worth noting that they already use BERT in Search. https://blog.google/products/search/search-language-understanding-bert/

Comment by Ankesh Anand (ankesh-anand) on How uniform is the neocortex? · 2020-05-08T00:16:49.713Z · LW · GW

The raw neural network does use search during training, though; it's only at evaluation that it doesn't rely on search.
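Schematically (not the actual AlphaGo/AlphaZero code), the raw network's policy head is trained towards MCTS visit counts, so search shapes the network even when it is later evaluated without search:

```python
import numpy as np

def policy_target_from_search(visit_counts, temperature=1.0):
    """Schematic AlphaGo Zero-style target: the policy head is trained towards
    pi(a) proportional to N(a)^(1/T), where N(a) are MCTS visit counts, so
    search is baked into the raw network during training."""
    counts = np.asarray(visit_counts, dtype=np.float64) ** (1.0 / temperature)
    return counts / counts.sum()

# e.g. visit counts from one search at a position:
pi_target = policy_target_from_search([120, 30, 0, 10])
# the policy loss is a cross-entropy against pi_target, not a search-free label
```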