Posts
Comments
Comment by
Kibidango on
Iterated Distillation and Amplification ·
2018-11-30T17:48:20.958Z ·
LW ·
GW
AGZ’s policy network p is the learned model.
I found this bit slightly confusing. As far as I understand from the AGZ Nature paper, AGZ does not have a separate policy network p, but uses a single network which outputs both the learned policy p and the estimated probability v that the current player will win the game. Is this what the sentence is referring to?