[AN #159]: Building agents that know how to experiment, by training on procedurally generated games

post by Rohin Shah (rohinmshah) · 2021-08-04T17:10:03.823Z · LW · GW · 4 comments

      HIGHLIGHTS 
      NEAR-TERM CONCERNS 
        RECOMMENDER SYSTEMS 
      AI GOVERNANCE 
      OTHER PROGRESS IN AI 
        MULTIAGENT RL 
        DEEP LEARNING 
      NEWS 
      FEEDBACK
      PODCAST

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

Generally capable agents emerge from open-ended play (Open-Ended Learning Team et al) (summarized by Zach): Artificial intelligence agents have become successful at games when trained for each game separately. However, it has proven challenging to build agents that can play previously unseen games. This paper makes progress on this challenge in three primary areas: creating rich simulated environments and tasks, training agents with attention mechanisms over internal states, and evaluating agents over a variety of games. The authors show that agents trained with goal-based attention in their proposed environment (XLand) succeed at a range of novel, unseen tasks with no additional training required. Moreover, such agents appear to use general tactics such as decision-making, tool use, and experimentation during game-play episodes.

The authors argue that training-data generation is a central challenge to training general RL agents (an argument we’ve seen before with POET (AN #41) and PAIRED (AN #136)). They propose the training environment XLand to address this. XLand includes many multiplayer games within consistent, human-relatable 3D worlds and allows for dynamic agent learning through the procedural generation of tasks, which are split into three components: world, agents, and goals. The inclusion of other agents makes this a partially observable environment. Goals are defined with Boolean formulas. Each goal is a combination of options and every option is a combination of atomic predicates. For example, in hide-and-seek one player has the goal see(me,opponent) and the other player not(see(opponent,me)). The space of worlds and games is shown to be both vast and smooth, which supports training.
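To make the goal language concrete, here is a minimal sketch of how such Boolean goals could be represented and checked. It assumes that "combination" means an OR over options and an AND over the predicates inside each option (disjunctive normal form). The dict-based state and the names `see`, `negate`, and `goal_satisfied` are hypothetical stand-ins for illustration, not XLand's actual interface.

```python
from typing import Callable, Dict, List

# Assumption: a world state is modelled as a dict of boolean facts.
# XLand's real state is a 3D simulation; this is only a stand-in.
State = Dict[str, bool]
Predicate = Callable[[State], bool]


def see(observer: str, target: str) -> Predicate:
    """Atomic predicate: true when `observer` currently sees `target`."""
    return lambda state: state.get(f"see({observer},{target})", False)


def negate(predicate: Predicate) -> Predicate:
    """Negation of an atomic predicate."""
    return lambda state: not predicate(state)


# An option is a conjunction (AND) of atomic predicates;
# a goal is a disjunction (OR) of options.
Option = List[Predicate]
Goal = List[Option]


def goal_satisfied(goal: Goal, state: State) -> bool:
    """A goal holds if every predicate of at least one option holds."""
    return any(all(p(state) for p in option) for option in goal)


# Hide-and-seek: the two players have opposing goals over the same predicate.
seeker_goal: Goal = [[see("seeker", "hider")]]
hider_goal: Goal = [[negate(see("seeker", "hider"))]]

state = {"see(seeker,hider)": True}
seeker_wins = goal_satisfied(seeker_goal, state)
hider_wins = goal_satisfied(hider_goal, state)
```

Under this representation, a new game is just a different set of options for each player, which is one way to see why the space of games can be sampled procedurally.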

The agents themselves are trained using deep-RL combined with a goal-attention module (GOAT). The per-timestep observations of the agent are ego-centric RGB images, proprioception values indicating forces, and the goal of the agent. The GOAT works by processing this information with a recurrent network and then using a goal-attention module to select hidden states that are most relevant to achieving a high return. This is determined by estimating the expected return if the agent focused on an option until the end of the episode.

As with many other major deep-RL projects, it is important to have a good curriculum, where more and more challenging tasks are introduced over time. The obvious method of choosing tasks with the lowest reward doesn’t work, because the returns from different games are non-comparable. To address this, an iterative notion of improvement is proposed, and scores are given as percentiles relative to a population. This is similar in spirit to the AlphaStar League (AN #43). Following this, game-theoretic notions such as Pareto dominance can be used to compare agents and to determine how challenging a task is, which can then be used to create a curriculum.
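As a rough illustration of why percentiles help, the sketch below turns raw returns into per-task scores relative to a population and then compares two agents by Pareto dominance. The scoring rule (fraction of the population matched or beaten), the task names, and the numbers are all assumptions made up for this example; the paper's exact normalization may differ.

```python
from typing import Dict, List


def percentile_score(agent_return: float, population_returns: List[float]) -> float:
    """Fraction of the population whose return the agent matches or beats."""
    if not population_returns:
        raise ValueError("population_returns must be non-empty")
    beaten = sum(1 for r in population_returns if agent_return >= r)
    return beaten / len(population_returns)


def pareto_dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if `a` is at least as good as `b` on every shared task
    and strictly better on at least one."""
    tasks = a.keys() & b.keys()
    at_least_as_good = all(a[t] >= b[t] for t in tasks)
    strictly_better = any(a[t] > b[t] for t in tasks)
    return at_least_as_good and strictly_better


# Hypothetical data: raw returns live on different scales in each game,
# so they cannot be compared across games directly.
population = {"tag": [0.0, 1.0, 2.0, 3.0], "capture": [10.0, 50.0, 90.0, 100.0]}
agent_a = {"tag": 3.0, "capture": 90.0}
agent_b = {"tag": 1.0, "capture": 90.0}

scores_a = {t: percentile_score(agent_a[t], population[t]) for t in population}
scores_b = {t: percentile_score(agent_b[t], population[t]) for t in population}
a_dominates_b = pareto_dominates(scores_a, scores_b)
```

Here agent A ties agent B on one task and beats it on the other, so A Pareto-dominates B even though no single averaged number was ever computed across games.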

Five generations of agents are trained, each of which is used in the next generation to create opponents and relative comparisons for defining the curriculum. Early in training, the authors find that adding an intrinsic reward based on self-play is important to achieve good performance. This encourages agents to achieve non-zero rewards in as many games as possible, which the authors call “participation”. The authors also conduct an ablation study and find that dynamic task generation, population-based methods, and the GOAT module have a significant positive impact on performance.

The agents produced during training have desirable generalization capability. They can compete in games that were not seen before in training. Moreover, fine-tuning dramatically improves the performance of agents in tasks where training from scratch completely fails. A number of case studies are also presented to explore emergent agent behavior. In one experiment, an agent is asked to match a colored shape and another environment feature such as a shape or floor panel. At the start of the episode, the agent decides to carry a black pyramid to an orange floor, but then after seeing a yellow sphere changes options and places the two shapes together. This shows that the agent has robust option evaluation capability. In other experiments, the agents show the capacity to create ramps to move to higher levels in the world environment. Additionally, agents seem capable of experimentation. In one instance, the agent is tasked with producing a specific configuration of differently colored cube objects. The agent demonstrates trial-and-error and goes through several different configurations until it finds one it evaluates highly.

There are limitations to the agent capabilities. While agents can use ramps in certain situations, they fail to use ramps more generally. For example, they frequently fail to use ramps to cross gaps. Additionally, agents generally fail to create more than a single ramp. Agents also struggle to play cooperative games that require following, when such games were not seen during training. This suggests that experimentation does not extend to co-player behavior. More broadly, whether or not co-player agents decide to cooperate is dependent on the population the agents interacted with during training. In general, the authors find that agents are more likely to cooperate when both agents have roughly equal performance or capability.

Zach's opinion: This is a fairly complicated paper, but the authors do a reasonable job of organizing the presentation of results. In particular, the analysis of agent behavior and their neural representations is well done. At a higher level, I found it interesting that the authors partially reject the idea of evaluating agents with just expected returns. I broadly agree with the authors that the evaluation of agents across multi-player tasks is an open problem without an immediate solution. With respect to agent capability, I found the section on experimentation to be most interesting. In particular, I look forward to seeing more research on how attention mechanisms catalyze such behavior.

Rohin's opinion: One of my models about deep learning is “Diversity is all you need”. Suppose you’re training for some task for which there’s a relevant feature F (such as the color of the goal pyramid). If F only ever takes on a single value in your training data (you only ever go to yellow pyramids), then the learned model can be specialized to that particular value of F, rather than learning a more general computation that works for arbitrary values of F. Instead, you need F to vary a lot during training (consider pyramids that are yellow, blue, green, red, orange, black, etc) if you want your model to generalize to new values of F at test time. That is, your model will be zero-shot robust to changes in a feature F if and only if your training data was diverse along the axis of feature F. (To be clear, this isn’t literally true, it is more like a first-order main effect.)

Some evidence supporting this model:

- The approach in this paper explicitly has diversity in the objective and the world, and so the resulting model works zero-shot on new objectives of a similar type and can be finetuned quickly.

- In contrast, the similar hide and seek project (AN #65) did not have diversity in the objective, had distinctly less diversity in the world, and instead got diversity from emergent strategies for multiagent interaction (but there were fewer than 10 such strategies). Correspondingly, the resulting agents could not be quickly finetuned.

- My understanding is that in image recognition, models trained on larger, more diverse datasets become significantly more robust.

Based on this model, I would make the following predictions about agents in XLand:

- They will not generalize to objectives that can’t be expressed in the predicate language used at training time, such as “move all the pyramids near each other”. (In some sense this is obvious, since the agents have never seen the word “all” and so can’t know what it means.)

- They will not work in any environment outside of XLand (unless that environment looks very very similar to XLand).

In particular, I reject the idea that these agents have learned “general strategies for problem solving” or something like that, such that we should expect them to work in other contexts as well, perhaps with a little finetuning. I think they have learned general strategies for solving a specific class of games in XLand.

You might get the impression that I don’t like this research. That’s not the case at all — it is interesting and impressive, and it suggests that we could take the same techniques and apply them in broader, more realistic domains where the resulting agents could be economically useful. Rather, I expect my readership to overupdate on this result and think that we’ve now reached agents that can do “general planning” or some such, and I want to push against that.

NEAR-TERM CONCERNS

RECOMMENDER SYSTEMS

How Much Do Recommender Systems Drive Polarization? (Jacob Steinhardt) (summarized by Rohin): There is a common worry that social media (and recommender systems in particular) are responsible for increased polarization in recent years. This post delves into the evidence for this claim. By “polarization”, we mostly mean affective polarization, which measures how positive your feelings towards the opposing party are (though we also sometimes mean issue polarization, which measures the correlation between your opinions on e.g. gun control, abortion, and taxes). The main relevant facts are:

1. Polarization in the US has increased steadily since 1980 (i.e. pre-internet), though arguably there was a slight increase from the trend around 2016.

2. Since 2000, polarization has only increased in some Western countries, even though Internet use has increased relatively uniformly across countries.

3. Polarization in the US has increased most in the 65+ age group (which has the least social media usage).

(2) could be partly explained by social media causing polarization only in two-party systems, and (3) could be explained by saying that social media changed the incentives of more traditional media (such as TV) which then increased polarization in the 65+ age group. Nevertheless, overall it seems like social media is probably not the main driver of increased polarization. Social media may have accelerated the process (for instance by changing the incentives of traditional media), but the data is too noisy to tell one way or the other.

Rohin's opinion: I’m glad to see a simple summary of the evidence we currently have on the effects of social media on polarization. I feel like for the past year or two I’ve constantly heard people speculating about massive harms and even existential risks based on a couple of anecdotes or armchair reasoning, without bothering to check what has actually happened; whenever I talk to someone who seems to have actually studied the topic in depth, it seems they think that there are problems with recommender systems, but they are different from what people usually imagine. (The post also notes a reason to expect our intuitions to be misguided: we are unusual in that we get most of our news online; apparently every age group, starting from 18-24, gets more news from television than online.)

Note that there have been a few pieces arguing for these harms; I haven't sent them out in the newsletter because I don't find them very convincing, but you can find links to some of them along with my thoughts here [AF(p) · GW(p)].

Designing Recommender Systems to Depolarize (Jonathan Stray) (summarized by Rohin): This paper agrees with the post above that “available evidence mostly disfavors the hypothesis that recommender systems are driving polarization through selective exposure, aka ‘filter bubbles’ or ‘echo chambers’”. Nonetheless, social media is a huge part of today’s society, and even if it isn’t actively driving polarization, we can ask whether there are interventions that would decrease polarization. That is the focus of this paper.

It isn’t clear that we should intervene to decrease polarization for a number of reasons:

1. Users may not want to be “depolarized”.

2. Polarization may lead to increased accountability as each side keeps a close watch on politicians on the other side. Indeed, in 1950 people used to worry that we weren’t polarized enough.

3. More broadly, we often learn through conflict: you are less likely to see past your own biases if there isn’t someone else pointing out what they are. To the extent that depolarization removes conflict, it may be harmful.

Nonetheless, polarization also has some clear downsides:

1. It can cause gridlock, preventing effective governance.

2. It erodes norms against conflict escalation, leading to outrageous behavior and potentially violence.

3. At current levels, it has effects on all spheres of life, many of which are negative (e.g. harm to relationships across partisan lines).

Indeed, it is plausible that most situations of extreme conflict were caused in part by a positive feedback loop involving polarization; this suggests that reducing polarization could be a very effective method to prevent conflict escalation. Ultimately, it is the escalation and violence that we want to prevent. Thus we should be aiming for interventions that don’t eliminate conflict (as we saw before, conflict is useful), but rather transform it into a version that doesn’t lead to escalation and norm violations.

For this purpose we mostly care about affective polarization, which tells you how people feel about “the other side”. (In contrast, issue polarization is less central, since we don’t want to discourage disagreement on issues.) The paper’s central recommendation is for companies to measure affective polarization and use this as a validation metric to help decide whether a particular change to a recommender system should be deployed or not. (This matches current practice at tech companies, where there are a few high-level metrics that managers use to decide what to deploy, and importantly, the algorithms do not explicitly optimize for those metrics.) Alternatively, we could use reinforcement learning to optimize the affective polarization metric, but in this case we would need to continuously evaluate the metric in order to ensure we don’t fall prey to Goodhart effects.

The paper also discusses potential interventions that could reduce polarization. However, it cautions that they are based on theory or studies with limited ecological validity, and ideally should be checked via the metric suggested above. Nonetheless, here they are:

1. Removing polarizing content. In this case, the polarizing content doesn’t make it to the recommender system at all. This can be done quite well when there are human moderators embedded within a community, but is much harder to do at scale in an automated way.

2. Changing recommendations. The most common suggestion in this category is to increase the diversity of recommended content (i.e. go outside the “filter bubble”). This can succeed, though usually only has a modest effect, and can sometimes have the opposite effect of increasing polarization. A second option is to penalize content for incivility: studies have shown that incivility tends to increase affective polarization. Actively promoting civil content could also help. That being said, it is not clear that we want to prevent people from ever raising their voice.

3. Changing presentation. The interface to the content can be changed to promote better interactions. For example, when the Facebook “like” button was replaced by a “respect” button, people were more likely to “respect” comments they disagreed with.

Rohin's opinion: I really liked this paper as an example of an agenda about how to change recommender systems. It didn’t rely on armchair speculation or anecdotes to determine what problems exist and what changes should be made, and it didn’t just assume that polarization must be bad and instead considered both costs and benefits. The focus on “making conflict healthy” makes a lot of sense to me. I especially appreciated the emphasis on a strategy for evaluating particular changes, rather than pushing for a specific change; specific changes all too often fail once you test them at scale in the real world.

AI GOVERNANCE

Collective Action on Artificial Intelligence: A Primer and Review (Robert de Neufville et al) (summarized by Rohin): This paper reviews much of the work in AI governance (specifically, work on AI races and other collective action problems).

OTHER PROGRESS IN AI

MULTIAGENT RL

Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot (Joel Z. Leibo, Edgar Duéñez-Guzmán, Alexander Sasha Vezhnevets, John P. Agapiou et al) (summarized by Rohin): In supervised learning, the test dataset is different from the training dataset, and thus evaluates how well the learned model generalizes (within distribution). So far, we mostly haven't done this with reinforcement learning: the test environment is typically identical to the training environment. This is because it would be very challenging -- you would have to design a large number of environments and then split them into a train and test set; each environment would take a very long time to create (unlike in, say, image classification, where it takes a few seconds to label an image).

The core insight of this paper is that when evaluating a multiagent RL algorithm, you can get a “force multiplier” by taking a single multiagent environment (called a “substrate”) and “filling in” some of the agent slots with agents that are automatically created using RL to create a “scenario” for evaluation. For example, in the Capture the Flag substrate (AN #14), in one scenario we fill in all but one of the agents using agents trained by A3C, which means that the remaining agent (to be supplied by the algorithm being evaluated) must cooperate with previously-unseen agents on its team, to play against previously-unseen opponents. Scenarios can fall in three main categories:

1. Resident mode: The agents created by the multiagent RL algorithm under evaluation outnumber the background “filled-in” agents. This primarily tests whether the agents created by the multiagent RL algorithm can cooperate with each other, even in the presence of perturbations by a small number of background agents.

2. Visitor mode: The background agents outnumber the agents created by the algorithm under evaluation. This often tests whether the new agents can follow existing norms in the background population.

3. Universalization mode: A single agent is sampled from the algorithm and used to fill all the slots in the substrate, effectively evaluating whether the policy is universalizable.
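The three categories can be sketched as a small classifier over how a substrate's agent slots are filled. This is only an illustration of the definitions above: the names `Mode` and `scenario_mode`, and the terms "focal" (agents from the algorithm under evaluation) and "background" (filled-in agents), are my own labels here, and the handling of equal counts is an assumption since the summary does not cover that case.

```python
from enum import Enum


class Mode(Enum):
    RESIDENT = "resident"
    VISITOR = "visitor"
    UNIVERSALIZATION = "universalization"


def scenario_mode(num_focal: int, num_background: int, single_policy: bool = False) -> Mode:
    """Classify a scenario by who fills the substrate's agent slots.

    num_focal: slots filled by agents from the algorithm under evaluation.
    num_background: slots filled by pre-trained "filled-in" agents.
    single_policy: True if one sampled agent is copied into every focal slot.
    """
    if num_focal <= 0 or num_background < 0:
        raise ValueError("need at least one focal agent and a non-negative background count")
    if num_background == 0 and single_policy:
        return Mode.UNIVERSALIZATION
    if num_focal > num_background:
        return Mode.RESIDENT
    if num_background > num_focal:
        return Mode.VISITOR
    # Assumption: equal counts fit neither definition, so reject them.
    raise ValueError("equal focal and background counts are not covered by the three modes")


resident = scenario_mode(num_focal=7, num_background=1)
visitor = scenario_mode(num_focal=1, num_background=7)
universal = scenario_mode(num_focal=8, num_background=0, single_policy=True)
```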

The authors use this approach to create Melting Pot, a benchmark for evaluating multiagent RL algorithms that can produce populations of agents (i.e. most multiagent RL algorithms). Crucially, the algorithm being evaluated is not permitted to see the agents in any specific scenario in advance; this is thus a test of generalization to new opponents. (It is allowed unlimited access to the substrate.) They use ~20 different substrates and create ~5 scenarios for each substrate, giving a total of ~100 scenarios on which the multiagent RL algorithm can be evaluated. (If you exclude the universalization mode, which doesn’t involve background agents and so may not be a test of generalization, then there are ~80 scenarios.) These cover competitive, collaborative, and mixed-motive scenarios.

Rohin's opinion: You might wonder why Melting Pot is just used for evaluation, rather than as a training suite: given the diversity of scenarios in Melting Pot, shouldn’t you get similar benefits as in the highlighted post? The answer is that there isn’t nearly enough diversity, as there are only ~100 scenarios across ~20 substrates. For comparison, ProcGen (AN #79) requires 500-10,000 levels to get decent generalization performance. Both ProcGen and XLand are more like a single substrate for which we can procedurally generate an unlimited number of scenarios. Both have more diversity than Melting Pot, in a narrower domain; this is why training on XLand or ProcGen can lead to good generalization but you wouldn’t expect the same to occur from training on Melting Pot.

Given that you can’t get the generalization from simply training on something similar to Melting Pot, the generalization capability will instead have to come from some algorithmic insight or by finding some other way of pretraining agents on a wide diversity of substrates and scenarios. For example, you might figure out a way to procedurally generate a large, diverse set of possible background agents for each of the substrates.

(If you’re wondering why this argument doesn’t apply to supervised learning, it’s because in supervised learning the training set has many thousands or even millions of examples, sampled from the same distribution as the test set, and so you have the necessary diversity for generalization to work.)

DEEP LEARNING

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets (Alethea Power et al) (summarized by Rohin): This paper presents an interesting empirical phenomenon with deep learning: grokking.

Consider tasks of the form “a ◦ b = ?”, where “◦” is some operation in modular arithmetic. For example, in the task of addition mod 97, an example problem would be “32 + 77 = ?”. There are exactly 97 possible operands, each of which gets its own token, and so there are 97^2 possible problems that are defined by pairs of tokens. We will train a neural net on some fraction of all possible problems and then ask how well it performs on the remaining problems it didn’t see: that is, we’re asking it to fill in the missing entries in the 97x97 table that defines addition mod 97.
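The dataset setup described above is small enough to sketch directly. The snippet below enumerates all 97^2 problems for addition mod 97 and splits them into seen and held-out entries of the table. The function names, the 30% training fraction, and the seed are illustrative assumptions, and no model is trained here.

```python
import random
from typing import List, Tuple

P = 97  # modulus; each operand 0..96 would get its own token

Problem = Tuple[int, int]


def label(a: int, b: int) -> int:
    """Correct answer for the problem "a + b = ?" under addition mod P."""
    return (a + b) % P


def make_split(train_fraction: float, seed: int = 0) -> Tuple[List[Problem], List[Problem]]:
    """Shuffle all P*P operand pairs and split them into train and held-out sets."""
    problems = [(a, b) for a in range(P) for b in range(P)]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(problems)
    n_train = int(train_fraction * len(problems))
    return problems[:n_train], problems[n_train:]


# Assumption: 30% of the table is used for training, the rest is held out.
train, test = make_split(0.3)
example_answer = label(32, 77)  # 109 mod 97 = 12
```

The held-out pairs are exactly the "missing entries" of the table, so generalization here means filling them in correctly without ever having seen them.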

It turns out that in these cases, the neural net memorizes the training dataset pretty quickly (in around 10^3 updates), at which point it has terrible generalization performance. However, if you continue to train it all the way out to 10^6 updates, then it will often hit a phase transition where you go from random chance to perfect generalization almost immediately. Intuitively, at the point of the phase transition, the network has “grokked” the function and can run it on new inputs as well. Some relevant details about grokking:

1. It isn’t specific to group or ring operations: you also see grokking for tasks like “a/b if b is odd, otherwise a − b”.

2. It is quite sensitive to the choice of hyperparameters, especially learning rate; the learning rate can only vary over about a single order of magnitude.

3. The time till perfect generalization is reduced by weight decay and by adding noise to the optimization process.

4. When you have 25-30% of possible examples as training data, a decrease of 1 percentage point leads to an increase of 40-50% in the median time to generalization.

5. As problems become more intuitively complicated, time till generalization increases (and sometimes generalization doesn’t happen at all). For example, models failed to grok the task x^3 + xy^2 + y (mod 97) even when provided with 95% of the possible examples as training data.

6. Grokking mostly still happens even when adding 1,000 “outliers” (points that could be incorrectly labeled), but mostly stops happening at 2,000 “outliers”.

Read more: Reddit commentary

Rohin's opinion: Another interesting fact about neural net generalization! Like double descent (AN #77), this can’t easily be explained by appealing to the diversity model. I don’t really have a good theory for either of these phenomena, but one guess for grokking is that:

1. Functions that perfectly memorize the data without generalizing (i.e. probability 1 on the true answer and 0 elsewhere) are very complicated, nonlinear, and wonky. The memorizing functions learned by deep learning don’t get all the way there and instead assign a probability of (say) 0.95 to the true answer.

2. The correctly generalizing function is much simpler and for that reason can be easily pushed by deep learning to give a probability of 0.99 to the true answer.

3. Gradient descent quickly gets to a memorizing function, and then moves mostly randomly through the space, but once it hits upon the correctly generalizing function (or something close enough to it), it very quickly becomes confident in it, getting to probability 0.99 and then never moving very much again.

A similar theory could explain deep double descent: the worse your generalization, the more complicated, nonlinear and wonky you are, and so the more you explore to find a better generalizing function. The biggest problem with this theory is that it suggests that making the neural net larger should primarily advantage the memorizing functions, but in practice I expect it will actually advantage the correctly generalizing function. You might be able to rescue the theory by incorporating aspects of the lottery ticket hypothesis (AN #52).

NEWS

Political Economy of Reinforcement Learning (PERLS) Workshop (Stuart Russell et al) (summarized by Rohin): The deadline for submissions to this NeurIPS 2021 workshop is Sep 18. From the website: "The aim of this workshop will be to establish a common language around the state of the art of RL across key societal domains. From this examination, we hope to identify specific interpretive gaps that can be elaborated or filled by members of our community. Our ultimate goal will be to map near-term societal concerns and indicate possible cross-disciplinary avenues towards addressing them."

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.

4 comments

Comments sorted by top scores.

comment by Sammy Martin (SDM) · 2021-08-04T17:50:28.935Z · LW(p) · GW(p)

- They will not work in any environment outside of XLand (unless that environment looks very very similar to XLand).
In particular, I reject the idea that these agents have learned “general strategies for problem solving” or something like that, such that we should expect them to work in other contexts as well, perhaps with a little finetuning. I think they have learned general strategies for solving a specific class of games in XLand.

Strongly agree with this, although with the caveat that it's deeply impressive progress compared to the state of the art in RL research in 2017, where getting an agent to learn to play ten games with a noticeable decrease in performance during generalization was impressive. This is generalization over a few million related games that share a common specification language, which is a big step up from 10 but still a fair way off infinity (i.e. general problem-solving).

It may well be worth having a think about what AI that's human level on language understanding, image recognition and some other things, but significantly below human on long-term planning, would be capable of, and what risks it may present. (Is there any existing writing on this sort of 'idiot savant AI', possibly under a different name?)

It seems to be the view of many researchers that long-term planning will likely be the last obstacle to fall, and that view has been borne out by progress on e.g. language understanding in GPT-3. I don't think this research changes that view much, although I suppose I should update slightly towards long-term planning being easier than I thought.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-08-04T20:06:17.115Z · LW(p) · GW(p)

I wonder if grokking is evidence for, or against, the Mingard et al view that SGD on big neural nets is basically a faster approximation of rejection sampling. Here's an argument that it's evidence against:

--Either the "grokked algorithm circuit" is simpler, or not simpler, than the "memorization circuit."

--If it's simpler, then rejection sampling would reach the grokked algorithm circuit prior to reaching the memorization circuit, which is not what we see.

--If it's not simpler, then rejection sampling would briefly stumble across the grokked algorithm circuit eventually but immediately return to the memorization circuit.

OTOH maybe Mingard could reply that indeed, for small neural nets like these ones SGD is not merely an approximation of rejection sampling but rather meanders a lot, creating a situation where more complex circuits (the memorization ones) can have broader basins of attraction than simpler circuits (the grokked algorithm). But eventually SGD randomly jumps its way to the simpler circuit and then stays there. idk.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2021-08-05T08:20:05.379Z · LW(p) · GW(p)

I feel like everyone is taking the SGD = rejection sampling view way too seriously. From the Mingard et al paper:

"We argue here that the inductive bias found in DNNs trained by SGD or related optimisers, is, to first order, determined by the parameter-function map of an untrained DNN. While on a log scale we find P_SGD(f|S) ≈ P_B(f|S) there are also measurable second order deviations that are sensitive to hyperparameter tuning and optimiser choice."

The first order effect is what lets you conclude that when you ask GPT-3 a novel question like "how many bonks are in a quoit", that it has never been trained on, you can expect that it won't just start stringing characters together in a random way, but will probably respond with English words.

The second order effects could be what tells you whether or not it is going to respond with "there are three bonks in a quoit" or "that's a nonsense question". (Or maybe not! Maybe random sampling has a specific strong posterior there, and SGD does too! But it seems hard to know one way or the other.) Most alignment-relevant properties seem like they are in this class.

Grokking occurs in a weird special case where it seems there's ~one answer that generalizes well and has much higher prior, and everything else is orders of magnitude less likely. I don't really see why you should expect that results on MNIST should generalize to this situation.

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-08-05T10:01:53.469Z · LW(p) · GW(p)

Thanks! I'm not sure I understand your argument, but I think that's my fault rather than yours, since tbh I don't fully understand the Mingard et al paper itself, only its conclusion.