# [AN #124]: Provably safe exploration through shielding

post by rohinmshah · 2020-11-04T18:20:06.003Z · LW · GW


Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter **resources here**. In particular, you can look through **this spreadsheet** of all summaries that have ever been in the newsletter.

Audio version **here** (may not be up yet).

# HIGHLIGHTS

**Neurosymbolic Reinforcement Learning with Formally Verified Exploration** *(Greg Anderson et al)* (summarized by Rohin): A typical approach to formally verified safe exploration in RL is to compute a *shield*, which identifies a safe set of states and actions. After this shield is computed, it is “wrapped” around the environment to ensure that if a potentially unsafe action is about to be taken, it is replaced with a safe one. Then, a policy learning algorithm is applied as normal to learn a good policy.

The key insight of this paper is to compute shields for specific *policies*, rather than creating a one-time shield that must apply to the entire state space. Since any given policy will only visit a small fraction of the state space, the shields are easier to compute and can be more permissive.

They assume access to a *worst-case dynamics model*, which given a state and action outputs a *set* of states that could be visited. Given a policy π, an *inductive safety invariant* is a set of safe states that includes all possible initial states and is closed under worst-case transitions: if you start at a state in the set, for any action that π suggests and for any state from the worst-case transition dynamics, that new state will still be in the set. Our algorithm will ensure that any policy we execute will have a corresponding inductive safety invariant.
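The definition above can be made concrete on a toy finite example. This is a minimal sketch, not the paper's method (which works over continuous states with formal verification rather than enumeration); all names here (`worst_case_step`, `policy`, the gridworld itself) are illustrative:

```python
def worst_case_step(state, action):
    """Worst-case dynamics model: the *set* of states the environment may reach."""
    # Toy 1-D gridworld on states 0..4; action is -1 or +1, and the
    # environment may also leave the agent where it is.
    return {max(0, min(4, state + action)), state}

def policy(state):
    """A fixed policy: move toward the middle state 2."""
    return 1 if state < 2 else -1

def is_inductive_invariant(invariant, initial_states):
    """The invariant must contain all initial states and be closed under
    worst-case transitions taken by the policy."""
    if not initial_states <= invariant:
        return False
    return all(worst_case_step(s, policy(s)) <= invariant
               for s in invariant)

print(is_inductive_invariant({1, 2, 3}, initial_states={2}))  # → True
```

Any policy executed with such an invariant can never leave the safe set, no matter which worst-case transition the environment picks.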

Formal verification techniques allow us to find inductive safety invariants for restricted classes of policies. This paper uses the space of deterministic, piecewise linear policies as its set of symbolic policies. But how do we apply this to neural nets? The key idea is to start with a safe symbolic policy, convert it to a neurosymbolic policy, take a neural net gradient step, convert back to a safe symbolic policy, and repeat until done. Let’s go over each of these steps.

First, let’s suppose we have a symbolic policy g with inductive safety invariant φ. Then for any neural net f, we construct the policy h = “f(s) if executing f(s) keeps us within φ no matter what, otherwise g(s)”. It is easy to see that φ is also an inductive safety invariant for h. Which f should we use to create h? The authors train a neural net to imitate g, and use that as their f. (Note that imitating g only requires executing g in the environment, and we know that g is safe.)
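The construction of h can be sketched in a few lines. Everything here is a toy stand-in (the simple worst-case model, the constant policies, the interval invariant are my choices for illustration), but the shape of the shield is as described above:

```python
def worst_case_step(state, action):
    # Toy worst-case model on integer states: the move may overshoot by one.
    return {state + action, state + action + (1 if action > 0 else -1)}

f = lambda s: 1            # learned neural policy: always move right
g = lambda s: -1           # verified symbolic fallback: always move left
phi = set(range(0, 6))     # inductive safety invariant (states 0..5)

def shielded_policy(f, g, phi, worst_case_step):
    """h(s) = f(s) if every worst-case successor stays in phi, else g(s)."""
    def h(state):
        if worst_case_step(state, f(state)) <= phi:
            return f(state)    # f's action provably keeps us inside phi
        return g(state)        # fall back to the verified symbolic policy
    return h

h = shielded_policy(f, g, phi, worst_case_step)
print(h(2), h(4))  # → 1 -1
```

At state 2 the learned policy's action is provably safe and is used; at state 4 a worst-case overshoot would leave φ, so the fallback takes over.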

Now that we have our neurosymbolic policy h, we need to take gradient steps on it. We collect data in the environment using h, but for the gradient we ignore the symbolic part and take a gradient step as though the data were collected using f. (It seems they used an on-policy algorithm for this, introducing bias; I am not sure why they didn’t simply use an off-policy algorithm.) This produces a new neurosymbolic policy h’ that is still safe (since g and φ are unchanged, and they are what guarantee safety).

Finally, we need to convert h’ back into a symbolic policy g’. This is done by a version of imitation learning that works in the symbolic policy space, where a new inductive safety invariant for g’ is found using formal verification techniques.

To start off the whole process, we need an initial symbolic policy, which must be constructed by hand. The authors show using experiments in simple continuous control environments that this method can learn high-reward policies without ever having a safety violation.

**Rohin's opinion:** I really like this as an example of combining the performance of neural networks with the robustness of symbolic approaches. I especially like the fact that the shield is specialized to the current policy and updated over time: I think ML scales so well partly because it only deals with a tiny portion of the input space and can completely ignore the vast majority of possible inputs, and so if you want to add anything on top of ML you need to ensure you preserve this property to ensure scalability. Previous approaches required a shield that is correct across all possible states, failing to preserve this property; in contrast, this approach only requires a shield that is correct for the sequence of learned policies (on whichever states they visit).

I should note that a large portion of why I like this paper is that it feels like it elegantly fits in *both* the formal verification *and* the ML fields. (I used to work in programming languages, of which formal verification is a subfield.) On the formal verification side, the guarantees are clean and simple, and the techniques used are canonical. On the ML side, I mentioned above why I like the fact that the shield is policy-specific and updated over time.

As I’ve said before, I think the real challenge in formal verification for AI alignment is how to handle fuzzy specifications. I think this paper shows a path forward: since the safety is established by an inductive invariant that can change over time, we could potentially use human feedback to establish these inductive invariants and update them over time, without requiring a human to fully specify at the outset exactly what is safe and what isn’t. You could think of it as an expanding whitelist of states which the policy is allowed to visit.

# TECHNICAL AI ALIGNMENT

## LEARNING HUMAN INTENT

**Imitation Learning in the Low-Data Regime** *(Robert Dadashi et al)* (summarized by Zach): **Non-Adversarial Imitation Learning** (**AN #119**) has become more popular recently because GAN-style architectures can be notoriously unstable during training. This paper contributes an imitation learning strategy that minimizes an upper bound on the Wasserstein distance between the imitator's and expert's state visitation distributions. The Wasserstein distance can be understood through the 'Earth Mover's Analogy': the distance is the cost of the most efficient transport strategy for moving probability mass from the imitator distribution to the expert distribution. The advantage of this approach is that the metric can be computed offline. If we calculate the distance for partial rollouts, we can create a dense, albeit non-stationary, reward for the imitator. In experiments, agents trained using the Wasserstein distance learn control tasks from only a single expert trajectory.
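For intuition about the Earth Mover's view, the one-dimensional case with equal-sized samples has a simple closed form: sort both samples and average the pairwise distances, since the optimal plan matches the i-th smallest point of one sample to the i-th smallest of the other. This is only a sketch of the metric itself, not the paper's method (which bounds the distance for high-dimensional state visitation distributions):

```python
def wasserstein_1d(xs, ys):
    """W1 (Earth Mover's) distance between two equal-sized 1-D samples.

    The optimal transport plan in 1-D pairs up the sorted points, so the
    distance is the mean absolute difference after sorting.
    """
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

imitator = [0.0, 1.0, 2.0]
expert   = [1.0, 2.0, 3.0]
print(wasserstein_1d(imitator, expert))  # → 1.0
```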

**Read more:** **Paper: Primal Wasserstein Imitation Learning**

**Zach's opinion:** With this paper, I conclude that IRL works for Mujoco-style control tasks. The performance of this method is similar to offline GAIL, but it is better justified and more stable. Ultimately, though, I'm a bit skeptical of the claim that the method will generalize to other tasks. Results for GAIL/DAC are quite poor in Atari-like environments, whereas pairwise reward modeling seems to perform quite well. This suggests that a reward modeling approach would scale much better in more complicated settings.

## VERIFICATION

**An Inductive Synthesis Framework for Verifiable Reinforcement Learning** *(He Zhu et al)* (summarized by Rohin): This older paper has a pretty similar idea to the one in the highlighted paper. In order to compute a safety shield for a neural network RL agent, we first transform the neural network into a simpler, more symbolic policy, prove safety of the symbolic policy, and then use the generated inductive safety invariant as a shield. This paper also uses deterministic piecewise linear policies as its space of symbolic policies. It only proves safety of the final learned RL policy, and so only guarantees safety at deployment, not at training time. (In other words, it does not guarantee safe exploration, and instead assumes that you are training in simulation so that safety is not a concern.)

**Rohin's opinion:** Since this paper was published at PLDI, it is longer and goes into much more detail about how to actually perform each of these steps, showing them with a running example on the inverted pendulum (where safety is defined as not going beyond a certain angle). I’m not going to summarize the details here, but anyone interested in them should check out this paper before the highlighted one (which is constrained by ML page limits and can’t explain the techniques very well).

Just as a reminder that learning programs does not automatically confer interpretability, I present to you the symbolic policy learned by their method for the inverted pendulum, shown as a figure in the original post.

**Verifiably Safe Exploration for End-to-End Reinforcement Learning** *(Nathan Hunt et al)* (summarized by Rohin): As we saw in the highlight, applications of formal verification to reinforcement learning and safe exploration often rely on *shielding*, in which any proposed unsafe actions are replaced by randomly chosen safe actions. Typically, this requires having an MDP model in a high-level, symbolic state space, such as by defining the MDP over the Atari simulator state, rather than learning from pixels.

This paper demonstrates that we can relax this requirement and learn policies on low-level observations, while still getting the safety guarantees of the shielding approach. The approach is simple: we define (manually) an abstract model of the environment, with a symbolic state space and dynamics model, and use this to create a shield as usual. Then, to learn the policy (which gets pixels as input), we use an object detector to transform the pixels into a symbolic state, and then use the shield if necessary to select which action to take. The authors show that as long as the error of the object detection step is low, the overall policy learning will remain safe.
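The pipeline above can be sketched as follows. All components here are illustrative stand-ins (the toy detector, policy, and interval-based shield are my choices, not the paper's): the policy acts on raw pixels, while the shield checks safety in the symbolic state recovered by the detector:

```python
class ToyShield:
    """Toy shield: in symbolic state x, an action must keep x within [0, 5]."""
    def is_safe(self, x, action):
        return 0 <= x + action <= 5
    def safe_action(self, x):
        return 0  # staying put is always safe in this toy model

detector = lambda pixels: sum(pixels)   # stand-in "object detector"
policy_net = lambda pixels: 2           # stand-in pixel policy: always +2

def shielded_action(pixels, policy_net, detector, shield):
    action = policy_net(pixels)          # learned end-to-end from pixels
    sym_state = detector(pixels)         # pixels -> symbolic state
    if shield.is_safe(sym_state, action):
        return action
    return shield.safe_action(sym_state)  # override with a safe action

shield = ToyShield()
print(shielded_action([1, 1], policy_net, detector, shield))  # state 2: +2 is safe → 2
print(shielded_action([2, 3], policy_net, detector, shield))  # state 5: +2 unsafe → 0
```

The safety guarantee then degrades gracefully with the detector's error rate, as the paper shows.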

**Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming** *(Sumanth Dathathri, Krishnamurthy Dvijotham, Alexey Kurakin, Aditi Raghunathan, Jonathan Uesato et al)* (summarized by Rohin): In parallel with extending verification to sequential settings, as well as learning what specifications to verify, we also need to make verification significantly cheaper in order for it to be feasible to apply it to large neural networks. So far, we have only been able to achieve one of two very desirable properties at a time:

1. The method can scale up to large, independently trained networks. (This has been achieved by methods using linear (LP) relaxations like **this one** (**AN #19**).)

2. The method produces tight bounds and thus avoids producing vacuous results. (Achieved by using relaxations based on semidefinite programming (SDP) instead of linear ones.)

This paper shows how you can massage the SDP version such that the resulting algorithm becomes scalable, changing the runtime and memory requirements from O(n^6) and O(n^4) to O(n) per iteration. The resulting algorithm can be applied to larger neural nets than previous SDP approaches and gives much tighter bounds than LP approaches. For example, on an adversarially trained CNN for MNIST (which SDP algorithms haven’t previously been applied to), they can verify 87.8% adversarial accuracy, while LP methods can only verify 0.4%.

# OTHER PROGRESS IN AI

## REINFORCEMENT LEARNING

**Does On-Policy Data Collection Fix Errors in Off-Policy Reinforcement Learning?** *(Aviral Kumar et al)* (summarized by Flo): Q-learning finds the optimal **Q**-function **Q*** by updating our estimate **Q(s,a)** for a state-action pair **(s,a)** to get closer to the immediate reward plus the discounted **Q**-value for the best action **a'** in the next state **s'**. To generate samples, we usually pick actions corresponding to high **Q**-values. In bandit problems, where **s'** is always terminal and thus has all **Q**-values at zero, this leads to **corrective feedback**: if we overestimated an action's value, we will pick this action again soon and can quickly correct our misconception. In general MDPs, corrective feedback can be a lot weaker, as our update of **Q(s,a)** also depends on the **Q**-values for the next state: to get corrective feedback, we need somewhat correct **Q**-values for the next state, but to get those we likely needed good values for the second-to-next state, and so on. This is particularly problematic with function approximation, as updating the current state's **Q**-value might worsen the estimates for values down the chain. Consequently, we might see convergence to suboptimal **Q**-functions, unstable learning, or problems with sparse or noisy rewards.
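The update described above, in its standard tabular form (this is textbook Q-learning, not anything specific to this paper; the step size and discount are arbitrary choices):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)  # unvisited pairs default to 0
q_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
print(Q[(0, 1)])  # → 0.5  (halfway toward the target of 1.0)
```

Note how the target leans on the (possibly wrong) values at s', which is exactly the chained dependence the paper identifies.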

To deal with this, we would like to first prioritize correct estimates for states near the end of the chain. But in many branching problems, we actually observe these states least frequently, so that their values are influenced disproportionately by other states' values when function approximation is used. The authors' approach, dubbed DisCor, reweights the data distribution to account for this: we would like to preferentially sample states for which we expect **Q** to be close to **Q*** after the update, and thus give more weight to state-action pairs where we expect the error **|Q*-Q|** to already be small. As we don't know **Q***, we rely on a bound for the error at a state-action pair **(s,a)**: the sum of the magnitudes of previous updates down the chain plus the initial error, discounted by the usual discount rate **γ** as we move back in time. Thus, the error in the next state one step ago is discounted by **γ**, the error in the second-to-next state two steps ago is discounted by **γ** squared, and the initial error is discounted by **γ** to the **k**. This bound can be approximated by a neural network using a SARSA-like update rule, for which the influence of the unknown initial error fades for large **k** due to the discounting.
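The accumulated bound and the resulting downweighting can be sketched directly (function names and the exponential weighting form are my illustration of the idea, not the paper's exact estimator, which learns the bound with a neural network):

```python
import math

def error_bound(update_magnitudes, initial_error, gamma=0.9):
    """Discounted accumulation of past update magnitudes down the chain.

    update_magnitudes[0] is the update one step ago (discount gamma),
    update_magnitudes[1] two steps ago (discount gamma**2), and so on;
    the initial error k steps ago is discounted by gamma**k.
    """
    k = len(update_magnitudes)
    return (sum(m * gamma ** (i + 1) for i, m in enumerate(update_magnitudes))
            + initial_error * gamma ** k)

def discor_weight(bound, temperature=1.0, gamma=0.9):
    # Samples with a large expected error get exponentially less weight.
    return math.exp(-gamma * bound / temperature)

b = error_bound([1.0, 0.5], initial_error=2.0)
print(round(b, 3))  # → 2.925
```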

DisCor is evaluated on MetaWorld tasks in both the single-task and multi-task settings, and SAC augmented with DisCor clearly outperforms plain SAC in many of them. Similar improvements are observed for DQN on Atari.

**Read more:** **Paper: DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction**

**Flo's opinion:** Putting less weight on updating values with fluctuating targets seems like a good idea. As the approach does not require much additional compute if weights are shared for the **Q**-network and the network estimating the bound, and as it seems quite orthogonal to previous improvements to methods based on **Q**-functions, I would not be surprised if it became somewhat widely used.

## DEEP LEARNING

**Gradient Descent: The Ultimate Optimizer** *(Kartik Chandra et al)* (summarized by Rohin): Hyperparameter tuning is an important and tedious step for most applications of machine learning. Often this can cause a project to take significantly longer, as you need to have multiple training runs with different hyperparameters in order to identify which ones work best. How can we do better?

This paper shows that in some cases, you can make the computation involving your hyperparameters differentiable, such that they too can be optimized using gradient descent *during the actual training run*. They show this for SGD and Adam (where for Adam they optimize all four hyperparameters, not just the learning rate). Since these hyperparameters are then optimized using another instantiation of gradient descent, that new instantiation also has its own hyperparameters that can once again be optimized. They show how to build an arbitrarily high “stack” of hyperparameter optimizers.

In practice, building a stack of just 3 or 4 such optimizers makes training very robust to the human's initial choice of hyperparameters, while increasing the cost of training by less than 2x.
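The one-level case can be sketched in plain Python. Since w_t = w_{t-1} - alpha * grad(w_{t-1}), the derivative of the loss at w_t with respect to alpha involves grad(w_t) * grad(w_{t-1}), so alpha gets its own gradient step with a hand-set hyper-learning-rate beta. This is a minimal illustration of the idea (the objective, step counts, and constants are my choices, and the paper's autodiff-based construction is more general):

```python
def hyper_sgd(grad, w, alpha=0.01, beta=0.001, steps=500):
    """SGD on w, with alpha itself updated by gradient descent."""
    prev_grad = 0.0
    for _ in range(steps):
        g = grad(w)
        alpha += beta * g * prev_grad  # hypergradient step on the learning rate
        w -= alpha * g                 # ordinary SGD step on the parameter
        prev_grad = g
    return w, alpha

# Minimize f(w) = w^2 (gradient 2w) starting from w = 1.
w, alpha = hyper_sgd(lambda w: 2 * w, w=1.0)
print(abs(w) < 1e-3, alpha > 0.01)  # converged near 0; alpha grew on its own
```

When consecutive gradients point the same way, alpha increases; when they disagree (overshooting), it decreases, which is what makes the method forgiving of a bad initial alpha.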

**Rohin's opinion:** Fast hyperparameter tuning is pretty important for training models. I particularly like **population-based training** for this purpose, because it doesn’t require your computation to be differentiable. However, when you can make your computation differentiable, this method is probably significantly more efficient (and perhaps also more performant).

# FEEDBACK

I'm always happy to hear feedback; you can send it to me, **Rohin Shah**, by **replying to this email**.

# PODCAST

An audio podcast version of the **Alignment Newsletter** is available. This podcast is an audio version of the newsletter, recorded by **Robert Miles**.
