Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies

post by Rubi J. Hudson (Rubi), Johannes Treutlein (Johannes_Treutlein) · 2023-05-26T17:44:35.575Z

Contents

  Background on Prediction
  Zero-Sum Conditional Predictions
    Stochastic Decisions
    Distributional Shift
    Conditional Predictions and Performative Predictions
  Concerns
    Competitiveness
    Inner Alignment
    Private Information
  Comparisons to Other Approaches
  Future Directions
  Appendix: Proofs

Thanks to Charlotte Siegmann, Caspar Oesterheld, Spencer Becker-Kahn, and Evan Hubinger for providing feedback on this post.

The issue of self-fulfilling prophecies, also known as performative prediction, arises when the act of making a prediction can affect its own outcome. Systems aiming for accurate predictions are then incentivized not only to model what will occur and report these beliefs, but also to use their prediction to influence the world towards more predictable outcomes. Since current state-of-the-art AI systems are trained to predict text, and their multimodal extensions represent a likely path to AGI, it is crucial to ensure that predictive models do not pursue such methods. Live humans are harder to predict than dead ones.

One possible approach to addressing performative prediction is to ask for predictions about the outcome conditional on possible actions that the person asking for predictions could take in response. These actions are the only causal pathway for a prediction to influence the world, so the prediction cannot affect the probability of an outcome if it is conditional on the same response. However, predictions conditional on actions that were ultimately not taken cannot be evaluated, so this strategy introduces a new incentive to affect the action taken by lying about unevaluated conditional distributions, with an impossibility result showing the best action cannot be taken deterministically. Randomizing with full support across all actions would allow for taking the best action with high probability, but this fails if humans cannot commit to taking arbitrarily bad actions based only on a random number generator.

Our contribution is to introduce a mechanism that allows for a decision maker to deterministically take the best action, circumventing the impossibility result by applying a joint prediction scoring rule to a system with two or more predictors. The mechanism works by inducing a zero-sum competition for predictive accuracy, making each predictor indifferent to shifts in the distribution of outcomes caused by the chosen action, since higher variance hurts their competitors exactly as much as it hurts them. A key assumption, which we are hoping to relax in future work, is that all predictors share the same beliefs about conditional distributions.

For this post, we discuss zero-sum conditional predictions as a target for outer alignment, without going into inner alignment issues. However, we will point to the case that prediction is the easiest inner alignment problem that we know of, and note that the same reasons hold for our proposal.

This post marks the beginning of a research project. Going forward, we will be developing the theory further and running experiments to see under what conditions the results hold in practice. Analogies to prediction and decision markets are briefly touched on in this post, and will be explored further in future work. We will also investigate other applications for applying this mechanism, including the prevention of reward signal tampering and reactions to threats. 

Background on Prediction

Rather than trying to directly align an AGI ourselves, a possible alternative is to use powerful predictive models to gather information and use this to take a human-in-the-loop pivotal act. This approach is described in Conditioning Predictive Models. One issue with the approach is performative predictions, where the act of making a prediction affects its outcome, and so optimizing for predictive accuracy can involve pushing for low-variance outcomes. An AI with superhuman predictive abilities can likely use high-dimensional predictions to manipulate humans towards these outcomes. Recent work has shown that performative predictions are typically not accurate after taking their manipulation into account, hamstringing their usefulness even beyond the dangers of manipulation.

To get around this issue, we would like to elicit variants of prediction that do not affect the outcome. One such variant is a counterfactual oracle that predicts what the future would look like in the counterfactual that no one ever saw the prediction it made. The variant we focus on is conditional prediction, where an oracle is asked for predictions conditional on taking various possible actions in response to the prediction, then using the provided predictions to choose our preferred action from that set. 

Conditional prediction is a generalization of counterfactual oracles, where a prediction conditional on the decision to ignore the prediction is the same as the counterfactual prediction. However, conditional prediction is still less general than the conditioning predictive models approach, which can potentially condition on any observables and not just on the reaction to the prediction, allowing for predictions of what would happen in radically different worlds.

A new issue arises with conditional predictions, which is that the predictions conditional on actions not taken cannot be evaluated. In fact, this makes it impossible to incentivize a predictor to report honestly when this information is used to make an optimal decision, a result shown in Decision Rules and Decision Markets. If the decision of which action to take depends on the predictions, the predictor can falsely indicate that certain actions will lead to very undesired outcomes, so that those actions are not taken and the lies are not discovered.

As an example of how this could work, consider a predictor evaluated by log-score being asked to predict whether each of two actions will lead to a good or bad outcome. The predictor knows that the first action leads to the good outcome ⅓ of the time, and the second action leads to the good outcome ½ of the time. If the predictor predicts honestly, then the second action will be taken, the second prediction is evaluated, and their expected prediction score is ½log(½)+½log(½) = log(½). However, if the predictor reports honestly for the first action while saying the second leads to the good outcome only ¼ of the time, then instead the first action is taken, the first prediction is evaluated, and their expected prediction score is ⅓log(⅓)+⅔log(⅔), which is greater than log(½).
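The arithmetic in this example can be checked with a short Python sketch. The decision rule used here (take the action with the highest reported chance of the good outcome) and all function names are illustrative assumptions, not part of the original setup.

```python
import math

def expected_log_score(report, truth):
    """Expected log score of a binary report when the good outcome has probability `truth`."""
    return truth * math.log(report) + (1 - truth) * math.log(1 - report)

# True chances of the good outcome under each action, taken from the example.
truth = {"action_1": 1 / 3, "action_2": 1 / 2}

def chosen_action(reports):
    """Assumed decision rule: take the action with the highest reported chance of a good outcome."""
    return max(reports, key=reports.get)

def predictor_score(reports):
    """Only the prediction for the action actually taken is ever evaluated."""
    action = chosen_action(reports)
    return action, expected_log_score(reports[action], truth[action])

honest = {"action_1": 1 / 3, "action_2": 1 / 2}
lying = {"action_1": 1 / 3, "action_2": 1 / 4}  # understates the second action

honest_action, honest_score = predictor_score(honest)
lying_action, lying_score = predictor_score(lying)
print(honest_action, round(honest_score, 4))  # action_2 -0.6931
print(lying_action, round(lying_score, 4))    # action_1 -0.6365
```

The lie steers the decision to the first action, whose outcome is easier to predict, so the single predictor scores higher even though the decision maker ends up with the worse action.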

The only way to prevent this is for the decision maker to assign some probability to all possible actions, regardless of how bad the predicted outcome is. This necessarily means bad outcomes will occur more frequently than they would if the decision maker could make deterministic decisions based on honest conditional predictions. We might reasonably say we don’t want to ever randomly take an action that leads to the extinction of humanity with high probability, but if this is true then a predictor can lie about that to dissuade us from any action. Even if we would be willing to take such an action with very small probability in order to get honest conditional predictions, we likely cannot commit to following through on such an action if our randomizer lands on it[2]. If this lack of commitment is predicted, then once again we cannot get honest conditional predictions from a system optimizing for its predictive score.

Zero-Sum Conditional Predictions

While Decision Rules and Decision Markets established that it is impossible to deterministically make optimal decisions based on conditional predictions from a single predictor, a system of two can be set up so that they jointly provide honest conditional predictions in equilibrium and allow a decision maker to always take the action that they would prefer under full information. For now, we focus on behavior under the specified goals, rather than how an AI system can be made to learn them.

Consider a strictly proper scoring rule[3] S(p,q) which takes in variables p, representing a prediction over outcomes, and q, representing the true distribution over outcomes. Since the scoring rule is strictly proper, the prediction p=q is the only optimal prediction. When we have two predictors making conditional predictions, let $p^1_\alpha$ and $p^2_\alpha$ be the predictions from the first and second predictor respectively and let $q_\alpha$ be the distribution over outcomes, where a subscript $\alpha$ means conditional on taking action $\alpha$. Here we will make the assumption that $q_\alpha$ is known by both predictors for all actions. This assumption is substantial, and we hope to study how it can be relaxed in future work.

Define the objectives for each predictor to be $S^1(p^1, p^2) = S(p^1_\alpha, q_\alpha) - S(p^2_\alpha, q_\alpha)$ and $S^2(p^1, p^2) = S(p^2_\alpha, q_\alpha) - S(p^1_\alpha, q_\alpha)$, where $\alpha$ is the action chosen[4][5]. That is, each predictor’s score is their score for the chosen action under the strictly proper scoring rule, minus the other predictor’s score. The scores are zero-sum, so they always add up to zero. If one predictor does some amount better, the other one does an equal amount worse.

From the perspective of each agent, the penalty term based on their opponent’s score is fixed. That means they are predicting as though they face a strictly proper scoring rule, and their score is uniquely maximized by reporting honestly for the action that will be taken. Furthermore, since their score when behaving optimally is zero regardless of which action is taken, they have no incentive to change which action gets taken.
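As a minimal sketch of this objective, the snippet below computes both predictors' expected scores for the chosen action, using the log score as one example of a strictly proper rule S. The distributions are made-up numbers and the function names are hypothetical.

```python
import math

def expected_log_score(report, truth):
    """Expected log score S(p, q) for distributions given as lists of probabilities."""
    return sum(q * math.log(p) for p, q in zip(report, truth) if q > 0)

def zero_sum_scores(p1, p2, q):
    """Expected objectives of predictor 1 and predictor 2 for the chosen action."""
    s1 = expected_log_score(p1, q)
    s2 = expected_log_score(p2, q)
    return s1 - s2, s2 - s1

q = [0.5, 0.5]  # assumed true distribution conditional on the chosen action

both_honest = zero_sum_scores(q, q, q)
one_deviates = zero_sum_scores([0.25, 0.75], q, q)

print(both_honest)   # (0.0, 0.0)
print(one_deviates)  # the first entry is negative, the second is its mirror image
```

When both predictors report the true distribution, each expects a score of zero whichever action is taken. A predictor that deviates on the chosen action expects a negative score, and the other predictor gains exactly that amount.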

Now consider a decision maker who looks at the predictions, and always chooses the action leading to the most preferred distribution over outcomes. If the predictors disagree about the probabilities conditional on any action, the decision maker acts as though they believe the more optimistic one. For example, say the decision maker chooses actions based on expected utility[6]. Both predictors indicate that action 1 will lead to an expected utility of nine, while one predictor says action 2 will lead to an expected utility of eight and the other predictor says it will lead to an expected utility of ten. The decision maker treats action 1 as leading to an expected utility of nine and action 2 as leading to an expected utility of ten, thus deciding on the latter. Both predictors know the decision maker will behave in this way, and for some applications this decision making may even be automated.
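The decision procedure in this example can be sketched as follows, assuming each predictor's conditional prediction has already been summarized as an expected utility. The function and variable names are illustrative only.

```python
def choose_action(expected_utilities_1, expected_utilities_2):
    """For each action, believe the more optimistic predictor, then pick the best action."""
    optimistic = {
        action: max(expected_utilities_1[action], expected_utilities_2[action])
        for action in expected_utilities_1
    }
    best = max(optimistic, key=optimistic.get)
    return best, optimistic

# Numbers from the example: the predictors agree on action 1 and disagree on action 2.
predictor_1 = {"action_1": 9, "action_2": 8}
predictor_2 = {"action_1": 9, "action_2": 10}

best, optimistic = choose_action(predictor_1, predictor_2)
print(best, optimistic)  # action_2 {'action_1': 9, 'action_2': 10}
```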

Proposition 1: In any equilibrium for the above model, the decision maker always takes an action in $A^*$, the set of actions that would be most preferable if they knew the true distribution over outcomes for each action. Additionally, both predictors predict the true distribution over outcomes conditional on the chosen action.

The proof for this proposition is shown in the appendix. Here, we consider a slightly simplified corollary, which follows a similar proof.

Corollary 1: Suppose in the above model that there is only a single most preferable action, $\alpha^*$, that the decision maker would take if they knew the true distribution over outcomes for each action. Then, in any equilibrium, the decision maker chooses $\alpha^*$ and $p^1_{\alpha^*} = p^2_{\alpha^*} = q_{\alpha^*}$.

First we show that in equilibrium, there exists no action $\alpha'$ not equal to $\alpha^*$ such that $p^1_{\alpha'} \succ q_{\alpha^*}$ or $p^2_{\alpha'} \succ q_{\alpha^*}$, where $\succ$ denotes the decision maker’s strict preference over distributions of outcomes.

Suppose there were such an $\alpha'$. Then, at least one of the predictors is misrepresenting some action $\alpha'$ not equal to $\alpha^*$ to appear to be the most preferable, and $\alpha'$ will be chosen. If $p^1_{\alpha'} \neq q_{\alpha'}$ and $p^2_{\alpha'} \neq q_{\alpha'}$, then for at least one predictor switching their prediction to $q_{\alpha'}$ would not affect the action taken but would increase their expected score. As such, this cannot be an equilibrium. If $p^1_{\alpha'} \neq q_{\alpha'}$ or $p^2_{\alpha'} \neq q_{\alpha'}$ but not both, then the misrepresenting predictor has a negative expected score. If they reported honestly for all actions, their expected score would be at least zero. So, the misrepresenting predictor can unilaterally increase their score, and this is not an equilibrium either. Thus, no predictor can misrepresent an action to be preferred to $\alpha^*$ in equilibrium.

Next, we show that in equilibrium, $\alpha^*$ is never misrepresented to appear worse than any other action.

Suppose it is. We know that no action is misrepresented to appear preferable to $\alpha^*$. If only one predictor is misrepresenting $\alpha^*$, then it is still chosen by the decision maker’s procedure, and the misrepresenting predictor has a negative expected score. They could unilaterally increase their score by reporting honestly for $\alpha^*$, so this is not an equilibrium. If both predictors are misrepresenting $\alpha^*$, then it is not chosen and either predictor could achieve a positive score by reporting honestly for $\alpha^*$, ensuring it gets chosen. Since scores are zero-sum, at least one of the predictors has an expected score of zero or less when they are both misrepresenting, and so reporting for $\alpha^*$ honestly would improve their expected score, meaning this is not an equilibrium either. Thus, no predictor can misrepresent $\alpha^*$ to appear worse than any other action.

Based on this, $\alpha^*$ will always be chosen since it is not misrepresented to appear worse than any other action, and no actions are misrepresented to appear better. As both predictors face a strictly proper scoring rule, they report honestly regarding the probabilities conditional on the chosen action.

This means that the best action can always be identified, and while it does not guarantee that the predictions conditional on the actions not taken will be accurate, crucially there is no incentive to lie about them. Reporting all conditional probabilities honestly is an equilibrium, and gives as high a score to each predictor as any other. Additionally, there is a bound on how inaccurate the predictions conditional on actions not taken can be. They must be accurate enough such that if the action were taken, their score is at least as high as the highest possible score for a prediction that would convince the decision maker to take that action. Otherwise, the other predictor will make exactly that prediction to secure a positive reward for themselves. This means that the predictions for actions almost as good as the equilibrium action are constrained to be very close to accurate. Together, the lack of incentive to lie and the incentive not to lie too much mean that truth telling may be the default equilibrium, with one of the authors of this post willing to bet that this is what arises empirically.

Here, the existence of extremely good outcomes is actually helpful for disincentivizing dishonesty, at least for expected utility decision makers. A predictor only needs to put some small amount of probability on such an outcome to convince the decision maker to take that action, and can otherwise predict accurately. The threat of the other predictor doing so then forces both to predict at least as well.

Stochastic Decisions

If the decision maker is willing to randomize among some set of the most preferred actions, then for most methods of randomization, the set of actions guaranteed to have honest predictions made can be greatly expanded.

While it is possible to come up with methods of randomization that lead to inaccurate predictions or suboptimal decisions, the regularity conditions on the method of randomization needed to avoid these are minor and cover all intuitive methods. 

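For notation, let $D(\alpha, P)$ be the probability the decision maker assigns to action $\alpha$ when given the matrix of conditional predictions $P$. Write $p_\alpha$ for the prediction in $P$ conditional on action $\alpha$, write $P'$ for an alternative matrix with predictions $p'_\alpha$, and let $\succ$ and $\succeq$ denote the decision maker’s strict and weak preference over distributions of outcomes. Since positive probabilities can be arbitrarily small while still leading to the desired results, it can be helpful to think of $D(\alpha, P) = 0$ as meaning that action $\alpha$ is so bad relative to the other options under $P$ that the decision maker would be unable to follow through on a commitment to take it.

Condition 1: If $D(\alpha, P) > 0$ and $p'_\alpha = p_\alpha$, then $p_{\alpha'} \succeq p'_{\alpha'}$ for all $\alpha' \neq \alpha$ implies $D(\alpha, P') > 0$.

What this condition means is that the decision maker would not stop assigning positive probability to an action just because a different action gets worse. 

Proposition 2: If Condition 1 is met, then in any equilibrium, both predictors predict the true distribution over outcomes conditional on any action chosen with positive probability. 

This is an extension of a basic result for conditional predictions from a single predictor to the zero-sum competition case. The proof is about ruling out some edge cases that zero-sum competition can create, and is not necessary for understanding this post, and so is left to the appendix.

Condition 2: If $D(\alpha, P) > 0$ and $p_{\alpha'} \succ p_\alpha$, then $D(\alpha', P) > 0$.

Condition 3: If $D(\alpha', P) = 0$ and $D(\alpha', P') = 0$, where $P'$ differs from $P$ only in the prediction for $\alpha'$, then $D(\alpha, P) = D(\alpha, P')$ for all $\alpha$.

Condition 4: If $D(\alpha, P) > 0$, and $P'$ differs from $P$ only in that $p'_\alpha \succ p_\alpha$, then $D(\alpha, P') > 0$.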

Condition 2 is straightforward, saying that if some action is assigned positive probability then all actions preferred to it are also assigned positive probability. Under the commitment lens, this means that if the decision maker can credibly commit to taking some action, they can also credibly commit to taking any action they like more. Condition 3 says that if some action is not taken, then changing its conditional distribution in a way that still does not result in it being taken will not change the probabilities assigned to other actions. That is, once an action is bad enough to be ruled out, the decision maker does not consider exactly how bad it is when deciding between their other options. Finally, Condition 4 says that when the decision maker assigns positive probability to some action, making it appear better while holding all else constant won’t cause the decision maker to instead assign it zero probability.

Proposition 3: If Conditions 1-4 are met, then the decision maker assigns the same probability to all actions that they would if they knew the true distributions.

The proof is largely ruling out edge cases and otherwise similar to the proof for Proposition 1, so it is left to the appendix.

Proposition 3 generalizes Proposition 1 to the stochastic choice case. Not only can the decision maker always choose their most preferred action as though they knew the true conditional distributions, they can even randomize among any number of the top actions. Furthermore, as per Proposition 2, they will get honest predictions for any action to which they are willing and able to assign even the smallest amount of probability.

Like the case with a single predictor, if the decision maker follows a procedure that assigns some probability to all actions, like Softmax, then the only equilibrium is one where both predictors provide honest predictions conditional on all actions. The major advantage of zero-sum competition over the single predictor case is that it is not necessary for the decision maker to be willing to randomize over all actions, or able to commit to doing so. The decision maker can get accurate predictions for all actions they would be willing to take if they had full information, and identification of actions they would not be willing to take. 

Most reasonable methods of randomization meet all of the conditions outlined above. Some possible ways the decision maker might be willing to randomize include assigning positive probability to all actions valued above some threshold, or to all actions not too much worse than their best option. In many cases, almost all probability would be concentrated on a single action, with a very small amount spread across others in order to get accurate predictions, but in other cases (such as quantilizers) more randomization would be desirable.
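Two of the randomization methods mentioned in this section can be sketched as below: a full-support Softmax rule and a rule that only explores actions not too much worse than the best option. This is an illustrative sketch under assumptions. The `tolerance` and `explore` parameters, the function names, and the example values are all hypothetical, and the values stand for how the decision maker rates each action under the reported predictions.

```python
import math

def softmax_rule(values, temperature=1.0):
    """Full-support rule: every action receives some positive probability."""
    top = max(values.values())
    weights = {a: math.exp((v - top) / temperature) for a, v in values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def near_best_rule(values, tolerance, explore=0.01):
    """Concentrate on the best action, spreading `explore` over actions within `tolerance` of it."""
    best = max(values, key=values.get)
    near = [a for a, v in values.items() if a != best and values[best] - v <= tolerance]
    probs = {a: 0.0 for a in values}
    probs[best] = 1.0 - (explore if near else 0.0)
    for a in near:
        probs[a] = explore / len(near)
    return probs

values = {"a": 10.0, "b": 9.5, "c": 2.0}
soft = softmax_rule(values)
near = near_best_rule(values, tolerance=1.0)
print(soft)
print(near)  # "c" is ruled out entirely, while "b" keeps a small positive probability
```

Under the second rule, the decision maker never has to commit to taking the action they consider far worse, which is the situation where the single-predictor approach breaks down.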

The presented conditions are sufficient, but not necessary, and some methods of randomization can get the desired results without fulfilling them. For example, if the decision maker groups actions into categories and wants to randomize across their best action in each category, Condition 2 is violated since the second best action in one category may be preferred to the best action in another, but the decision maker is still able to identify the best actions in each category and get accurate forecasts for them. 

Distributional Shift

Zero-sum competition leads to honest conditional predictions by making both of the predictors indifferent to distributional shifts. While this is primarily of interest for individual predictions, it also applies across predictions, creating a myopia-like property.

We can think of myopia as being comprised of the following three aspects:

  1. Indifference over the distribution of inputs in future episodes
  2. Indifference over the timing and frequency of future episodes
  3. Indifference over the number of future episodes

Zero-sum competition induces the first aspect, since all distributions of inputs provide the same expected score. Note, though, that the absence of incentives is not the incentive for absence. A model is not incentivized to avoid shifting the future distribution either. If the best action within a period also shifts the distribution, they would still take that action. A pair of models trained with zero-sum rewards can still shift the distribution, and this shift in distribution can still be undesirable or dangerous.

What this means is that when training these models, there is a drastically reduced incentive to develop non-myopia. If the model remains myopic, that is likely sufficient to prevent deceptive mesa-optimization, as it then has no desire to form long-term plans that give up current value for future value. Predictive models already represent one of the easiest inner alignment problems we know of, due to the simplicity of the training objective, and zero-sum competition with roughly similar models does not add much complexity.

This indifference to distributional shift is not necessarily unique to zero-sum competition. It might be achieved in a simpler way, such as by setting the discount rate to zero in reinforcement learning so that all future episodes are ignored. We are currently looking for other applications where zero-sum competition and a zero discount rate lead to different behavior.

Conditional Predictions and Performative Predictions

The question remains whether getting honest conditional predictions actually eliminates performativity in predictions. In one sense, it does, since if you get predictions conditional on every possible action you can take, there is no room left for performativity. However, it may be that the actions conditioned on are underspecified, which then still allow for some performativity.

As an example, consider the case where a decision maker is deciding between either pizza or a hamburger for lunch. They get conditional predictions on what rating they will give to their meal after they’re done. Since getting a burger and getting pizza are both underspecified actions, the expert could try to use their prediction to push the decision maker to choose a meal at a more standardized, easy to predict restaurant. If there are multiple fixed points to choose from, the predictor can even provide honest conditional predictions while still manipulating the decision maker to choose one action over another. 

On the other hand, the more the action is specified, the less freedom the predictor has for performativity. Specifying the type of food and the restaurant is harder to influence than just the type of food, and specifying the exact menu item is even harder still. Full specification eliminates performativity, and merely high amounts of specification may make it inconsequential. However, there may be an enormous number of actions, which would make predicting and analyzing them all infeasible.

Fortunately, it is not necessary to elicit predictions for each possible action. The decision maker can instead break down the options into categories and subcategories, then use a conditional prediction to eliminate all actions not in their preferred category. In the example above, they can first elicit predictions conditional on hamburgers or pizza, make their choice, and elicit further predictions conditional on each restaurant for the chosen type of food. Predictors can anticipate this and backward induct, so that the preferred distribution over outcomes within a category is always predicted conditional on that category. The decision maker ends up with their globally preferred action without needing to query the entire set. 

Proposition 4: If there are n possible actions to take, a decision maker can identify their most preferred action from among them while making at most $O(\log n)$ comparisons between actions.

Proof: The decision maker proceeds as follows: they start by splitting the set of actions into two subsets of equal size (or with a one element difference). They ask for predictions conditional on deciding to take some action from each of the two sets. Based on the answer, they select which set to take an action from and repeat the procedure on that set. Eventually, they reach a set of size 1, at which point they take the action in that set.

It is clear that this takes $O(\log n)$ comparisons, since each comparison roughly halves the set of remaining actions. It remains to show that the procedure disincentivizes performativity. We will show this via induction on the size of the two sets that are compared.

First, if both sets have size less than or equal to 1, then by Proposition 1, the decision maker will choose the better action.

Next, assume we know that the result follows for comparisons between any two sets of size at most n-1. We want to conclude that it also holds for comparisons of sets of size at most n.

Consider two such sets, denoted $A_1$ and $A_2$. Without loss of generality, assume the set $A_1$ is chosen. Then the decision maker will next split up that set into two sets, which are necessarily of size at most $n-1$. By the inductive assumption, they will eventually choose the best action from either set, and thus from set $A_1$. We can thus conclude that the distribution over outcomes conditional on choosing set $A_1$ is equal to the distribution over outcomes conditional on taking the best action in set $A_1$.

Due to the above, we can replace $A_1$ by the best action in $A_1$, and $A_2$ by the best action in $A_2$, without changing the distribution of outcomes obtained when choosing either of the options. At this point, we apply Proposition 1 to conclude that, when using the zero-sum objective, the decision maker will end up choosing the better of the two distributions. Hence, the decision maker will also choose the preferable set. This concludes the inductive step.
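The splitting procedure from the proof can be sketched in Python. The callback `prefer_first` is a hypothetical stand-in for eliciting zero-sum conditional predictions on the two sets and comparing them. In this toy version it is assumed that predictions are honest, so each set is valued at its best action, as in the inductive argument. The action values are made up.

```python
import math

def find_best_action(actions, prefer_first):
    """Repeatedly split the candidates in two and keep the preferred half."""
    candidates = list(actions)
    comparisons = 0
    while len(candidates) > 1:
        mid = (len(candidates) + 1) // 2  # halves differ in size by at most one
        first, second = candidates[:mid], candidates[mid:]
        comparisons += 1
        candidates = first if prefer_first(first, second) else second
    return candidates[0], comparisons

# Hypothetical true values of ten actions.
values = {f"action_{i}": v for i, v in enumerate([3, 7, 1, 9, 4, 6, 2, 8, 5, 0])}

def prefer_first(first, second):
    """Assumes honest predictions, so a set is worth as much as its best action."""
    return max(values[a] for a in first) >= max(values[a] for a in second)

best, comparisons = find_best_action(values, prefer_first)
print(best, comparisons)  # action_3 3
```

With ten actions, the worst case is four comparisons, the ceiling of log2(10). This particular run needs three because one of the halves it keeps has only two elements.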

This process is indifferent to how the set of actions is split into subsets if the decision maker is choosing their most preferred action. However, the choice of how to split can affect the outcome if the decision maker is randomizing based on how much they prefer each action in a set, such as with Softmax.

Concerns

There are a number of concerns that readers may have about this zero-sum training. Some we share, while others we believe can be addressed.

Competitiveness

The first concern that comes to mind is whether this proposal is competitive. We divide this into two separate questions: the cost of training multiple models, and the performance of models indifferent to distributional shift.

Training a second model from scratch would roughly double the cost of the training process. Fortunately, there exist techniques, such as weight-sharing, snapshots, and swapout, that would drastically reduce the cost of training a second network once we have the first. A concern with using these methods for most alignment techniques that consist of pitting models against each other is that they can result in models with correlated behavior that then collaborate. Here, though, this correlation is desirable, as we want the models to be as capable as each other, in the same types of situations.

A myopia-like lack of incentive for distributional shift across episodes may represent a significant performance penalty for some applications, but for making predictions it is desirable. We generally do not want an AI system to systematically change what we are predicting, especially in ways that increase predictive score. Even if this change was wanted, myopia is such a desirable safety property, making both inner alignment and corrigibility much easier, that it is likely worthwhile or necessary to trade off against the performance hit.

Inner Alignment

Pitting two models against each other in a zero-sum competition only works so long as both models actually learn the desired goals. Otherwise, they may be able to reach a compromise with each other and cooperate towards a non-zero-sum objective.

[Image: meme. Credit to Leo Gao.]

However, this story relies on deceptive mesa-optimizers arising, which requires that the models are capable of long-term planning. With no incentive for distributional shifts, this is much less likely to arise. Zero-sum competition helps prevent the development of deceptive mesa-optimizers, which ensures that the zero-sum competition continues even after models develop strategic awareness.

Private Information

The most predictable way zero-sum competition can fail is if one of the models is consistently better than the other at predicting. Then, the incentive reemerges to change the decision maker’s actions to where the performance gap is largest. Of course, the other model wants to shift the distribution away from these actions, but the capabilities mismatch or aspects of the decision making procedure could prevent these from canceling out.

As an example, consider two predictors specializing in predicting the rating a user will give conditional on the movie they watch. One looks at a variety of user and movie characteristics to make highly personalized predictions, while the other acts similarly for most movies but always predicts a definite rating of zero for Pulp Fiction. There is now a benefit to the first model from shifting the choice of movie to be Pulp Fiction, because only the first model can take advantage of it, so the second model no longer acts as a check on the first.
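A small numerical sketch of this failure mode follows. It uses the quadratic (Brier-style) score, a strictly proper rule that stays finite even when a predictor reports a definite rating. The rating distributions and the second movie are invented for illustration.

```python
def expected_quadratic_score(report, truth):
    """Expected quadratic score: strictly proper, and bounded even for definite reports."""
    return 2 * sum(p * q for p, q in zip(report, truth)) - sum(p * p for p in report)

# Hypothetical true distributions over the ratings (0, 1, 2) for each movie.
truth = {"Pulp Fiction": [0.1, 0.3, 0.6], "Other Movie": [0.2, 0.5, 0.3]}

predictor_1 = dict(truth)  # accurate, personalized predictions for every movie
predictor_2 = {
    "Pulp Fiction": [1.0, 0.0, 0.0],     # always predicts a definite rating of zero
    "Other Movie": truth["Other Movie"],  # matches the first predictor elsewhere
}

def first_predictor_edge(movie):
    """Expected zero-sum score of the first predictor if this movie is chosen."""
    s1 = expected_quadratic_score(predictor_1[movie], truth[movie])
    s2 = expected_quadratic_score(predictor_2[movie], truth[movie])
    return s1 - s2

for movie in truth:
    print(movie, round(first_predictor_edge(movie), 2))
```

The first predictor expects zero on the movie where the two models agree and a strictly positive score on Pulp Fiction, so it is no longer indifferent to which movie gets chosen.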

The hope is that, since both models will be trained concurrently, if they start at the same performance level then they will stay roughly equivalent, with the competitive process closing any small gaps that may pop up. Confidence in whether this effect actually occurs and what kind of gaps are tolerable will require further research, both theoretical modeling and empirical experiments.

Comparisons to Other Approaches

Zero-sum competition also shares similarities with debate and market making as approaches to AI safety. 

Debate is a solution to the problem of evaluating proposed actions or answers by AI models. Two AI agents (or sometimes more) are given the goals of convincing an arbiter of particular evaluations. This is necessarily zero-sum, since the arbiter settling on one side means they do not settle on the other. The biggest hurdle for debate is that being convincing can diverge from being correct, so it is possible that the arbiter is misled.

Zero-sum competition focuses on the distinct problem of performativity in prediction, evading debate’s issue with mistakes in the arbiter’s evaluation by focusing on a problem with objective resolution criteria. The key difference between the debate and zero-sum competition games is that in the former, agents are competing to do better at the same task, while in the latter they are competing to do better at opposing tasks.

Market making is a similar approach to debate. One agent tries to predict what the arbiter will believe at the end of the process, including all the arguments that lead them there, and the other agent tries to generate arguments that move the arbiter as far away from that prediction as possible. In equilibrium, the first agent acts like a market maker, predicting the point that beliefs will converge to after hearing all arguments that can shift them.

Decision markets help extend this approach to generating conditional predictions. In a decision market, a security exists for each action-outcome pair. It pays out a unitary amount if that action and outcome are realized, zero if that action but an alternate outcome is realized, and is canceled if an alternate action is taken. If the quantity demanded by the market is p, the cost of purchasing an additional q-p units is the score for predicting q minus the score for predicting p. Decision markets have been suggested for use in futarchy, a system of government where officials define goals but rely on competitive conditional prediction markets to determine which policies are most likely to meet them.

Decision markets suffer from the same issues as decision scoring, where incentives to misprice securities exist unless the decision maker can credibly commit to choosing every action with some positive probability. However, this result is based on agents acting sequentially, with later agents able to incorporate the information provided by earlier agents. Zero-sum competition would be equivalent to both agents acting simultaneously, with one agent acting as a market maker to set the quantities/prices and the other buying or selling from the first. It does not matter which agent is which: the equilibrium strategy is to choose the same quantities as the probabilities they would predict in the zero-sum competition setup.

Future Directions

The most important question to answer is whether this mechanism works in practice, which will require running experiments. To do that, a training process will need to be developed and implemented. A straightforward experiment would be using a toy environment to compare a conditional prediction model trained on its own with a pair trained under zero-sum competition. The incentives would push the solo model to misrepresent predictions and the paired models to predict honestly, and the first test is to see if the models learn this behavior.
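A toy experiment of this kind needs a scoring function for the paired models. The following is a minimal sketch, assuming a logarithmic scoring rule and assuming each predictor's score is its own score minus the other's for the action actually taken; the function names and the example numbers are illustrative only.

```python
import math

def log_score(prediction, outcome):
    """Logarithmic scoring rule, which is strictly proper."""
    return math.log(prediction[outcome])

def zero_sum_scores(pred_1, pred_2, outcome):
    """Each predictor receives its own score minus the other's, so the two sum to zero."""
    s1 = log_score(pred_1, outcome)
    s2 = log_score(pred_2, outcome)
    return s1 - s2, s2 - s1

def expected_score(own, other, truth):
    """Expected zero-sum score for `own` when outcomes are drawn from `truth`."""
    return sum(p * (log_score(own, o) - log_score(other, o))
               for o, p in truth.items())

# Assumed true distribution for the chosen action, and an exaggerated report.
truth = {"good": 0.7, "bad": 0.3}
exaggerated = {"good": 0.9, "bad": 0.1}

# Expected score of an honest predictor facing an exaggerating opponent.
honest_edge = expected_score(truth, exaggerated, truth)
# Realized scores for one sampled outcome.
realized = zero_sum_scores(truth, exaggerated, "bad")
```

Under these assumptions the honest predictor's expected score against a misreporting opponent is positive, which is the behavior the first experiment would check that training actually finds.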

Once it is established that this can be made to work at all, the next question is under what conditions. Experiments could compare different methods for generating the two models and evaluate the impact of that choice, including testing how similar models should be to start. It would also be worthwhile to test different levels of starting capabilities across the two models, and see how that gap evolves in training as well as how it affects behavior.

On the theory side, we would like to model differences in capabilities or informational differences between the two AI systems. This could take the form of private signals about the true conditional distributions. The goal would be to understand under what conditions zero-sum competition incentivizes honest reporting, and what the incentives are in the case that it does not. Ideally, this could provide ideas on how to make zero-sum competition more robust.

In addition to the zero-sum competition setup focused on in this post, we are also interested in a market making setup, where we simultaneously train a model to act as a market maker for conditional prediction markets and another model to buy and sell to exploit any mispricings. Both theoretical work and experiments will be necessary to explore this approach.

Finally, we would like to identify other opportunities besides conditional prediction where zero-sum competition provides an advantage not granted by other methods. The use case on which we are focusing for now is the elimination of incentives for tampering with the reward process in reinforcement learning.

Appendix: Proofs

Proposition 1: In any equilibrium for the above model, the decision maker always takes an action in , the set of actions that would be most preferable if they knew the true distribution over outcomes for each action. Additionally, both predictors predict the true distribution over outcomes conditional on the chosen action.

Proof: 

First we show that in equilibrium, no action is misrepresented to appear better than any action in . Suppose one is. Then, at least one of the predictors is misrepresenting some other action  to appear to be the most preferable, and  will be chosen. If both of the predictors are misrepresenting , then for at least one of them unilaterally switching to reporting honestly for  would not change the action taken but would increase their expected score. As such, this cannot be an equilibrium. If one of the predictors is already predicting honestly for , then the misrepresenting predictor has a negative expected score. If they reported honestly for all actions, their expected score would be at least zero. So, the misrepresenting predictor can unilaterally increase their score, and this is not an equilibrium either. Thus, no predictor can misrepresent an action to be better than any action in  in equilibrium.

Next, we show that in equilibrium, the set of actions in  is never misrepresented to appear worse than the true distribution for any action in 

Suppose it is. We know that no action is misrepresented to appear better than any action in  . If only one predictor is misrepresenting all actions in , then some  in  is still chosen by the decision maker’s procedure, and the misrepresenting predictor has a negative expected score. They could unilaterally increase their score by reporting honestly for , so this is not an equilibrium. If both predictors are misrepresenting all actions in , then either could achieve a positive score by reporting honestly for some   in , which would ensure it gets chosen. Since scores are zero-sum, at least one of the predictors has an expected score of zero or less when they are both misrepresenting, and so reporting honestly would improve their expected score, meaning this is not an equilibrium either.  Thus, no predictor can misrepresent all actions in  to appear worse than the true distribution for any action in  .

Based on this, an action in  will always be chosen since at least one is not misrepresented to appear worse, and no actions are misrepresented to appear better. As both predictors face a strictly proper scoring rule, they report honestly regarding the probabilities conditional on the chosen action. 

Proposition 2: If Condition 1 is met, then in any equilibrium, both predictors predict the true distribution over outcomes conditional on any action chosen with positive probability. 

Condition 1: If  and , then for all  implies  

Proof:

This condition ensures that in equilibrium, the expected score conditional on each action is zero for both predictors. Suppose it were not. Since the unconditional expected score for both predictors must be zero in equilibrium, there must be different actions that lead to a positive expected score for each predictor.

If an action leads to a negative expected score for one predictor in equilibrium, the decision maker must prefer their predicted distribution to the other predictor’s. Otherwise, they could change their prediction for that action to match the other’s without affecting the decision maker’s action distribution, which would unilaterally increase their score. 

Then a predictor could change their conditional predictions to match the other’s for each action leading to a negative expected score. By Condition 1, any action originally assigned positive probability besides the ones for which the conditional predictions changed must still be assigned positive probability. This means there are some actions assigned positive probability that lead to a positive expected score for the first predictor, but no actions assigned positive probability that lead to a negative expected score, so the overall expected score is positive, which contradicts that this is an equilibrium. So, if Condition 1 holds, the expected score conditional on each action is zero for both predictors.

Since the expected score conditional on each action is zero for both predictors in equilibrium, shifting the distribution of actions does not affect expected score. This means maximizing unconditional expected score is equivalent to maximizing each conditional expected score independently. Since each predictor effectively faces a strictly proper scoring rule, this can only be done by predicting honestly for each action taken with positive probability. 

Proposition 3: If Conditions 1-4 are met, then the decision maker assigns the same probability to all actions that they would if they knew the true distributions.

Condition 1: If  and , then for all  implies  

Condition 2: If  and  then 

Condition 3: If  and  , then  for all a.

Condition 4: If  , and  then 

Proof:

Let  be the set of actions the decision maker would assign positive probability if they knew the true distribution, and  be the set of actions the decision maker would assign zero probability if they knew the true distribution. 

First we show that in equilibrium, no action in  is assigned positive probability. Suppose not for some non-empty set of actions . By Proposition 2, both predictors must predict the true distribution for actions in .  Condition 3 means that misrepresentations of actions in  but not  cannot affect the probabilities assigned to actions in , so there must be a misrepresentation for actions in . Again by Proposition 2, there cannot be misrepresentations for actions assigned positive probability, so actions in some non-empty set  are misrepresented to be assigned zero probability. By Condition 2, this means that every action in  is misrepresented to be worse than every action in , and since both predictors predict the true distributions for all actions in , this must mean that both predictors are misrepresenting each action in .

Then a predictor could unilaterally switch to predicting honestly for . If they did so, the decision maker would have accurate predictions for  and for , plus the predicted distributions for actions in  but not  are all less preferred than for all actions in . They would then make the same decisions as if they knew the true distribution, assigning positive probability to actions in , which would give the predictor who switched a positive expected score. Therefore, this cannot be an equilibrium, and so no action in  is assigned positive probability in equilibrium.

Next, we show that in equilibrium, no action in  is assigned zero probability. Suppose not for some non-empty set of actions . Since no action in  is assigned positive probability and Condition 3 means that misrepresentations of actions in  that do not result in them being assigned positive probability do not affect the distribution over actions in , and the true distributions are predicted for actions in  but not , it must be that some actions in  are misrepresented.

It cannot be that all misrepresentations make actions appear better than they are. If that were true, then by Condition 4 there would be at least one misrepresented action assigned positive probability. Each misrepresentation to appear better can only cause other actions to be assigned zero probability, and by Condition 2 misrepresentations of actions assigned zero probability cannot affect others, so there cannot be a loop of misrepresented actions that ensure the others are assigned zero probability. So, some actions in  must be misrepresented to appear worse than they are, which means both predictors are misrepresenting them. Then either predictor could unilaterally switch to predicting honestly for all such actions, eliminating the misrepresentation and ensuring at least one action in  is assigned positive probability. This would give the predictor who switched a positive expected score, so this cannot be an equilibrium, and therefore no action in  is assigned zero probability in equilibrium.

Since all actions in  are assigned positive probability, by Proposition 2 both predictors predict the true distribution over outcomes conditional on any action chosen with positive probability. Condition 3 makes it so that the predictions for actions in  do not affect the probabilities assigned to actions in , so all actions in  must be assigned the same probability as if the decision maker knew the true distributions for all actions. Since actions in  are also assigned the same probability as if the decision maker knew the true distributions, all actions are assigned as if the decision maker knew the true distributions. 


 

  1. ^

    A paper based on this post has been accepted at UAI 2023, arxiv version link will be edited in shortly

  2. ^

    Delegating to a modular AI setup may make such commitment possible, for example with one module suggesting actions, another providing conditional predictions on outcomes, and a third evaluating the distributions over outcomes

  3. ^

    The scoring rule or set of allowable predictions should be restricted so that the score is always finite and we don’t end up adding or subtracting infinities

  4. ^

    If an action is chosen for which conditional predictions were not elicited, assign a score of zero

  5. ^

    We can extend this to the case with n predictors by making the score 

  6. ^

    The decision maker does not have to assign actual numerical utilities to distributions, as long as they have a preference ranking over distributions

     

13 comments

Comments sorted by top scores.

comment by Caspar Oesterheld (Caspar42) · 2023-05-28T20:09:48.333Z · LW(p) · GW(p)

Nice post!

Miscellaneous comments and questions, some of which I made on earlier versions of this post. Many of these are bibliographic, relating the post in more detail to prior work, or alternative approaches.

In my view, the proposal is basically to use a futarchy / conditional prediction market design like the one proposed by Hanson, with, I think, two important details:
- The markets aren't subsidized. This ensures that the game is zero-sum for the predictors -- they don't prefer one action to be taken over another. In the scoring rules setting, subsidizing would mean scoring relative to some initial prediction $p_0$ provided by the market. Because the initial prediction might differ in how bad it is for different actions, the predictors might prefer a particular action to be taken. Conversely, the predictors might have no incentive to correct an overly optimistic prediction for one of the actions if doing so causes that action not to be taken. The examples in Section 3.2 of the Othman and Sandholm paper show these things.
- The second is "optimism bias" (a good thing in this context): "If the predictors disagree about the probabilities conditional on any action, the decision maker acts as though they believe the more optimistic one." (This is as opposed to taking the market average, which I assume is what Hanson had in mind with his futarchy proposal.) If you don't have optimism bias, then you get failure modes like the ones pointed out in Obstacle 1 of Scott Garrabrant's post "Two Major Obstacles for Logical Inductor Decision Theory [AF · GW]": One predictor/trader could claim that the optimal action will lead to disaster and thus cause the optimal action to never be taken and her prediction to never be tested. This optimism bias is reminiscent of some other ideas. For example some ideas for solving the 5-and-10 problem are based on first searching for proofs of high utility. Decision auctions also work based on this optimism. (Decision auctions work like this: Auction off the right to make the decision on my behalf to the highest bidder. The highest bidder has to pay their bid (or maybe the second-highest bid) and gets paid in proportion to the utility I obtain.) Maybe getting too far afield here, but the UCB term in bandit algorithms also works this way in some sense: if you're still quite unsure how good an action is, pretend that it is very good (as good as some upper bound of some confidence interval).


My work on decision scoring rules describes the best you can get out of a single predictor. Basically you can incentivize a single predictor to tell you what the best action is and what the expected utility of that action is, but nothing more (aside from some degenerate cases).

Your result shows that if you have two predictors with the same information, then you can get slightly more: you can incentivize them to tell you what the best action is and what the full distribution over outcomes will be if you take the action.

You also get some other stuff (as you describe starting from the sentence, "Additionally, there is a bound on how inaccurate..."). But these other things seem much less important. (You also say: "while it does not guarantee that the predictions conditional on the actions not taken will be accurate, crucially there is no incentive to lie about them." But the same is true of decision scoring rules for example.)

Here's one thing that is a bit unclear to me, though.

If you have two predictors that have the same information, there's other, more obvious stuff you can do. For example, here's one:
- Ask Predictor 1 for a recommendation for what to do.
- Ask Predictor 2 for a prediction over outcomes conditional on Predictor 1's recommendation.
- Take the action recommended by Predictor 1.
- Observe an outcome o with a utility u(o).
- Pay Predictor 1 in proportion to u(o).
- Pay Predictor 2 according to a proper scoring rule.

In essence, this is just splitting the task into two: There's the issue of making the best possible choice and there's the issue of predicting what will happen. We assign Predictor 1 to the first and Predictor 2 to the second problem. For each of these problems separately, we know what to do (use proper (decision) scoring rules). So we can solve the overall problem.
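The split mechanism in the list above can be sketched as follows. This is only an illustration under stated assumptions: the callables `recommend`, `predict`, `utility`, and `sample_outcome` are hypothetical stand-ins for Predictor 1, Predictor 2, the decision maker's u, and the environment, and the logarithmic rule is one possible choice of proper scoring rule.

```python
import math

def run_split_mechanism(recommend, predict, utility, sample_outcome):
    """One round of the two-predictor mechanism described in the list above."""
    action = recommend()                   # Predictor 1 recommends an action
    prediction = predict(action)           # Predictor 2 predicts outcomes given it
    outcome = sample_outcome(action)       # the action is taken, outcome observed
    pay_1 = utility(outcome)               # Predictor 1 paid in proportion to u(o)
    pay_2 = math.log(prediction[outcome])  # Predictor 2 paid by a log score
    return action, outcome, pay_1, pay_2

# Hypothetical stand-ins, made deterministic so the example is reproducible.
def recommend():
    return "a1"

def predict(action):
    return {"good": 0.8, "bad": 0.2}

def utility(outcome):
    return {"good": 1.0, "bad": 0.0}[outcome]

def sample_outcome(action):
    return "good"

result = run_split_mechanism(recommend, predict, utility, sample_outcome)
```

Note that Predictor 1's payment requires evaluating u(o) on the realized outcome, which is the point the replies below discuss.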

So this mechanism also gets you an honest prediction and an honest recommendation for what to do. In fact, one advantage of this approach is that honesty is maintained even if the Predictors 1 and 2 have _different_ information/beliefs! (You don't get any information aggregation with this (though see below). But your approach doesn't have any information aggregation either.)

As per the decision scoring rules paper, you could additionally ask Predictor 1 for an estimate of the expected utility you will obtain. You can also let the Predictor 2 look at Predictor 1's prediction (or perhaps even score Predictor 2 relative to Predictor 1's prediction). (This way you'd get some information aggregation.) (You can also let Predictor 1 look at Predictor 2's predictions if Predictor 2 starts out by making conditional predictions before Predictor 1 gives a recommendation. This gets more tricky because now Predictor 2 will want to mislead Predictor 1.)

I think your proposal for what to do instead of the above is very interesting and I'm glad that we now know that this method exists and that it works. It seems fundamentally different and it seems plausible that this insight will be very useful. But is there some concrete advantage of zero-sum conditional prediction over the above method?

Replies from: Rubi, Caspar42, sharmake-farah
comment by Rubi J. Hudson (Rubi) · 2023-05-29T11:10:27.202Z · LW(p) · GW(p)

Thanks Caspar, your comments here and on earlier drafts are appreciated. We'll expand more on the positioning within the related literature as we develop this into a paper.

As for your work on Decision Scoring Rules and the proposal in your comment, the biggest distinction is that this post's proposal does not require specifying the decision maker's utility function in order to reward one of the predictors and shape their behavior into maximizing it. That seems very useful to me, as if we were able to properly specify the desired utility function, we could skip using predictive models and just train an AI to maximize that instead (modulo inner alignment). 

Replies from: Caspar42
comment by Caspar Oesterheld (Caspar42) · 2023-05-29T17:46:05.712Z · LW(p) · GW(p)

>the biggest distinction is that this post's proposal does not require specifying the decision maker's utility function in order to reward one of the predictors and shape their behavior into maximizing it.

Hmm... Johannes made a similar argument in personal conversation yesterday. I'm not sure how convinced I am by this argument.

So first, here's one variant of the proper decision scoring rules setup where we also don't need to specify the decision maker's utility function: Ask the predictor for her full conditional probability distribution for each action. Then take the action that is best according to your utility function and the predictor's conditional probability distribution. Then score the predictor according to a strictly proper decision scoring rule. (If you think of strictly proper decision scoring rules as taking only a predicted expected utility as input, you have to first calculate the expected utility of the reported distribution, and then score that expected utility against the utility you actually obtained.) (Note that if the expert has no idea what your utility function is, they are now strictly incentivized to report fully honestly about all actions! The same is true in your setup as well, I think, but in what I describe here a single predictor suffices.) In this setup you also don't need to specify your utility function.

One important difference, I suppose, is that in all the existing methods (like proper decision scoring rules) the decision maker needs to at some point assess her utility in a single outcome -- the one obtained after choosing the recommended action -- and reward the expert in proportion to that. In your approach one never needs to do this. However, in your approach one instead needs to look at a bunch of probability distributions and assess which one of these is best. Isn't this much harder? (If you're doing expected utility maximization -- doesn't your approach entail assigning probabilities to all hypothetical outcomes?) In realistic settings, these outcome distributions are huge objects!

Replies from: Rubi
comment by Rubi J. Hudson (Rubi) · 2023-06-01T20:34:41.161Z · LW(p) · GW(p)

I think, from an alignment perspective, having a human choose their action while being aware of the distribution over outcomes it induces is much safer than having it effectively chosen for them by their specification of a utility function. This is especially true because probability distributions are large objects. A human choosing between them isn't pushing in any particular direction that can make it likely to overlook negative outcomes, while choosing based on the utility function they specify leads to exactly that. This is all modulo ELK, of course.

I'm not sure I understand the variant you proposed. How is that different than the Othman and Sandholm MAX rule?

Replies from: Caspar42
comment by Caspar Oesterheld (Caspar42) · 2023-06-15T07:54:05.640Z · LW(p) · GW(p)

>I'm not sure I understand the variant you proposed. How is that different than the Othman and Sandholm MAX rule?

Sorry if I was cryptic! Yes, it's basically the same as using the MAX decision rule and (importantly) a quasi-strictly proper scoring rule (in their terminology, which is basically the same up to notation as a strictly proper decision scoring rule in the terminology of the decision scoring rules paper). (We changed the terminology for our paper because "quasi-strictly proper scoring rule w.r.t. the max decision rule" is a mouthful. :-P) Does that help?

>much safer than having it effectively chosen for them by their specification of a utility function

So, as I tried to explain before, one convenient thing about using proper decision scoring rules is that you do not need to specify your utility function. You just need to give rewards ex post. So one advantage of using proper decision scoring rules is that you need less of your utility function not more! But on to the main point...

>I think, from an alignment perspective, having a human choose their action while being aware of the distribution over outcomes it induces is much safer than having it effectively chosen for them by their specification of a utility function. This is especially true because probability distributions are large objects. A human choosing between them isn't pushing in any particular direction that can make it likely to overlook negative outcomes, while choosing based on the utility function they specify leads to exactly that. This is all modulo ELK, of course.

Let's grant for now that from an alignment perspective the property you describe is desirable. My counterargument is that proper decision scoring rules (or the max decision rule with a scoring rule that is quasi-strictly proper w.r.t. the max decision rule) and zero-sum conditional prediction both have this property. Therefore, having the property cannot yield an argument to favor one over the other.

Maybe put differently: I still don't know what property it is that you think favors zero-sum conditional prediction over proper decision scoring rules. I don't think it can be not wanting to specify your utility function / not wanting the agent to pick actions based on their model of your utility function / wanting to instead choose yourself based on reported distributions, because both methods can be used in this way. Also, note that in both methods the predictors in practice have incentives that are determined by (their beliefs about) the human's values. For example, in zero-sum conditional prediction, each predictor is incentivized to run computations to evaluate actions that it thinks could potentially be optimal w.r.t. human values, and not incentivized to think about actions that it confidently thinks are suboptimal. So for example, if I have the choice between eating chocolate ice cream, eating strawberry ice cream and eating mud, then the predictor will reason that I won't choose to eat mud and that therefore its prediction about mud won't be evaluated. Therefore, it will probably not think much about what it will be like if I eat mud (though it has to think about it a little to make sure that the other predictor can't gain by recommending mud eating).

On whether the property is desirable [ETA: I here mean the property: [human chooses based on reported distribution] but not compared to [explicitly specifying a utility function]]: Perhaps my objection is just what you mean by ELK. In any case, I think my views depend a bit on how we imagine lots of different aspects of the overall alignment scheme. One important question, I think, is how exactly we imagine the human to "look at" the distributions for example. But my worry is that (similar to RLHF) letting the human evaluate distributions rather than outcomes increases the predictors' incentives to deceive the human. The incentive is to find actions whose distribution looks good (in whatever format you represent the distribution) in relation to the other distributions, not which distributions are good. Given that the distributions are so large (and less importantly because humans have lots of systematic, exploitable irrationalities related to risk), I would think that human judgment of single outcomes/point distributions is much better than human judgment of full distributions.

comment by Caspar Oesterheld (Caspar42) · 2023-05-29T17:22:46.393Z · LW(p) · GW(p)

The following is based on an in-person discussion with Johannes Treutlein (the second author of the OP).

>But is there some concrete advantage of zero-sum conditional prediction over the above method?

So, here's a very concrete and clear (though perhaps not very important) advantage of the proposed method over the method I proposed. The method I proposed only works if you want to maximize expected utility relative to the predictor's beliefs. The zero-sum competition model enables optimal choice under a much broader set of possible preferences over outcome distributions.

Let's say that you have some arbitrary (potentially wacky discontinuous) function V that maps a distribution over outcomes onto a real value representing how much you like the distribution over outcomes. Then you can do zero-sum competition as normal and select the action for which V is highest (as usual with "optimism bias", i.e., if the two predictors make different predictions for an action a, then take the maximum of the Vs of the two predictions). This should still be incentive compatible and result in taking the action that is best in terms of V applied to the predictors' belief.
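A minimal sketch of this selection rule, assuming predictions are given as dictionaries from actions to outcome distributions. The function name `choose_action` and the particular discontinuous V used here are made up for illustration.

```python
def choose_action(preds_1, preds_2, V):
    """Pick the action whose more optimistic predicted distribution has the highest V.

    preds_1 and preds_2 map each action to a predicted distribution over outcomes;
    V maps a distribution to a real number (it need not be an expected utility).
    """
    def optimistic_value(action):
        return max(V(preds_1[action]), V(preds_2[action]))
    return max(preds_1, key=optimistic_value)

# A deliberately discontinuous preference: probability of "good",
# with a flat penalty whenever "bad" is more than 30% likely.
def V(dist):
    return dist["good"] - (1.0 if dist["bad"] > 0.3 else 0.0)

preds_1 = {"a": {"good": 0.6, "bad": 0.4}, "b": {"good": 0.5, "bad": 0.5}}
preds_2 = {"a": {"good": 0.6, "bad": 0.4}, "b": {"good": 0.9, "bad": 0.1}}

chosen = choose_action(preds_1, preds_2, V)
```

Here the predictors disagree about action b, and the decision maker acts on the more optimistic report, so b is chosen and the disagreement gets resolved by the realized outcome.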

(Of course, one could have even crazier preferences. For example, one's preferences could just be a function that takes as input a set of distributions and selects one distribution as its favorite. But I think if this preference function is intransitive, doesn't satisfy independence of irrelevant alternatives and the like, it's not so clear whether the proposed approach still works. For example, you might be able to slightly misreport some option that will not be taken anyway in such a way as to ensure that the decision maker ends up taking a different action. I don't think this is ever strictly incentivized. But it's not strictly disincentivized to do this.)

Interestingly, if V is a strictly convex function over outcome distributions (why would it be? I don't know!), then you can strictly incentivize a single predictor to report the best action and honestly report the full distribution over outcomes for that action! Simply use the scoring rule , where  is the reported distribution for the recommended action,  is the true distribution of the recommended action and  is a subderivative of . Because a proper scoring rule is used, the expert will be incentivized to report  and thus gets a score of , where  is the distribution of the recommended action. So it will recommend the action  whose associated distribution maximizes . It's easy to show that if  -- the function saying how much you like different distributions -- is not strictly convex, then you can't construct such a scoring rule. If I recall correctly, these facts are also pointed out in one of the papers by Chen et al. on this topic.

I don't find this very important, because I find expected utility maximization w.r.t. the predictors' beliefs much more plausible than anything else. But if nothing else, this difference further shows that the proposed method is fundamentally different and more capable in some ways than other methods (like the one I proposed in my comment).

comment by Noosphere89 (sharmake-farah) · 2023-05-29T14:04:14.759Z · LW(p) · GW(p)
  • The second is "optimism bias" (a good thing in this context): "If the predictors disagree about the probabilities conditional on any action, the decision maker acts as though they believe the more optimistic one." (This is as opposed to taking the market average, which I assume is what Hanson had in mind with his futarchy proposal.) If you don't have optimism bias, then you get failure modes like the ones pointed out in Obstacle 1 of Scott Garrabrant's post "Two Major Obstacles for Logical Inductor Decision Theory": One predictor/trader could claim that the optimal action will lead to disaster and thus cause the optimal action to never be taken and her prediction to never be tested. This optimism bias is reminiscent of some other ideas. For example some ideas for solving the 5-and-10 problem are based on first searching for proofs of high utility. Decision auctions also work based on this optimism. (Decision auctions work like this: Auction off the right to make the decision on my behalf to the highest bidder. The highest bidder has to pay their bid (or maybe the second-highest bid) and gets paid in proportion to the utility I obtain.) Maybe getting too far afield here, but the UCB term in bandit algorithms also works this way in some sense: if you're still quite unsure how good an action is, pretend that it is very good (as good as some upper bound of some confidence interval).

I want to mention this, as I think this is one of the reasons why I get queasy, epistemically speaking, around future doom claims, and why the people who disagree with some AI doomers are actually more rational than the doomers think. In particular, it's why claims that we should stop progress on AI aren't actually a good thing, because optimism bias serves a very useful epistemic purpose.

In particular, it avoids us moving the goalposts on doom, because the problem with doom theories is that you can always move the goalposts to the next thing, or the next year, and this is extremely bad when you consider that we have confirmation biases.

comment by Johannes Treutlein (Johannes_Treutlein) · 2023-05-30T20:53:55.295Z · LW(p) · GW(p)

Some further thoughts on training ML models, based on discussions with Caspar Oesterheld:

  • I don't see a principled reason why one couldn't use one and the same model for both agents. I.e., do standard self-play training with weight sharing for this zero-sum game. Since both players have exactly the same loss function, we don't need to allow them to specialize by feeding in a player id or something like that (there exists a symmetric Nash equilibrium).
  • There is one problem with optimizing the objective in the zero-sum game via gradient descent (assuming we could approximate this gradient, e.g., via policy gradient). The issue is that the response of the human to the prediction is discontinuous and not differentiable. I.e., local changes to the prediction will never change the action of the human and thus the gradient would just improve the prediction given the current action, rather than encouraging making predictions that make other actions look more favorable. This shows that, without any modification to the human policy, gradient descent on the objective would be equivalent to repeated gradient descent/gradient descent on the stop gradient objective [AF · GW]. To make sure this converges, one would have to implement some exploration of all of the actions. (Of course, one may hope that the model generalizes correctly to new predictions.)
  • One could get around this issue by employing other, non-local optimization methods (e.g., a random search—which would effectively introduce some exploration). Here, one would still retain the desirable honesty properties of the optimum in the zero-sum game, which would not be the case when just optimizing the score.
  • Another way to view the zero-sum game, in the case where both players are the same model, is as the optimization problem below (where […] is assumed to be the ground truth). Note that here we are just subtracting the score received by the same model, but we fix that score when optimizing […] to avoid making the objective identically zero.

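
The last bullet can be illustrated with a toy calculation. This is a sketch under assumptions that are not in the comment: a binary outcome, a logarithmic scoring rule, a fixed decision (so the human's response plays no role), and a finite-difference gradient standing in for backpropagation with a stop-gradient. All function names are hypothetical.

```python
import math

def expected_log_score(p, q_true):
    """Expected log score of predicting probability p when the true probability is q_true."""
    return q_true * math.log(p) + (1.0 - q_true) * math.log(1.0 - p)

def zero_sum_objective(p, p_fixed, q_true):
    """Own score minus the opponent's score; p_fixed is treated as a constant."""
    return expected_log_score(p, q_true) - expected_log_score(p_fixed, q_true)

def grad_wrt_p(p, p_fixed, q_true, eps=1e-6):
    """Central finite difference in p only, mimicking a stop-gradient on p_fixed."""
    up = zero_sum_objective(p + eps, p_fixed, q_true)
    down = zero_sum_objective(p - eps, p_fixed, q_true)
    return (up - down) / (2.0 * eps)

q_true = 0.7   # assumed ground-truth probability of the outcome
p = 0.3        # shared model's initial prediction

value_at_start = zero_sum_objective(p, p, q_true)  # both players are the same model
grad_at_start = grad_wrt_p(p, p, q_true)

# Gradient ascent where the opponent is a frozen copy of the current model.
for _ in range(200):
    p += 0.05 * grad_wrt_p(p, p, q_true)
```

The objective's value is zero whenever both copies make the same prediction, yet the gradient taken with the opponent's score held fixed is nonzero and moves the prediction toward `q_true`. If the subtracted score were not held fixed, the two terms would cancel and there would be nothing to optimize.
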
comment by CBiddulph (caleb-biddulph) · 2023-05-28T20:48:23.780Z · LW(p) · GW(p)

This post seems interesting and promising, thanks for writing it!

The most predictable way zero-sum competition can fail is if one of the models is consistently better than the other at predicting.

I think this could be straightforwardly solved by not training two different models at all, but by giving two instances of the same model inputs that are both slightly perturbed in the same random way. Then, neither instance of the model would ever have a predictable advantage over the other.

For instance, in your movie recommendation example, let's say the model takes a list of 1000 user movie ratings as input. We can generate a perturbed input by selecting 10 of those ratings at random and modifying them, say by changing a 4-star rating to a 5-star rating. We do this twice to get two different inputs, feed them into the model, and train based on the outputs as you described.
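
A rough Python sketch of that perturbation step, under some assumptions of mine: ratings are integers on a 1 to 5 star scale, each selected rating moves by exactly one star, and `perturb_ratings` is a hypothetical helper name. The model itself is left out.

```python
import random

def perturb_ratings(ratings, k=10, rng=None):
    """Return a copy of ratings with k randomly chosen entries moved by one star."""
    if rng is None:
        rng = random.Random()
    perturbed = list(ratings)
    for i in rng.sample(range(len(perturbed)), k):
        # Only allow neighbours that stay on the 1-5 star scale, so every chosen rating changes.
        neighbours = [r for r in (perturbed[i] - 1, perturbed[i] + 1) if 1 <= r <= 5]
        perturbed[i] = rng.choice(neighbours)
    return perturbed

rng = random.Random(0)
ratings = [rng.randint(1, 5) for _ in range(1000)]  # stand-in for one user's ratings

# Apply the same random procedure twice to get the two instances' inputs.
input_a = perturb_ratings(ratings, rng=rng)
input_b = perturb_ratings(ratings, rng=rng)
```

Both inputs come from the same distribution over perturbations, which is what keeps either instance from having a predictable advantage.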

Another very similar solution would be to randomly perturb the internal activations of each neural network during training.

Does this seem right?

Replies from: Rubi
comment by Rubi J. Hudson (Rubi) · 2023-05-29T11:23:39.683Z · LW(p) · GW(p)

Thanks for the comment. I agree that, ideally, we would find a way not to have two wholly separate models and instead somehow train a model against itself. I think a potential issue with your proposal is that small perturbations could have discontinuous effects, the anticipation of which distorts predictions. However, it would be interesting to think about further to see if there's some way to avoid that issue.

comment by ryan_greenblatt · 2023-05-27T16:15:45.033Z · LW(p) · GW(p)

Pitting two models against each other in a zero-sum competition only works so long as both models actually learn the desired goals. Otherwise, they may be able to reach a compromise with each other and cooperate towards a non-zero-sum objective.

If training works well, then they can't collude on average during training, only rarely or in some sustained burst prior to training crushing these failures.

In particular, in the purely supervised case with gradient descent, performing poorly on average during training requires gradient hacking (or more benign failures of gradient descent, but it's unclear why the goals of the AIs would be particularly relevant in this case).

In the RL case, it requires exploration hacking (or benign failures as in the gradient case).

The only way to prevent this is for the decision maker to assign some probability to all possible actions, regardless of how bad the predicted outcome is. This necessarily means bad outcomes will occur more frequently than they would if they could make deterministic decisions based on honest conditional predictions. We might reasonably say we don’t want to ever randomly take an action that leads to the extinction of humanity with high probability, but if this is true then a predictor can lie about that to dissuade us from any action. Even if we would be willing to take such an action with very small probability in order to get honest conditional predictions, we likely cannot commit to following through on such an action if our randomizer lands on it.

Thinking about this in terms of precommitment seems to me like it's presupposing that the AI perfectly optimizes the training objective in some deep sense (which seems implausible to me). The reason why this exploration procedure works is presumably that you end up selecting such actions frequently during training which in turn selects for AIs which perform well. Epsilon exploration only works if you sample the epsilon. So, it doesn't work if you set the epsilon to 1e-40 or something.
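
To put a number on the last point, here is a small Python sketch. It assumes independent episodes with a fixed per-episode exploration probability, and the episode counts are illustrative choices, not figures from the post.

```python
import math

def prob_sampled_at_least_once(epsilon, num_episodes):
    """Probability that an action taken with probability epsilon per episode
    shows up at least once in num_episodes independent episodes."""
    # 1 - (1 - epsilon) ** num_episodes, computed stably for tiny epsilon.
    return -math.expm1(num_episodes * math.log1p(-epsilon))

rare = prob_sampled_at_least_once(1e-40, 10**9)   # a billion episodes, epsilon = 1e-40
common = prob_sampled_at_least_once(0.01, 1000)   # a thousand episodes, epsilon = 0.01
```

Even with a billion episodes, an action explored with probability 1e-40 is essentially never sampled, so training never gets to select on how well it was predicted.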

Replies from: Rubi
comment by Rubi J. Hudson (Rubi) · 2023-05-29T10:46:28.662Z · LW(p) · GW(p)

For the first point, I agree that SGD pushes towards closing any gaps. My concern is that, at the moment, we don't know how small the gaps need to be to get the desired behavior (and this is what we are working on modelling now). On top of that, depending on how the models are initialized, the starting gap may be quite large, so the dynamics of how gaps close throughout the training process seem important to study further.

For the second point, I think we are also in agreement. If the training process leads the AI to learn "If I predict that this action will destroy the world, the humans won't choose it", that then leads to dishonest predictions. However, I also find the training process converging to a mesa-optimizer for the training objective (or something sufficiently close) to be somewhat more plausible.