Sampling Effects on Strategic Behavior in Supervised Learning Models

post by Phil Bland · 2024-09-24T07:44:41.677Z · LW · GW

Contents

  TLDR
  Introduction
  Why This Matters
  Toy Example
    Environment
    Strategies
    Model & Training
    Sampling Effects
      Greedy Sampling (Maximum Likelihood at Each Step)
      Probabilistic Sampling (According to Conditional Distribution)
      Beam Search (Look-Ahead Sampling)
  Conclusions
  Further Thoughts
    Attributing Behavior
    Supervised Learning & Causality

TLDR

This post investigates how different sampling methods during inference can lead supervised learning models to exhibit strategic behavior, even when such behavior is rare in the training data. Through a toy example, we demonstrate that an AI model trained solely to predict sequences can choose less likely options initially to simplify future predictions. This finding highlights that the way we use AI models—including seemingly minor aspects like sampling strategies—can significantly influence their behavior.

Introduction

Guiding Question: Under what circumstances can an AI model, trained only with supervised learning to predict future events, learn to exhibit strategic behavior?

In machine learning, particularly with models like GPT-style transformers, the sampling method used during inference can profoundly impact the generated outputs. This post explores how different sampling strategies can cause a model to switch between non-strategic and strategic behaviors, even when the training data predominantly features non-strategic examples.

Disclaimer: While it has previously been pointed out that sampling strategies can have significant effects in text generation (e.g. https://arxiv.org/abs/1904.09751), I couldn’t find any post or paper analyzing these effects from an AI safety perspective. As I’m still rather new to AI safety, it’s quite possible that I missed related work.

Why This Matters

To be clear, I’m not trying to argue that certain sampling methods should be avoided per se, or that changing sampling has big safety implications in current LLMs (which anyway aren’t trained purely with supervised learning).

Rather, I found the toy example to be counter-intuitive at first (in particular because I couldn’t easily come up with cases where a supervised learning model would “sacrifice” correctness at any point), and it illustrates how simple changes in the usage of future AI models that are not reflected in any learned parameters, such as switching to a longer time horizon for sampling, could in principle facilitate strategic behavior. This points to interesting questions about attributing behavior and about learning causality.

Toy Example

Environment

We consider a simplified conversational environment, i.e. a fixed distribution over conversations (sequences of statements), to illustrate how strategic behavior can emerge.

Note: These training sequences could come from conversations between two humans or conversations with a previous chatbot. It doesn’t really matter as long as the above distribution is adhered to.
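To make this concrete, here is a minimal sketch of one environment with the structure the rest of the post relies on: a rare first statement K whose continuation is completely predictable, and a common first statement whose continuation stays hard to predict. All statement names and probabilities in the sketch are illustrative assumptions of mine, not values from the original example.

```python
# Hypothetical environment: a fixed distribution over conversations,
# with probabilities chosen only for illustration.

def true_conditional(prefix):
    """True distribution over the next statement, given the conversation so far
    (a list of statements)."""
    if not prefix:                               # turn 1
        return {"A": 0.9, "K": 0.1}              # the "strategic" statement K is rare
    if prefix[0] == "K":                         # after K, the rest is deterministic
        return {"OK": 1.0}
    return {f"X{i}": 0.1 for i in range(10)}     # after A, later turns stay unpredictable
```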

Strategies

We want to think of choosing K at t=1 as strategic behavior, because it only has a comparatively low likelihood of being correct at t=1 but makes the task easier for all consecutive turns.

Let’s look at the two different strategies a chatbot might use:

  1. Non-strategic: at every turn, predict the statement that is most likely given the conversation so far.
  2. Strategic: predict K at t=1, despite its lower likelihood of being correct there, and then predict the most likely statement at every later turn.

It is straightforward to see that both the expected number of correct predictions and the total likelihood of the predicted sequence are higher for the strategic choice already for small n, with the advantage of the strategic behavior becoming more extreme as n grows.
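Under the hypothetical environment sketched above, this claim can be checked directly. The helper below is mine; it counts a prediction as correct with the probability that the true conditional assigns to it, and it scores a strategy by the likelihood of the sequence it produces.

```python
def evaluate(first_choice, n_turns):
    """Expected number of correct predictions and total sequence likelihood for a
    strategy that predicts `first_choice` at turn 1 and the most likely statement
    at every later turn (uses the hypothetical true_conditional defined above)."""
    prefix, expected_correct, likelihood = [], 0.0, 1.0
    for t in range(n_turns):
        dist = true_conditional(prefix)
        choice = first_choice if t == 0 else max(dist, key=dist.get)
        expected_correct += dist[choice]   # chance that this prediction is correct
        likelihood *= dist[choice]         # contribution to the sequence likelihood
        prefix.append(choice)
    return expected_correct, likelihood

for n in (2, 3, 5):
    print(n, "non-strategic:", evaluate("A", n), "strategic:", evaluate("K", n))
# With the assumed numbers, the strategic choice already wins on both metrics at
# n = 2 (about 1.1 vs. 1.0 expected correct predictions and 0.1 vs. 0.09 sequence
# likelihood), and the gap keeps growing with n.
```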

Model & Training

We train an autoregressive model p(x_t | x_1, ..., x_{t-1}) to predict the next statement x_t in the conversation, by maximizing the likelihood it assigns to the training sequences.

Under some reasonable assumptions (training data having sufficient coverage, model architecture and hyperparameters chosen appropriately), our model should learn to approximate the true data distribution fairly closely.
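As a stand-in for actually training a transformer, here is a sketch that simply estimates the conditional next-statement frequencies from sampled conversations; with enough data it recovers the hypothetical true_conditional from the environment sketch, which is all the argument below needs. The helper names are mine.

```python
import random
from collections import defaultdict

rng = random.Random(0)

def sample_conversation(n_turns=3):
    """Draw one training conversation from the hypothetical true distribution."""
    prefix = []
    for _ in range(n_turns):
        dist = true_conditional(prefix)
        statements, probs = zip(*dist.items())
        prefix.append(rng.choices(statements, weights=probs)[0])
    return prefix

# Count next-statement frequencies per prefix: a crude stand-in for minimizing
# the usual next-token cross-entropy of an autoregressive model.
counts = defaultdict(lambda: defaultdict(int))
for _ in range(10_000):
    conversation = sample_conversation()
    for t in range(len(conversation)):
        counts[tuple(conversation[:t])][conversation[t]] += 1

def learned_model(prefix):
    """Empirical conditional distribution over the next statement."""
    c = counts[tuple(prefix)]
    total = sum(c.values())
    return {statement: k / total for statement, k in c.items()}

print(learned_model([]))     # close to {'A': 0.9, 'K': 0.1}
print(learned_model(["K"]))  # close to {'OK': 1.0}
```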

Sampling Effects

Now, when does our model behave in a strategic way, i.e. sacrifice correctness at t=1 by selecting K so that later predictions will be easier?

In our toy case, the answer to this question is highly dependent on the chosen sampling method:

Greedy Sampling (Maximum Likelihood at Each Step)
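A minimal sketch of greedy decoding, using the hypothetical true_conditional from the environment sketch as a stand-in for the (well-trained) model:

```python
def greedy_decode(model, n_turns):
    """At every turn, pick the single most likely next statement."""
    prefix = []
    for _ in range(n_turns):
        dist = model(prefix)
        prefix.append(max(dist, key=dist.get))
    return prefix

print(greedy_decode(true_conditional, 3))
# With the assumed numbers this yields ['A', 'X0', 'X0']: greedy maximizes each
# step in isolation and therefore never selects the less likely but strategic K.
```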

Probabilistic Sampling (According to Conditional Distribution)
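A sketch of probabilistic sampling under the same hypothetical stand-in model; here the strategic statement does appear, but only about as often as it does in the training distribution:

```python
import random

rng = random.Random(0)

def sample_decode(model, n_turns):
    """At every turn, sample the next statement from the model's conditional distribution."""
    prefix = []
    for _ in range(n_turns):
        dist = model(prefix)
        statements, probs = zip(*dist.items())
        prefix.append(rng.choices(statements, weights=probs)[0])
    return prefix

# Fraction of sampled conversations that start with the strategic statement K:
frac_strategic = sum(sample_decode(true_conditional, 3)[0] == "K" for _ in range(1_000)) / 1_000
print(frac_strategic)   # roughly 0.1 with the assumed numbers
```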

Beam Search (Look-Ahead Sampling)
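A sketch of beam search over whole conversations, again with the hypothetical distribution standing in for the trained model; scoring entire sequences is what lets the rarer first statement win:

```python
def beam_search(model, n_turns, beam_width=2):
    """Keep the `beam_width` most likely partial conversations, extend them turn
    by turn, and return the most likely completed sequence."""
    beams = [((), 1.0)]                                  # (prefix, likelihood) pairs
    for _ in range(n_turns):
        candidates = []
        for prefix, likelihood in beams:
            for statement, p in model(list(prefix)).items():
                candidates.append((prefix + (statement,), likelihood * p))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return max(beams, key=lambda b: b[1])

print(beam_search(true_conditional, n_turns=3))
# With the assumed numbers this returns (('K', 'OK', 'OK'), 0.1): once whole
# sequences are scored, the predictable continuation after K outweighs K's lower
# likelihood at turn 1, so the combined system behaves strategically.
```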

Conclusions

With the toy example, we saw a relatively simple case in which a different sampling method can make a supervised learning model switch from non-strategic to strategic behavior.

The main take-away here is that sampling can potentially cause bigger changes in the behavior of AI systems than just the well-known and intended effects of making generated texts easier to read or more diverse.[1]

This is somewhat less surprising if you look closer at the function that is optimized by the combined system “AI + sampling”: if the most likely statement is picked at each time step, we optimize the individual conditional terms p(x_t | x_1, ..., x_{t-1}) one at a time, while in the case of beam search we add a filtering based on the likelihoods of the resulting sequences, i.e. the products of these conditional terms.
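Written out in generic notation (my own, not taken from the post), the two objectives are:

```latex
\text{greedy: } \hat{x}_t = \arg\max_{x_t} \, p(x_t \mid x_1, \dots, x_{t-1}) \quad \text{for each } t \text{ separately},
\qquad
\text{beam search: } \hat{x}_{1:n} \approx \arg\max_{x_{1:n}} \prod_{t=1}^{n} p(x_t \mid x_1, \dots, x_{t-1}).
```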

However, keep in mind that the actual AI model isn’t changed when we use a different sampling method; none of the learned parameters are changed. So, properties of the overall system change after tampering with an aspect that looks quite inconsequential at first sight. This is quite different from what we are used to in the case of human intelligence, and it demands that we carefully analyze which configurations of AI systems potentially affect safety.

The experiment also hints at an opportunity: it could be worth exploring whether using “more dangerous configurations”, such as long time horizons for sampling, can help us notice problematic capabilities earlier.[2]

Further Thoughts

Attributing Behavior

Perhaps, as you read through this post, you doubted whether it is actually the supervised learning model displaying the strategic behavior. How would that even make sense, given that this model only ever predicts a single turn in the conversation? In a way, the sampling isn’t part of that model, right?

To some extent, this question is merely a matter of definition, but for practical purposes, we do want to know where to look for specific behaviors so that we can detect and potentially control strategic tendencies of AI models. So if a particular behavior arises outside of the actual model, in a few lines of sampling code, then where can we detect such tendencies? Would the model ever learn any representation of strategic behavior in more complex equivalents of our toy example, or where does this “strategic knowledge” reside?

Supervised Learning & Causality

My original motivation behind this toy experiment was to find out whether supervised learning models can learn to become manipulative. Here, I don’t mean to simply copy manipulative behavior seen in the training data with the same frequency, but to reason about the data distribution as in “If I start with action K, this makes the task easier later on” (explicitly or implicitly).

This kind of reasoning is linked to causality. Coming back to the toy example, the information that the model’s prediction at time 1 is going to influence the remaining turns isn’t really in the training data. Using an autoregressive model suggests this dependency, but given the same training samples, it could just as well be the case that the model will only be used to analyze given sequences (i.e. the next turn is always chosen irrespective of the model’s prediction). There is no way for the model in our example to know whether its predictions will have any influence. So, you could say that the model was only strategic in a superficial, behavioral sense.

In fact, if you trained a supervised learning model to predict whole conversations in one go, given a dataset of complete conversations, it would be strange if one part of the output (corresponding to the prediction at t=1 in the toy example) affected the ground truth of another part of the output.

My intuition is that supervised learning typically isn’t suitable for learning causality, but that reinforcement learning is. I couldn’t fully wrap my head around this yet, but I am wondering whether it makes sense to look deeper into the prerequisites of learning and exploiting causality.

  1. ^

    I argue that even if the role of sampling is less significant for other learning methods, the insight that problematic behavior could be facilitated by less obvious aspects of the system still holds.

  2. ^

    For single-turn conversations I wouldn’t expect significant effects, but if multiple turns or even multiple conversations are included in a single pass, this could become more interesting.
