Notes on "How do we become confident in the safety of a machine learning system?"

post by RohanS · 2023-10-26

Contents

      Summary:
  How do we become confident in the safety of a machine learning system?
    What’s a training story?
    Training story components
    How mechanistic does a training goal need to be?
    Relationship to inner alignment
    Do training stories capture all possible ways of addressing AI safety?
    Evaluating proposals for building safe advanced AI
    Case study: Microscope AI
    Exploring the landscape of possible training stories
    Training story sensitivity analysis

I'm trying out taking notes on an AI safety reading in a LessWrong draft post and sharing the result as a low-effort "distillation." I'm hoping this is a quick way to force myself to take better notes, and to provide distillations that are useful to others. This reading is "How do we become confident in the safety of a machine learning system? [LW · GW]" by Evan Hubinger.

I wasn't sure at the outset how far into tangential thoughts of my own I would go. I think that by writing the notes knowing other people might read them, I ended up a) following the structure of the original post more rigidly than usual and b) going on fewer tangents than usual. That probably isn't ideal for me; if I do this again, I might try to actively resist those impulses.

Anyway, here are the notes. Each section title here links to the corresponding section in the original post.

Summary:

This post suggests that we use training stories to flesh out and evaluate proposals for training safe ML systems. A proposal is a training setup (loosely understood), and a training story consists of a training goal and a training rationale. A training goal describes, in some mechanistic detail, the behavior you hope the training setup will give rise to; it includes both how you think the model will perform the relevant task capably and why you think performing it in that way will be safe. A training rationale describes why you think the training setup will result in a model that implements the mechanistic behavior described in the training goal. While training stories do not capture all possible ways to make AI systems safer, they are quite a general framework that can help us organize and evaluate our proposals more systematically. The framework can also help us generate new proposals by exposing gaps in the space of possible training stories. The post includes quite a few examples of proposals, training goals, training rationales, and evaluations of training stories.

How do we become confident in the safety of a machine learning system? [LW · GW]

What’s a training story? [LW · GW]

Training story components [LW · GW]

2 essential components:

  1. Training goal: A mechanistic description of the desired model and why that would be good (e.g. classify cats using human vision heuristics)
    1. This example seems to be missing the "why that would be good" part, but based on the earlier part it could be "human vision heuristics don't include any agent/optimizer which would take extreme or dangerous actions"
  2. Training rationale: Why you believe your training setup will give rise to a model satisfying the mechanistic description in the training goal
    1. "'Training setup,' here, refers to anything done before the model is released, deployed, or otherwise given the ability to meaningfully impact the world."
      1. Some things this could include that might not immediately come to mind under the words "training setup": interventions like ensembling and activation engineering
      2. Also, I think a model could meaningfully impact the world before humans intend for its training to be done. Example: a model becomes misaligned, gains a decisive strategic advantage during training, and interrupts the training process to pursue its goals. Continual learning also blurs this boundary.
        1. So a training story should also talk about why nothing bad will happen along the way?

2 subcomponents for each of the two components above, making 4 essential parts of any training story (sketched as a toy data structure after this list):

  1. Training goal specification: Mechanistic description of the desired model
  2. Training goal desirability: Why any model that meets the training goal will be safe and perform the desired task well
    1. (My addition) There's a universal quantifier here. "For all models M that meet the training goal, M is safe" is the statement that needs justifying. That's a high bar.
  3. Training rationale constraints: What constraints the model must satisfy and why the training goal is compatible with those constraints (e.g. fitting the training data perfectly (if trained to zero loss), and being implementable with the selected architecture)
  4. Training rationale nudges: Why the training setup is likely to produce a model that meets the training goal, despite other things also satisfying the constraints
    1. Could be "meeting the goal is the simplest way to meet the constraints" (inductive bias arguments)
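
As a way to keep these four parts straight, here is a minimal sketch of a training story as a data structure. This is purely my own illustration: the class and field names follow the terms above but are not from the original post, and the cat-classifier example is filled in informally.

```python
from dataclasses import dataclass

@dataclass
class TrainingGoal:
    specification: str   # mechanistic description of the desired model
    desirability: str    # why any model meeting the specification is safe and performs the task well

@dataclass
class TrainingRationale:
    constraints: str     # what any trained model must satisfy (e.g. fits the data, fits the architecture)
    nudges: str          # why training is likely to land on the goal among everything satisfying the constraints

@dataclass
class TrainingStory:
    goal: TrainingGoal
    rationale: TrainingRationale

# The cat-classifier example from the post, filled in informally.
cat_story = TrainingStory(
    goal=TrainingGoal(
        specification="classifies cats using human vision heuristics",
        desirability="human vision heuristics contain no agent/optimizer that could take dangerous actions",
    ),
    rationale=TrainingRationale(
        constraints="must fit the labeled cat data and be implementable in the chosen architecture",
        nudges="human-like heuristics are plausibly the simplest way to satisfy those constraints (inductive bias argument)",
    ),
)
```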

How mechanistic does a training goal need to be? [LW · GW]

Relationship to inner alignment [LW · GW]

Do training stories capture all possible ways of addressing AI safety? [LW · GW]

In The Plan - 2022 Update [LW · GW], John Wentworth says:

I expect we’ll also see more discussion of end-to-end alignment strategies based on directly reading and writing the internal language of neural nets.... Since such strategies very directly handle/sidestep the issues of inner alignment, and mostly do not rely on a reward signal as the main mechanism to incentivize intended behavior/internal structure, I expect we’ll see a shift of focus away from convoluted training schemes in alignment proposals.

It's not clear to me whether interventions like "reading and writing the internal language of neural nets" are supposed to fit within training stories. John does seem to be suggesting that the main part of these alignment proposals is not how the models are trained.

Evaluating proposals for building safe advanced AI [LW · GW]

Case study: Microscope AI [LW · GW]

Exploring the landscape of possible training stories [LW · GW]

First, some training goals. (This part reminds me of John Wentworth's discussion of possible alignment targets here [LW · GW], which I liked quite a bit. That post came out later, but I read it first.)

  1. Loss-minimizing models: A possible training goal, for people who are very confident they've specified a truly desirable objective, is to get a model that internally optimizes for that objective (e.g. the loss function). This can be bad, due to classic outer alignment concerns.
  2. Fully aligned agents: An agent that cares about everything humans care about and acts to achieve those goals. E.g. ambitious value learning [LW · GW]. But this is a very difficult target, one that most people think isn't worth focusing on in our first shot at aligning advanced AI. Instead, we can do something else more achievable to get useful and safe human+ level AI, and then use those systems to help us get to fully aligned superintelligence later. (Or sacrifice the fully aligned superintelligence goal because we think we'll never get it.)
    1. I personally think "human values" aren't well-defined, and in my mind fully aligned agents would do something along the lines of "solve morality and act accordingly," which probably works out to "maximize the well-being of conscious creatures." What, did I say something controversial?
  3. Corrigible agents: We may want AI systems that let themselves be turned off or modified, or help us figure out when it would be good for us to turn them off or modify them. Consider approval-directed agents (though there are ways this could fail to be corrigible).
    1. Robust instruction-following agents may also fit here?
  4. Myopic agents: AI agents that perform only limited optimization, such as optimizing only the immediate effect of their next action, rather than having long-term, large-scale goals.
  5. Simulators: An AI system that does nothing other than simulate something else. Things you might want to simulate include HCH (roughly, an infinite tree of humans delegating subtasks and consulting each other; a toy sketch of this recursive structure appears after this list), physics (for an AlphaFold-like system), and a human internet user (for a language model).
  6. Narrow agents: Agents that are highly capable in a specific domain without thinking much or at all about other domains. E.g. STEM AI [LW · GW].
    1. I think some current systems, like superhuman chess AIs, also fit this category.
  7. Truthful question-answerers: A non-agent truthful question-answering system that accurately reports what its model of the world says/predicts in human-understandable terms.
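
Since HCH (mentioned under simulators, item 5 above) is defined recursively, here is a toy, depth-limited sketch of that recursion. It is entirely my own illustration: the `human_answer` stub stands in for a real human (or a model imitating one), and real HCH is the idealized, unbounded version of this tree.

```python
def human_answer(question: str, consult) -> str:
    # Stand-in for a human (or a model imitating one). Purely illustrative:
    # it delegates one subquestion and wraps the result, just to make the
    # recursive structure visible.
    sub_result = consult(f"background needed for: {question}")
    return f"answer to '{question}' using [{sub_result}]"

def hch(question: str, depth: int = 2) -> str:
    # Depth-limited approximation of HCH: a human answering a question,
    # who can consult further (simulated) humans on subquestions.
    if depth == 0:
        return f"best-effort answer to '{question}' with no further delegation"
    return human_answer(question, consult=lambda sub: hch(sub, depth - 1))

print(hch("Is this plan safe?"))
```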

Next, some training rationales. These aren't explicitly paired with particular training goals, but there are implicit relationships to classes of training goals.

  1. Capabilities limitations: Can argue that a particular training setup will give rise to a model that does not have certain dangerous capabilities (e.g. internal optimization, understanding of how to deceive humans), and therefore won't exhibit the dangerous behaviors that require the capability.
  2. Inductive bias analysis: Can argue that certain behaviors are more likely to be learned than others given a training setup, for reasons other than "lower loss." Often this appeals to some form of simplicity. This is a promising approach, but it is hard to make strong, detailed arguments in advance about what will be learned. However, there is a growing literature outlining empirical phenomena that can inform our inductive bias assessments (like deep double descent [LW · GW], lottery tickets, scaling laws, grokking, or distributional generalization).
    1. "This is especially problematic given how inductive bias analysis essentially requires getting everything right before training begins, as a purely inductive-bias-analysis-based training rationale doesn’t provide any mechanism for verifying that the right training goal is actually being learned during training."
  3. Transparency and interpretability: If we use transparency tools during training to throw out or disincentivize dangerous models, then our training rationale could include something like "We think that our transparency checks will rule out all simple models that don’t fit the training goal, with all remaining models that don’t fit the goal being too complex according to the inductive biases of the training process to possibly be learned." One great thing about transparency tools is that they could make it possible for us to verify that the right training goal is actually being learned during training, rather than just hoping it will be and leaving training a black box. (A toy sketch of where such a check could sit in a training loop appears after this list.)
    1. Evan's A transparency and interpretability tech tree [LW · GW] is a good post on various types of interpretability along with discussion of their challenges and limitations. It helped me see what ambitious interpretability could look like.
  4. Automated oversight: Humans may struggle to label data or use transparency tools; having humans-in-the-loop slows things down, and we can only understand human-understandable concepts. We could perhaps automate these things with competent AIs, including ones that understand the alien abstractions used by the model we're training (maybe a copy of that same model). How can we trust the supervisor AI? That's a challenge; maybe using some recursive oversight methods. For a more thorough exploration of what such an approach might look like, see “Relaxed adversarial training for inner alignment [LW · GW].”
  5. AI cognitive science: Instead of mechanistic analysis, we could do behavioral analysis of models; test how they act in ways that we think reveal important underlying details. Possibly a less direct and trustworthy source of info than mechanistic interpretability. Might be hard to extrapolate findings to more capable models. 
  6. Precursor checking: Check for some easier-to-check property that is a necessary precursor for bad behavior. E.g. it might be easier to check that a model's planning-horizon length is below a certain threshold, and that might imply that the model isn't being deceptive, while checking for deception more generally may be very difficult. Need to make sure the property is actually a necessary condition. 
  7. Loss landscape analysis: Extension of inductive bias analysis, can help us predict paths through parameter space (rather than just the final local loss minimum). Could be used in conjunction with various other training rationales on this list.
  8. Game-theoretic/evolutionary analysis: What multi-agent dynamics are likely to be learned? One difficult aspect of this is avoiding the assumption that agents will be optimizing for their specified objectives; agent objectives are typically taken as given when finding equilibria in game theory. Evolutionary analysis relies on this assumption less: it deals with properties that tend to be selected for in multi-agent settings, which might be useful here.
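
As a toy illustration of the transparency-and-interpretability rationale (item 3 above): the hard part is the transparency tooling itself, which is stubbed out here as a hypothetical `looks_deceptive` function of my own invention; the loop only shows where such a check would sit in training.

```python
def train_with_transparency_gate(model, data, train_step, looks_deceptive, num_epochs=10):
    # Ordinary training loop, except that after each epoch we run a
    # (hypothetical) transparency check and abort if it flags the model.
    #
    # `train_step(model, batch)` and `looks_deceptive(model) -> bool` are
    # placeholders: the latter stands in for whatever mechanistic
    # transparency tooling would actually be needed, and building that
    # tooling is the open problem, not this loop.
    for epoch in range(num_epochs):
        for batch in data:
            train_step(model, batch)
        if looks_deceptive(model):
            # Throw the model out rather than keep training something we
            # can't verify meets the training goal.
            raise RuntimeError(f"transparency check failed after epoch {epoch}")
    return model
```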

We can use these lists of goals and rationales to categorize past approaches, we can combine existing goals and rationales in new ways, and we can try to come up with new goals and rationales (aided by the fact that these lists are clearly not exhaustive).

Training story sensitivity analysis [LW · GW]

We want to analyze how sensitive the outcome of a training setup is to errors in the assumptions that go into its training story. What happens when a training story fails, i.e., when one of its assumptions is wrong? Does it fail safely or catastrophically?
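
A very rough sketch of what a first pass at this might look like; the assumptions, credences, and failure modes below are made up purely for illustration and are not from the original post.

```python
# Entirely made-up illustration: list the assumptions behind a training story,
# a rough credence that each holds, and what happens if it is wrong.
assumptions = [
    # (assumption, credence it holds, outcome if wrong)
    ("SGD's simplicity bias favors the training goal", 0.6, "silently learns a proxy; potentially catastrophic"),
    ("transparency checks catch deceptive cognition", 0.7, "deceptive model passes checks; catastrophic"),
    ("the model cannot affect the world mid-training", 0.9, "fails relatively safely if training is sandboxed"),
]

# Crude prioritization: look first at the shakiest assumptions,
# especially those whose failure would be catastrophic.
for claim, credence, failure_mode in sorted(assumptions, key=lambda a: a[1]):
    print(f"p={credence:.1f}  {claim}  ->  if wrong: {failure_mode}")
```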

The original post contains some examples of what our uncertainties about the assumptions in a training story might look like and how we can have more confidence in some parts of a training story than others.

"Hopefully, as we build better training stories, we’ll also be able to build better tools for their sensitivity analysis so we can actually build real confidence in what sort of model our training processes will produce."
