Notes on Internal Objectives in Toy Models of Agents

post by Paul Colognese (paul-colognese) · 2024-02-22T08:02:39.556Z

Contents

  Summary
  Introduction
    Internal Target Information/Internal Objectives
    Toy Models of Agents
  Toy Model: Thermostat
    The Thermostat
    Observations of the Thermostat
    Internal Target Information
    Observations Regarding the Thermostat’s Internal Target Information
  Toy Model: Model-Based Planning (Maze-Solving) Agent
    The Model-Based Planning (Maze-Solving) Agent
      The Task
      The Pseudo-Algorithm
      The Neural Network-based Value Function
    Observations of the Agent's Internal Target Information
  Discussion
    What form does ITI take?
    Retargetable Cognition
    Detecting an Agent's Objective

Thanks to Jeremy Gillen and Arun Jose for discussions related to these ideas.

Summary

WARNING: The quality of this post is low. It was sitting in my drafts folder for a while, yet I decided to post it because some people found these examples and analyses helpful in conversations. I tidied up the summary, deleted some sections, and added warnings related to parts of the post that could be confusing.

These notes are the result of reflecting on how Internal Objectives/Internal Target Information might be represented in simple theoretical models of agents. This reflection aimed to inform how we might detect these Internal Objectives via interpretability.

Note: Insights are over-indexed on these particular models of agents.

Insights include:

Introduction

Internal Target Information/Internal Objectives

Internal Target Information (ITI) or an Internal Objective is information about an agent's target used by the agent's internals to differentiate between actions that lead to the target outcome and actions that don’t.

The hope is that by developing interpretability methods to detect and interpret this information, we can directly detect misalignment.

Toy Models of Agents

In this post, we explore two toy models of agents and how ITI is internally represented within these models. Our models are neural-network-based with static objectives. We sometimes reflect on what these models would look like if their objectives were retargetable, but do not consider this in detail.

These models are not implemented in practice; we analyze them purely theoretically, for speed of analysis.

Toy Model: Thermostat

The Thermostat

Consider a Thermostat f whose input x is the room's current temperature (as measured by a thermometer), and whose output is one of the two following actions: "Turn the radiator on" or "Turn the air conditioning on." Let T denote the Thermostat's target temperature for the room. We assume that the thermostat implements the following function and that this is enough to maintain the target temperature of the room within reasonable bounds:

f(x) = "Turn the radiator on" if x < T; f(x) = "Turn the air conditioning on" if x ≥ T.
We suppose that f is implemented as a simple neural network with one hidden layer, where the single input neuron corresponds to the room's current temperature x, there are two neurons h1 and h2 in the hidden layer with ReLU activations, and the two output neurons o1 and o2 correspond to "Turn the radiator on" and "Turn the air conditioning on," respectively (see Figure below).

Neural Network diagram of the Thermostat. (WARNING: Missing bias units and weights)

The hidden neuron values are given by h1 = ReLU(T - x) and h2 = ReLU(x - T). We apply the identity to get the outputs, i.e., o1 = h1 and o2 = h2.

An argmax is applied to the outputs to select the corresponding action, i.e., if o1 > o2, "Turn the radiator on" is selected; otherwise, "Turn the air conditioning on" is selected. One can check that this implements f as described above.
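
To make this concrete, here is a minimal Python sketch of the forward pass described above; the specific target temperature of 21°C and the tie-breaking behaviour at x = T are illustrative assumptions, not details from the post:

```python
def thermostat(x, target=21.0):
    """Forward pass of the toy thermostat network.

    `x` is the room's current temperature; `target` plays the role of T.
    """
    # Hidden layer (ReLU activations): input weights are -1 and +1,
    # biases are +T and -T, so the target temperature lives in the biases.
    h1 = max(0.0, target - x)  # positive when the room is colder than T
    h2 = max(0.0, x - target)  # positive when the room is warmer than T

    # Output layer: identity activations, so o1 = h1 and o2 = h2.
    o1, o2 = h1, h2

    # Argmax over the outputs selects the action.
    return "Turn the radiator on" if o1 > o2 else "Turn the air conditioning on"


print(thermostat(18.0))  # -> "Turn the radiator on"
print(thermostat(25.0))  # -> "Turn the air conditioning on"
```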

Observations of the Thermostat

The thermostat can be thought of as having a crude world model that consists of a single variable that tracks the temperature of the room, i.e., the input neuron. 

Internal Target Information

Recall that Internal Target Information (ITI) is information about an agent's target used to differentiate between actions that lead to the target outcome and actions that don’t.

In this example, the thermostat's target is to maintain the room at temperature T. We see that this information is explicitly stored in the bias terms of both hidden units (as +T and -T).

Note that we can construct a space of neural networks that implement the same function f by multiplying the weights and biases feeding into the hidden units by some positive number λ and multiplying the weights between the hidden units and the output units by 1/λ.

This observation highlights the problem with conceiving of ITI as being explicitly represented in a human-interpretable way. Perhaps instead, it is better to conceive of ITI as encoded in the function that maps from the world model of the thermostat (in this case, a single variable represented by the input neuron) to the outputs of the neural network (or to intermediate variables such as those in the hidden layer).
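
As a quick check of the rescaling argument, the sketch below reuses the illustrative network from above, scaling the input-to-hidden weights and biases by λ and the hidden-to-output weights by 1/λ; for any λ > 0 the selected action is unchanged, even though no individual bias equals T anymore:

```python
def thermostat_scaled(x, target=21.0, lam=1.0):
    """The same thermostat function, with the hidden-layer weights/biases
    scaled by lam and the output weights scaled by 1/lam (lam > 0)."""
    h1 = max(0.0, lam * (target - x))  # input weight -lam, bias +lam*T
    h2 = max(0.0, lam * (x - target))  # input weight +lam, bias -lam*T
    o1, o2 = h1 / lam, h2 / lam        # hidden-to-output weights 1/lam
    return "Turn the radiator on" if o1 > o2 else "Turn the air conditioning on"


for lam in (0.01, 1.0, 100.0):
    assert thermostat_scaled(18.0, lam=lam) == "Turn the radiator on"
    assert thermostat_scaled(25.0, lam=lam) == "Turn the air conditioning on"
```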

Observations Regarding the Thermostat’s Internal Target Information

Quick observations:

Toy Model: Model-Based Planning (Maze-Solving) Agent

The Model-Based Planning (Maze-Solving) Agent

The Task

The agent receives an input image of a maze and has to compute and output a plan/action sequence that will lead the mouse to the yellow star.

The environment. After seeing the image of the maze, the agent must produce an action sequence that causes the mouse to reach the yellow star.

The Pseudo-Algorithm

We briefly describe a Pseudo-Algorithm that the agent uses to solve the maze. The Figure below might be more illuminating.

We assume that the agent's algorithm used to select action sequences/plans is modular. The algorithm consists of several functions, including a world model for simulating the outcomes of plans, a value function for evaluating the value of a world state, and an argmax.

The agent receives an image of the maze, and for each action sequence/plan in its set of candidate plans, the agent uses its world model to simulate the outcome of the plan. For each outcome, it uses the value function to assign a value according to how good the resulting state is, and finally, an argmax function is used to select the plan that leads to the highest-value state.
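
A minimal sketch of this planning loop is below; `world_model`, `value_fn`, and `candidate_plans` are stand-ins for components the post leaves abstract:

```python
def select_plan(maze_image, candidate_plans, world_model, value_fn):
    """Model-based planning: simulate each plan, score its outcome, pick the best."""
    best_plan, best_value = None, float("-inf")
    for plan in candidate_plans:
        # The world model predicts the world state reached by following the plan.
        outcome = world_model(maze_image, plan)
        # The value function scores how good that predicted state is.
        value = value_fn(outcome)
        if value > best_value:
            best_plan, best_value = plan, value
    return best_plan  # argmax over plans
```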

WARNING: "Value function" and "evaluation function" are used interchangeably.

 

The Neural Network-based Value Function

WARNING: This section is probably unnecessary. Feel free to ignore. I was going to write out the network explicitly but that seemed like it was too much effort. It wouldn't surprise me if the details are wrong/misleading, but I'm leaving it in just in case.

For our purposes, it isn't essential to consider how functions other than the value function are implemented. We ground the value function by assuming it's implemented by a neural network that we will describe shortly. The whole of the agent's architecture could be a neural network or some modular system that makes calls to a neural network that instantiates the value function.

Potentially, the most natural value function that encodes the idea that the agent's target is to get to the yellow star calculates the number of squares n in the shortest path between the mouse and the yellow star and sets the value of that state to a quantity that decreases with n (e.g., -εn for some sufficiently small ε > 0).
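
A sketch of this "natural" value function is below, assuming the maze is given as a set of walkable (x, y) cells and using breadth-first search for the shortest path; the -εn form and the constant ε are assumptions for illustration:

```python
from collections import deque

def natural_value_fn(walkable_cells, mouse, star, eps=0.01):
    """Value of a state: -eps * (number of squares on the shortest mouse-to-star path)."""
    frontier, seen = deque([(mouse, 0)]), {mouse}
    while frontier:
        (x, y), n = frontier.popleft()
        if (x, y) == star:
            return -eps * n
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in walkable_cells and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, n + 1))
    return float("-inf")  # the star is unreachable from the mouse's position
```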

We want to be explicit and write down the neural network that encodes the value function, as we did in the thermostat case. Therefore, we choose a proxy value function that can be easily encoded in a neural network and that would still result in the correct plan/action sequence being chosen.

We assume that the input to this value function is a sequence of variables representing high-level, relevant information about a particular world state. In particular, we assume that the input to the value function/neural network is a sequence of 6 variables corresponding to the x/y coordinates of the mouse, the yellow star, and the red gem in the maze. We denote the yellow star's x-coordinate as x_star and use similar naming conventions for the other variables.

The value function V is given by the negative Manhattan distance between the mouse and the yellow star:

V(x_mouse, y_mouse, x_star, y_star, x_gem, y_gem) = -(|x_mouse - x_star| + |y_mouse - y_star|).

The highest value is 0 and corresponds to a world state where the mouse is at the yellow star (the target).
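
Written so that it is directly expressible as a small ReLU network (using |a - b| = ReLU(a - b) + ReLU(b - a)), a sketch of this proxy value function might look like the following; the variable names are illustrative:

```python
def relu(z):
    return max(0.0, z)

def value_fn(state):
    """Proxy value: negative Manhattan distance between the mouse and the yellow star.

    `state` = (x_mouse, y_mouse, x_star, y_star, x_gem, y_gem);
    the red gem's coordinates receive zero weight.
    """
    x_mouse, y_mouse, x_star, y_star, _x_gem, _y_gem = state
    # |a - b| = ReLU(a - b) + ReLU(b - a): two ReLU units per coordinate.
    dist_x = relu(x_mouse - x_star) + relu(x_star - x_mouse)
    dist_y = relu(y_mouse - y_star) + relu(y_star - y_mouse)
    return -(dist_x + dist_y)  # maximum value 0, attained when the mouse is at the star


assert value_fn((3, 4, 3, 4, 0, 0)) == 0  # mouse at the star -> highest value
```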

 

Observations of the Agent's Internal Target Information

Discussion

What form does ITI take?

The ITI is likely to be distributed across multiple parameters and activations but sufficiently localized that it can be used by a process that maps from world state variables to actions.

From these examples, it seems that the ITI is embedded in the mapping (the action-selection process) from the agent's world model to actions that lead to the target outcome.

Retargetable Cognition

Detecting an Agent's Objective
