Human preferences as RL critic values - implications for alignment

post by Seth Herd · 2023-03-14T22:10:32.823Z · LW · GW · 6 comments

Contents

  The payoff: a handle for alignment.
  Values in RL and brains: what is known.
  Extrapolation: human preferences as RL values
  Preferences for abstractions
  Advantages of a powerful critic system
  Conclusions, caveats, and directions
6 comments

TLDR: Human preferences might be largely the result of a critic network head, much like that used in SOTA agentic RL systems. The term "values" in humans might mean almost exactly what it does in RL: an estimate of the discounted sum of future rewards. In humans, this estimate is based on better and more abstract representations than those available to current RL systems.

Work on aligning RL systems often doesn't address the critic system as distinct from the actor. But using systems with a critic head may provide a much simpler interface for interpreting and directly editing the system's values and, therefore, its goals and behavior. In addition, including a powerful critic system may be advantageous for capabilities as well.

One way to frame this is that human behavior, and therefore the behavior of a neuromorphic AGI, might well be governed primarily by a critic system, and that aligning that critic is simpler than understanding the complex mess of representations and action habits in the remainder of the system.

Readers will hopefully be familiar with Steve Byrnes' sequence intro to brain-like AGI safety [? · GW]. What I present here seems to be entirely consistent with his theories. Post 14 [? · GW] of that sequence, his recent elaboration [AF · GW], and other recent work[1] present similar ideas. We use different terminology[2] and explanatory strategies, and I focus more on specifics of the critic system, so hopefully the two explanations are not redundant.

The payoff: a handle for alignment.

Wouldn't it be nice if the behavior of a system were governed by a single subsystem, and that subsystem provided an easily trainable (or even hand-editable) set of weights? And if it had a clear readout, meaning something like "I'll pursue what I'm currently thinking about, as a goal, with priority 0-1 in this context"?

Suppose we had a proto-AGI system including a component with the above properties. That subsystem is what I'm terming the critic. Now suppose further that this system is relatively well-trained but is (by good planning and good fortune) still under human control. We could prompt it with language like "think about helping humans get things they want" or "...human flourishing" or whatever your top pick(s) are for the outer alignment problem. Then we'd hit the "set value to max" button, and all of the weights into the critic system from the active conceptual representations would be bumped up.
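
As a toy sketch of what that button might amount to, suppose the critic were just a linear head over whatever conceptual features are currently active. All of the names and numbers below are invented for illustration; no current system exposes an interface like this.

```python
import numpy as np

def bump_values(critic_w: np.ndarray, active_features: np.ndarray,
                step: float = 0.1, v_max: float = 1.0) -> np.ndarray:
    """Toy 'set value to max' button: nudge a linear critic head so its value
    estimate for the currently active conceptual features moves toward v_max."""
    value = critic_w @ active_features   # the critic's current estimate for this "thought"
    return critic_w + step * (v_max - value) * active_features

# Hypothetical usage: features active while the system thinks about human flourishing
rng = np.random.default_rng(0)
features = rng.random(8)
weights = np.zeros(8)
for _ in range(25):
    weights = bump_values(weights, features)
print(weights @ features)  # the value assigned to this pattern of activity approaches 1.0
```

The hard part, as discussed below, is knowing which representations are active and what they mean; the sketch only illustrates why a critic head is a small, legible target compared to the rest of the system.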

If we knew that the critic system's value estimate would govern future model-based behavior, and that it would over time train new model-free habits of thought and behavior, that would seem a lot better than trying to teach it what to do by rewarding behavior alone and guessing at the internal structure leading to that behavior.

I'm suggesting here that such a system is advantageous not only for alignment but also for capabilities. Which would be a nice conjunction, if it turns out to be true.

This wouldn't solve the whole alignment problem by a long shot. We'd still have to somehow decode or evoke the representations feeding that system, and we'd still have to figure out outer alignment: how to define what we really want. And we'd have to worry about its values shifting after it was out of our control. But having such a handle on values would make the problem a lot easier.

Values in RL and brains: what is known.

The basal ganglia and dopamine system act much like the actor and critic systems, respectively, in RL. This has been established by extensive experimental work. Papers and computational models, including ones I've worked on, review that empirical research[3]. In brief, the amygdala and related subcortical brain areas seem to collectively act much like the critic in RL[4].

Most readers will be familiar with the role of a critic in RL. Briefly, it learns to make educated guesses about when something good is about to happen, so that it can provide a training signal to the network. In mammals, this has been studied in lab situations like lighting a red light exactly three seconds before giving a hungry mouse a food pellet: as the mouse learns that reward reliably follows, its dopamine system releases more and more dopamine when the red light comes on. The same dopamine critic system also learns to cancel the dopamine signal when the now-predicted food pellet arrives.

The critic system thus makes a judgment about when something seems likely to be new-and-good, or new-and-bad. It should be intuitively clear why those are the right occasions to apply learning, either to do more of what produced a new-and-probably-good outcome, or less of what produced a new-and-probably-bad outcome. In mathematical RL, this is the temporal-difference error: the reward just received plus the change in the estimated time-discounted sum of future rewards. Neuroscience refers to the dopamine signal as a reward prediction error signal. They mean the same thing.
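
As a minimal worked example of that math, here is a tabular temporal-difference critic for the red-light experiment above. The states, learning rate, and discount are invented for illustration, and the TD error stands in for phasic dopamine only in the loose sense described in the text.

```python
import numpy as np

GAMMA, ALPHA = 0.9, 0.1
# Toy states: 0 = inter-trial context, 1 = red light on, 2 = waiting for food
V = np.zeros(3)

def run_trial(p_reward: float = 1.0, rng=np.random.default_rng(0)):
    """One conditioning trial. Returns the TD errors - this sketch's
    stand-in for phasic dopamine - at light onset and at food time."""
    reward = float(rng.random() < p_reward)
    # Light onset is assumed unpredictable, so the inter-trial state keeps value 0
    d_light = 0.0 + GAMMA * V[1] - V[0]
    d_wait = 0.0 + GAMMA * V[2] - V[1]
    d_food = reward - V[2]          # the trial ends after the food (or its omission)
    V[1] += ALPHA * d_wait          # TD(0) updates of the value estimates
    V[2] += ALPHA * d_food
    return d_light, d_food

for _ in range(2000):
    run_trial()
print(run_trial())  # the big positive error has moved from the food (~0) to the light (~0.81)
```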

There are some wrinkles on this story raised by recent work. For instance, some dopamine neurons fire when punishment happens instead of pausing[5]. That and several other results add to the story but don't fundamentally change it[6].

Most of the high-quality experimental evidence is in animals, and in pretty short and easy predictions of future rewards, like the red light exactly three seconds before every reward. There are lots of variations that show rough[7] adherence to the math of a critic: a 50% prediction of reward means half as much dopamine released at the conditioned stimulus (CS in the jargon, the light) and roughly 50% as much dopamine at the unconditioned stimulus (US, the food pellet).
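
The 50% case falls out of the same toy math: re-running the sketch above with reward on half of the trials gives roughly half-sized responses at both the light and the pellet.

```python
V[:] = 0                          # reset the toy critic from the sketch above
for _ in range(5000):
    run_trial(p_reward=0.5)
d_light, d_food = run_trial(p_reward=0.5)
print(d_light)  # roughly half the fully-predicted case (~0.4 vs ~0.81)
print(d_food)   # roughly +0.5 when the pellet arrives, about -0.5 when it's omitted
```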

In humans, the data is much less precise, as we don't go poking electrodes into human heads except sometimes when and where they are needed for therapeutic purposes. In animals, we don't have data on solving complex tasks or thinking about abstract ideas. Therefore, we don’t have direct data on how the critic system works in more complex and abstract domains.

Extrapolation: human preferences as RL values

The obvious move is to suppose that all human decision-making is done using the same computations.

The available data is quite consistent with this hypothesis; for instance, humans show blood oxygenation (BOLD from fMRI) activity that matches release of dopamine when they receive social and monetary rewards[8], and monkeys show dopamine release when they're about to get information about an upcoming water reward, even when that information won't change their odds or amount of reward at all[9]. There are many other results, none directly demonstrative, but all consistent with the idea that the dopamine critic system is active and involved in all behavior.

One neuroscientist who has directly addressed this explanation of human behavior is Read Montague, who wrote a whole popular press book on this hypothesis[10]. He says we have an ability to plug in essentially anything as a source of reward and calls that our human superpower. He's proposing that we have a more powerful critic than other mammals, and that's the source of our intelligence. That's the same theory I'm presenting here.

Preferences for abstractions

The extreme version of this hypothesis is that this critic system works even for very abstract representations. For instance, when a human says that they value freedom, a representation of their concept of freedom triggers dopamine release.

The obvious problem with this idea is that a system that can reward itself just by thinking is dysfunctional: it has an easy mechanism to wirehead, and that will be terrible for its performance and survival. So the system needs to draw a distinction between just imagining freedom and making a plan for action that is predicted to actually produce freedom. This seems like something that a critic system can learn pretty easily. It's known that the rodent dopamine system can learn blockers, such as not predicting reward when a blue light comes on at the same time as the otherwise reward-predictive red light.
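
For what such a learned "blocker" could look like, here is a minimal sketch using the classic Rescorla-Wagner learning rule on the compound-cue schedule described above (the cue names and training schedule are invented for illustration):

```python
ALPHA = 0.1
w = {"red": 0.0, "blue": 0.0}   # associative strength of each cue

def rw_update(cues, reward):
    """Rescorla-Wagner: every cue present shares one prediction error."""
    prediction = sum(w[c] for c in cues)
    error = reward - prediction
    for c in cues:
        w[c] += ALPHA * error

for _ in range(200):
    rw_update(["red"], reward=1.0)          # red light alone -> food
    rw_update(["red", "blue"], reward=0.0)  # red + blue light -> no food

print(w)  # red stays near +1; blue ends up negative ("don't expect food now")
```

After training, the blue light carries a negative, reward-cancelling weight, so the red-plus-blue compound no longer predicts reward; an analogous negative weight on "just imagining" cues is the kind of thing the paragraph above is positing.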

The critic system would apply to everything we'd refer to as preferences. It would also extend to ascribing value to contextually useful subgoals, like coding a function as one element of finishing a piece of software, although we wouldn't ordinarily think of those as preferences. While these subgoals are much more numerous than the things we'd call our values, the two may simply lie on a continuum of how valuable, and in what contexts, the critic system judges them to be.

Advantages of a powerful critic system

I think this extreme version does a lot of work in explaining very complex human behavior, like coding up a working piece of software. Concepts like "break the problem into pieces" and "create a function that does a piece of the problem" seem to be the types of strategies and subgoals we use to solve complex problems. Breaking problems into serial steps seems like a key computational strategy that allows us to do so much more than other mammals. I've written about this elsewhere[11], but I can't really recommend those papers as they're written for cognitive neuroscientists. In future posts, I hope to write more, and more clearly, on the advantage of choosing cognitive subgoals flexibly from all of the representations one has learned.

A second advantage of a powerful critic system is in bridging the gap to rare rewards in any environment. It's generally thought that humans sometimes use model-based computations, as when we think about possible outcomes of plans. But it also seems pretty clear that we often don't evaluate our plans all the way out to actual material rewards; we don't try to look to the end of the game, but rather estimate the value of board positions after looking just a few moves deep.

And so do current SOTA RL agents: the systems I understand all use critic heads. I'm not sure whether ChatGPT uses a critic, but the AlphaZero family, the OpenAI Five family of RL agents, and others all use a critic head that shares a body with the actor head and thus has access to all of the representations the whole system learns during training. Those representations can go beyond those learned through RL: ChatGPT, EfficientZero, and almost certainly humans demonstrate the advantages of combining RL with predictive or other self-supervised learning.
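
For concreteness, here is a minimal sketch of that shared-body, two-head arrangement as a generic PyTorch module. It is not the architecture of any particular named system, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """One shared 'body' feeding two heads: a policy (actor) and a value
    estimate (critic). The critic sees every representation the body learns."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.actor_head = nn.Linear(hidden, n_actions)  # action logits
        self.critic_head = nn.Linear(hidden, 1)         # estimated discounted return

    def forward(self, obs: torch.Tensor):
        features = self.body(obs)
        return self.actor_head(features), self.critic_head(features).squeeze(-1)

# Usage: value estimates for a batch of observations
model = ActorCritic(obs_dim=16, n_actions=4)
logits, values = model(torch.randn(32, 16))
print(logits.shape, values.shape)   # torch.Size([32, 4]) torch.Size([32])
```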

Conclusions, caveats, and directions

The potential payoff of such a system is an easier handle for doing some alignment work, as discussed in the first section. I wanted to say the exciting part first.

There are a few dangling pieces of the logic:

I hope to cover all of the above in future posts.


  1. Order Matters for Deceptive Alignment [LW · GW] makes the point that alignment would be way easier if our agent already has a good set of world representations when we align it. I think this is the core assumption made when people say "won't an advanced AI understand what we want?". But it's not that easy to maintain control while an AI develops those representations. Kaj Sotala's recent The Preference Fulfillment Hypothesis [AF · GW] presents roughly the same idea. ↩︎

  2. Byrnes uses the concept of a thought assessor roughly as I'm using critic. He puts the basal ganglia as part of the thought-assessor system, whereas the actor-critic hypothesis treats the basal ganglia as part of the actor system. These appear to be purely terminological differences, positing at least approximately the same function. ↩︎

  3. Brown, J., Bullock, D., & Grossberg, S. (1999). How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues. Journal of Neuroscience, 19(23), 10502-10511.
    Hazy, T. E., Frank, M. J., & O'Reilly, R. C. (2007). Towards an executive without a homunculus: computational models of the prefrontal cortex/basal ganglia system. Philosophical Transactions of the Royal Society B: Biological Sciences, 362(1485), 1601-1613.
    Herd, S. A., Hazy, T. E., Chatham, C. H., Brant, A. M., & Friedman, N. P. (2014). A neural network model of individual differences in task switching abilities. Neuropsychologia, 62, 375-389. ↩︎

  4. Mollick, J. A., Hazy, T. E., Krueger, K. A., Nair, A., Mackie, P., Herd, S. A., & O'Reilly, R. C. (2020). A systems-neuroscience model of phasic dopamine. Psychological Review, 127(6), 972. ↩︎

  5. Brooks, A. M., & Berns, G. S. (2013). Aversive stimuli and loss in the mesocorticolimbic dopamine system. Trends in Cognitive Sciences, 17(6), 281-286. ↩︎

  6. Dopamine release for aversive events may happen to focus attention on their potential causes, for purposes of plans and actions to avoid them. ↩︎

  7. One important difference between these findings and mathematical RL is that the reward signal scales with recent experience. If I've gotten a maximum of four food pellets in this experiment, getting or learning that I will definitely get four pellets produces a large dopamine response. But if I've sometimes received ten pellets in the recent past, 4 pellets will only produce something like 4/10 the dopamine response. This could be an important difference for alignment purposes because it means that humans aren't maximizing any single quantity; their effective utility function is path-dependent. ↩︎

  8. Wake, S. J., & Izuma, K. (2017). A common neural code for social and monetary rewards in the human striatum. Social Cognitive and Affective Neuroscience, 12(10), 1558-1564. ↩︎

  9. Bromberg-Martin, E. S., & Hikosaka, O. (2009). Midbrain dopamine neurons signal preference for advance information about upcoming rewards. Neuron, 63(1), 119-126. ↩︎

  10. Montague, R. (2006). Why Choose This Book?: How We Make Decisions. E. P. Dutton, New York. ↩︎

  11. Herd, S., Krueger, K., Nair, A., Mollick, J., & O'Reilly, R. (2021). Neural mechanisms of human decision-making. Cognitive, Affective, & Behavioral Neuroscience, 21(1), 35-57.
    Herd, S. A., Krueger, K. A., Kriete, T. E., Huang, T. R., Hazy, T. E., & O'Reilly, R. C. (2013). Strategic cognitive sequencing: a computational cognitive neuroscience approach. Computational Intelligence and Neuroscience, 2013, 4-4. ↩︎

6 comments


comment by Charlie Steiner · 2023-03-18T19:39:50.368Z · LW(p) · GW(p)

I think this is a little wrong when it comes to humans, and this reflects an alignment difficulty.

Consider heroin. Heroin would really excite my reward system, and yet saying I prefer heroin is wrong. The activity of the critic governs the learning of the actor, but just because the critic would get excited by something if it happened doesn't mean that the combined actor-critic system currently acts or plans with any preference for that thing. Point being that identifying my preferences with the activity of the critic isn't quite right.

This means that making an AI prefer something is more complicated than just using an actor-critic model where that thing gets a high reward. This is also a problem faced by evolution (albeit tangled up with the small information budget evolution has to work with): if it's evolutionarily advantageous for humans' "actor" to have some behavior that it would not otherwise have, the evolved solution won't look like a focused critic that detects that behavior and rewards it, or an omniscient critic that knows how the world works from the start and steers the human straight towards the goal; it has to look like a curriculum that makes progressively-closer-to-desired behaviors progressively more likely to actually be encountered and slowly doles out reward as it notices better things happening.

Okay, maybe it's pretty fair to call the critic's reward estimate an estimate of how good things are according to the human, at that moment in time. Even ignoring the edge cases, my claim is that this is surprisingly not helpful for aligning AI to human values.

Replies from: Seth Herd
comment by Seth Herd · 2023-03-18T20:19:44.031Z · LW(p) · GW(p)

I'm not sure I'm following you. I definitely agree that human behavior is not completely determined by the critic system, and that this complicates the alignment of brain-like AGI. For instance, when we act out of habit, the critic is not even invoked until the action is completed, at the earliest, and maybe not at all.

But I think you're addressing instinctive behavior. If you throw something at my eye, I'll blink - and this might not take any learning. If an electrical transformer box blows up nearby, I might adopt a stereotyped defensive posture with one arm out and one leg up, even if I've never studied martial arts (this is a personal anecdote from a neuroscience instructor on instincts). If you put sugar in my mouth, I'll probably salivate even as a newborn.

However, those are the best examples I can come up with. I think that evolution has worked by making its preferred outcomes (or rather simple markers of them) be rewarding. The critic system is thought to derive reward from more than the four Fs; curiosity and social approval are often theorized to innately produce reward (although I don't know of any hard evidence that these are primary rewards rather than learned rewards, after looking a good bit).

Providing an expectation-discounted reward signal is one way to produce progressively-closer-to-desired behaviors. In the mammalian system, I think evolution has good reasons to prefer this route rather than trying to hardwire behaviors in an extremely complex world, and while competing with the whole forebrain system for control of behavior.

But again, I might be misunderstanding you. In any case, thanks for the thoughts!

Edits, to address a couple more points:

I think the critic system, and your conscious predictions and preferences, are very much in charge in your decision not to find some heroin even though it's reputedly the most rewarding thing you can do with a single chunk of time once you have some. You are factoring in your huge preference to not spend your life like the characters in Trainspotting, stealing and scrounging through filth for another fix. Or at least it seems that's why I'm not doing it.

The low information budget of evolution is exactly why I think it relies on hardwired reward inputs to the critic for governing behavior in mammals that navigate and learn relatively complex behaviors in relatively complex perceived environments.

It seems you're saying that a good deal of our behavior isn't governed by the critic system. My estimate is that even though it's all ultimately guided by evolution, the vast majority of mammalian behavior is governed by the critic. Which would make it a good target of alignment in a brainlike AGI system.

I'll look at your posts to see if you discuss this elsewhere. Or pointers would be appreciated.

Replies from: Charlie Steiner
comment by Charlie Steiner · 2023-03-19T04:23:38.214Z · LW(p) · GW(p)

I don't think I'm cribbing from one of my posts. This might be related to some of Alex Turner's recent posts though.

It seems you're saying that a good deal of our behavior isn't governed by the critic system. My estimate is that even though it's all ultimately guided by evolution, the vast majority of mammalian behavior is governed by the critic. Which would make it a good target of alignment in a brainlike AGI system.

I'd like to think I'm being a little more subtle. Me avoiding heroin isn't "not governed by the critic," instead what's going on is that it's learned behavior based largely on how the critic has acted so far in my life, which happens to generalize in a way that contradicts what the critic would do if I actually tried heroin.

Point is, if you somehow managed to separate my reward circuitry from the rest of my brain, you would be missing information needed to learn my values. My reward circuitry would think heroin was highly rewarding, and the fact that I don't value it is stored in the actor, a consequence of the history of my life. If I go out and become a heroin addict and start to value heroin, that information would also be found in the actor, not in the critic.

Providing an expectation-discounted reward signal is one way to produce progressively-closer-to-desired behaviors. In the mammalian system, I think evolution has good reasons to prefer this route rather than trying to hardwire behaviors in an extremely complex world, and while competing with the whole forebrain system for control of behavior.

Yeah, I may have edited in something relevant to this after commenting. The problem faced by evolution (and also by humans trying to align AI) is that the critic doesn't start out omniscient, or even particularly clever - it doesn't actually know what the expectation-discounted reward is. Given the constraints, it's stuck trying to nudge the actor to explore in maybe-good directions, so that it can make better guesses about where to nudge towards next - basically clever curriculum learning.

I bring this up because this curriculum is information that's in the critic, but that isn't identical to our values. It has a sort of planned obsolescence; the nudges aren't there because evolution expected us to literally value the nudges, they're there to serve as a breadcrumb trail that would have led us to learning evolutionarily favorable habits of mind in the ancestral environment.

Replies from: Seth Herd
comment by Seth Herd · 2023-03-19T22:11:24.716Z · LW(p) · GW(p)

Me avoiding heroin isn't "not governed by the critic," instead what's going on is that it's learned behavior based largely on how the critic has acted so far in my life, which happens to generalize in a way that contradicts what the critic would do if I actually tried heroin.

I think we're largely in agreement on this. The actor system is controlling a lot of our behavior. But it's doing so as the critic system trained it to do. So the critic is in charge, minus generalization errors.

However, I also want to claim that the critic system is directly in charge when we're using model-based thinking: when we come up with a predicted outcome before acting, the critic supplies the estimate of how good that outcome is. But I'm not even sure this is a crux. The critic is still in charge in a pretty important way.

If I go out and become a heroin addict and start to value heroin, that information would also be found in the actor, not in the critic.

I think that information would be found in both the actor and the critic, but not to exactly the same degree; I think the critic probably updates faster. And the end result of the process can be a complex interaction between the actor, a world model (which I didn't even bring into the picture in the article), and the critic. For instance, if it doesn't occur to you to think about the likely consequences of doing heroin, the decision is based on the critic's prediction that the heroin will be awesome. If the process, governed probably by the actor, does make a prediction of withdrawals and degradation as a result, then the decision is based on a rough sum that includes the critic's very negative assignment of value to that part of the outcome.
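
As a toy illustration of that rough sum, with the outcome names and values invented for the example:

```python
# The decision depends on which predicted outcomes the world model actually
# surfaces for the critic to evaluate.
critic_value = {                      # critic's learned value for each outcome
    "heroin high": +0.9,
    "withdrawal and degradation": -3.0,
    "ordinary evening": +0.2,
}

def evaluate(plan_outcomes):
    return sum(critic_value[o] for o in plan_outcomes)

# If the actor/world model never surfaces the long-term consequences:
print(evaluate(["heroin high"]))                                # +0.9 -> looks good
# If it does predict them, the critic's negative value dominates:
print(evaluate(["heroin high", "withdrawal and degradation"]))  # -2.1 -> rejected
print(evaluate(["ordinary evening"]))                           # +0.2
```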

The problem faced by evolution (and also by humans trying to align AI) is that the critic doesn't start out omniscient, or even particularly clever - it doesn't actually know what the expectation-discounted reward is.

I totally agree. That's why the key question here is whether the critic can be reprogrammed after there's enough knowledge in the actor and the world model.

As for the idea that the critic nudges, I agree. I think the early nudges are provided by a small variety of innate reward signals, and the critic then expands those with theories of the next thing we should explore, as it learns to connect those innate rewards to other sensory representations.

The critic is only representing adult human "values" as the result of tons of iterative learning between the systems. That's the theory, anyway. 

It's also worth noting that, even if this isn't how the human system works, it might be a workable scheme to make more alignable AGI systems. 

comment by rcoreilly · 2023-03-23T20:14:31.265Z · LW(p) · GW(p)

So the system needs to draw a distinction between just imagining freedom and making a plan for action that is predicted to actually produce freedom. This seems like something that a critic system can learn pretty easily. It's known that the rodent dopamine system can learn blockers, such as not predicting reward when a blue light comes on at the same time as the otherwise reward-predictive red light.

 

There are 2 separable problems here: A. can a critic learn new abstract values?; B. how does the critic distinguish reality from imagination?  I don't see how blocking provides a realistic solution to either?  Can you spell out what the blocker is and how it might solve these problems?

In general, these are both critical problems with the open-ended "super critic" hypothesis -- how does Montague deal with these? So far, I don't see any good solution except a strong grounding to basic survival-relevant values, as any crack in the system seems like it will quickly spiral out of control, much like heroin..

I'm a fan of Tomasello's idea that social & sharing motivations provide the underlying fixed value function that drives most of human open-ended behavior.  And there is solid evidence that humans vs. chimps differ strongly in these basic motivations, so it seems plausible that it is "built in" -- curious to hear more about your doubts on that data?

In short, I strongly doubt that an open-ended critic is viable: it is just too easy to short-circuit (wirehead).  The socially-grounded critic also has a strong potential for bad local minima: basically the "mutual admiration society" of self-reinforcing social currency.  The result is cults of all forms, including that represented by one of the current major political parties in the US... But inevitably these are self-terminating when they conflict strongly with more basic survival values..

Replies from: Seth Herd
comment by Seth Herd · 2023-03-24T22:07:50.758Z · LW(p) · GW(p)

I don't think Montague dealt with that issue much if at all. But it's been a long time since I read the book.

My biggest takeaway from Tomasello's work was his observation that humans pay far more attention to other humans than monkeys do to monkeys. Direct reward for social approval is one possible mechanism, but it's also possible that it's some other bias in the system. I think hardwired reward for social approval is probably a real mechanism. But it's also possible that the correlation between people's approval and the even more direct rewards of food, water, and shelter plays a large role in making human approval and disapproval a conditioned stimulus (or a fully "substituted" stimulus). But I don't think that distinction is very relevant for guessing the scope of the critic's associations.

But inevitably these are self-terminating when they conflict strongly with more basic survival values.

I completely agree. This is the basis of my explanation for how humans could attribute value to abstract representations and not wirehead. In sum, a system smart enough to learn about the positive values of several-steps-removed conditioned stimuli can also learn many indicators of when those abstractions won't lead to reward. These may be cortical representations of planning-but-not-doing, or other indicators in the cortex of the difference between reality and imagination. The weaker nature of simulation representations may be enough to distinguish the two, and it should certainly be enough to ensure that real rewards and punishments always have a stronger influence, keeping imagination ultimately under the control of reality.

If you've spent the afternoon wireheading by daydreaming about how delicious that fresh meat is, you'll be very hungry in the evening. Something has gone very wrong, in much the same way as if you chose to hunt for game where there is none. In both cases, the system is going to have to learn where the wrong decision was made and the wrong strategy was followed. If you're out of a job and out of money because you've spent months arguing with strangers on the internet about your beloved concept of freedom and the type of political policy that will provide it, something has similarly gone wrong. You might downgrade your estimated value of those concepts and theories, and you might downgrade the value of arguing on the internet with strangers all day.

The same problem arises with any use of value estimates to make prediction-based decisions. It could be that the dopamine system is not involved in these predictions. But given the data that dopamine spiking activity is ubiquitous[1] even when no physical or social rewards are present, it seems likely to me that the system is working the same way in abstract domains as it is known to work in concrete ones.

  1. ^

    I need to find this paper, but don't have time right now. The finding was that rodents exploring a new home cage exhibit dopamine spiking activity something like once a second or so on average. I have a clear memory of the claim, but didn't evaluate the methods closely enough to be sure the claim was well supported. If I'm wrong about this, I'd change my mind about the system working this way.

    This could be explained by curiosity as an innate reward signal, and that might well be part of the story. But you'd still need to explain why animals don't die by exploring instead of finding food. The same core explanation works for both: imagination and curiosity are both constrained to be weaker signals than real physical rewards.