The Theoretical Reward Learning Research Agenda: Introduction and Motivation
post by Joar Skalse (Logical_Lunatic) · 2025-02-28
At the time of writing, I have just (nearly) finished my PhD at Oxford. During that time, most of my research has been motivated by the goal of developing a theoretical foundation for the field of reward learning. The purpose of this sequence is to explain and motivate this research agenda, and to provide an accessible overview of some of my results. It will be divided into several smaller posts:
- First, in this post, I will define and motivate the theoretical reward learning research agenda, including my aims and background assumptions.
- Over a series of subsequent posts, I will provide overviews of and comments on some of the existing research within this research agenda.
- Finally, in the last post, I will also provide some resources for anyone who wants to contribute to this (or similar) research, in the form of both open problems and some thoughts on how these problems could be approached.
I hope that this sequence will make it easier to get an overview of my recent research. I also want to welcome discussion, so please leave a comment if you have any questions. Finally, I think this research agenda is promising, and it has proven to be very tractable, yet not many people work on it (or in related areas). I therefore also want to advertise this research, and make this post a good starting point for people who may be interested in working on similar questions themselves.
Defining the Theoretical Reward Learning Research Agenda
In one sentence, the aim of this research agenda is to develop a mature theoretical foundation for the field of reward learning (and relevant adjacent areas). This is very closely related to work that has been done in the AI safety space for quite some time, but I have approached it from a slightly different angle. Concretely, this research agenda involves answering questions such as:
1. What is the right method for expressing goals and instructions to AI systems?
- The current default answer to this question is “exponentially discounted scalar Markovian reward functions." However, this is not the only option, and it is not self-evident that it is the right choice. Some other notable options include e.g. multi-objective RL, temporal logic, or different kinds of non-Markovian rewards.
- Some relevant criteria for evaluating a specification language include:
- How expressive is the language? Are there things it cannot express?
- How intuitive is it for humans to work with?
- The way that every other question within this research agenda is approached depends on how you answer this question. While “reward functions” is probably a very reasonable answer, it is worthwhile to occasionally reexamine this choice.
- (I also feel the need to briefly mention that the von Neumann–Morgenstern (vNM) utility theorem doesn’t imply that Markov rewards can express any “reasonable” preferences in the RL setting; the vNM theorem relies on several assumptions that do not hold in the RL setting. For details, see e.g. this paper.)
2. How similar must two goal specifications be in order for it to be safe to optimise one in place of the other (i.e., for the pair to not be hackable)?
- For example, suppose that the behaviour which we desire from an AI system is exactly captured by some reward function R_1, but that the AI instead maximises some other reward function R_2. Under what conditions would maximisation of R_2 lead to behaviour that is good (or at least not detrimental) for R_1? Can we characterise these conditions mathematically?
3. What is the right way to quantify the differences and similarities between different goal specifications in a given specification language?
- For example, what is the right way to quantify the difference between two reward functions? A naïve answer might be to measure their L^2-distance. However, two reward functions may induce the same preferences between all pairs of policies and have an arbitrarily large L^2-distance, or induce the exact opposite preferences between all pairs of policies and have an arbitrarily small L^2-distance, so this choice is not very good. (A toy numerical sketch of the first failure mode is given after this list.)
- The answer to this question should be something like a metric over some type of task specification (such as reward functions), according to which two task specifications have a small distance if and only if they are “similar” (in some relevant and informative sense).
4. What happens if you execute a task specification that is not close to the “ideal” specification, in the relevant sense?
- For example, a complete answer to question (2) would be a set of necessary and sufficient conditions on two reward functions R_1 and R_2 which characterise when it would be acceptable (as measured by R_1) to maximise R_2 instead of R_1. Can we say anything concrete about what happens if we maximise R_2 when these conditions are not satisfied?
- Similarly, a complete answer to (3) would be a (pseudo)metric d on the space of all reward functions which quantifies their similarity. Can we say anything concrete about what might happen when we maximise R_2 instead of R_1, based on the value of d(R_1, R_2)? For example:
- As we increase d(R_1, R_2), does the amount of obtained R_1-reward decrease smoothly or sharply? Is it predictable?
- Can we express some form of regret bound (or similar) in terms of d(R_1, R_2)?
- How do these errors depend on how much optimisation pressure we exert, and other relevant parameters?
- Are there any distinct failure modes that could be individuated and characterised? Some notable possible candidates include:
- Goodhart’s Law.
- Convergent Instrumental Subgoals.
5. Which specification learning algorithms are guaranteed to converge to a good specification?
- Given that we have answered 1-4, we now know how to express instructions to AI systems, and we know what it takes for those instructions to be sufficiently accurate. The next question is whether or not a given reward learning method is guaranteed to converge to a reward function that is sufficiently accurate in this sense. For example, if we run RLHF forever, are we guaranteed to eventually have a good reward function? What about inverse reinforcement learning?
- If a given specification learning algorithm is guaranteed to converge to a good specification, can we say anything about its sample complexity? Could we develop PAC-like generalisation bounds, for example?
- If we are not guaranteed to converge to a good reward function, can we say anything about the errors that might persist in the limit? Can they be characterised, and can their potential consequences be described?
- (Note that the primary purpose of answering questions 1-4 is that they are prerequisites for this question.)
6. How sensitive are these specification learning algorithms to misspecification?
- Reward functions (and other specifications) are not directly observable in any type of data. This means that any reward learning algorithm must make assumptions about how the underlying reward function relates to the observed data. For example, RLHF assumes that the labels directly correspond to the underlying reward function, modulo some Boltzmann noise. Similarly, an IRL algorithm must make some assumptions about the relationship between the observed policy and the preferences underlying it.
- A relevant question is now how these reward learning algorithms will behave, if they are applied to data which violates their underlying assumptions (in some specific way). For example, what happens if an IRL algorithm which assumes that the demonstrator policy discounts exponentially is shown data from an agent that discounts hyperbolically? Will the learnt reward function still be mostly OK, or will it degrade completely?
- Note that this question is crucial. Any reward learning method will invariably rely on certain (implicit or explicit) assumptions about both ethics and human psychology, and we should not expect to get these assumptions exactly right. It therefore matters a great deal how sensitive the algorithm is to small and moderate errors in these assumptions. We should only trust a reward learning method that is at least reasonably robust to such errors.
7. If we have a bound on the error in a specification (under some metric), can we devise safe methods for optimising it?
- For example, suppose we have a metric d over the space of all reward functions, and that we have reason to believe that a given reward function R_2 is within a distance ε of the ideal reward function R_1 (based on our learning algorithm, amount of training data, etc.). What is the best way to optimise R_2, in light of this information? Should it just be maximised naïvely, or are there better methods?
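To make the issue in question 3 concrete, here is a minimal numerical sketch (my own toy illustration, not taken from any of the papers below): two reward functions that differ only by a positive rescaling induce exactly the same preferences over policies, yet their naïve L^2-distance can be made arbitrarily large. A simple normalisation removes the scale dependence, but a satisfactory pseudometric also has to account for other invariances (such as potential shaping), which is what motivates more careful constructions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 4

# R2 is a positively rescaled copy of R1, so every policy gets 100x the return
# under R2 that it gets under R1 -- the induced preferences over policies are
# identical, and the optimal policies coincide.
R1 = rng.normal(size=(n_states, n_actions))
R2 = 100.0 * R1

naive_distance = np.linalg.norm(R1 - R2)  # grows without bound as we rescale

def scale_normalised_distance(Ra, Rb):
    # A toy scale-invariant distance: compare the rewards after projecting
    # them onto the unit sphere. (This only handles rescaling; it does not
    # handle potential shaping or other equivalences between rewards.)
    return np.linalg.norm(Ra / np.linalg.norm(Ra) - Rb / np.linalg.norm(Rb))

print(naive_distance)                     # large
print(scale_normalised_distance(R1, R2))  # approximately 0
```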
If we had satisfactory answers to these questions, then we would have a much better understanding of which reward learning methods are reliable (if any). For example, given a reward learning algorithm, we would be able to answer questions such as:
- Does this reward learning algorithm eventually converge to a good reward function?
- If so, what assumptions are crucial to ensure that this is the case?
- If not, what errors and failure modes should we expect to see?
- What kinds of training data do we need, and how much data do we need?
- What is the best way to optimise the resulting reward function, given that it may be subject to the errors described above? (See the toy sketch below.)
And so on.
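As a toy illustration of the last question above, here is a minimal sketch (my own, and a drastic simplification: a one-shot choice between a handful of options rather than a sequential decision problem). The assumption is that the learnt reward estimate for each option lies within a known per-option error bound of the true reward; one candidate alternative to naïve maximisation is then to maximise a pessimistic lower bound instead.

```python
import numpy as np

rng = np.random.default_rng(1)
n_options = 8

# Unknown true rewards, and per-option error bounds (e.g. reflecting how much
# training data covered each option).
true_reward = rng.normal(size=n_options)
epsilon = rng.uniform(0.1, 2.0, size=n_options)

# Learnt estimates, guaranteed to lie within +/- epsilon of the truth.
learnt_reward = true_reward + rng.uniform(-1.0, 1.0, size=n_options) * epsilon

naive_choice = int(np.argmax(learnt_reward))
# Pessimism: maximise a guaranteed lower bound on the true reward. This guards
# against options that only look good because of estimation error.
pessimistic_choice = int(np.argmax(learnt_reward - epsilon))

print("naive:", naive_choice, "true reward:", true_reward[naive_choice])
print("pessimistic:", pessimistic_choice, "true reward:", true_reward[pessimistic_choice])
```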
I have made substantial progress on most of the questions above, and with a bit more effort, I think we could soon have a complete solution to most (or all) of them.
Motivation and Background
There are several reasons why I am excited about the theoretical reward learning research agenda. First of all, it tackles one of the core difficulties of AI safety in a fairly direct way — namely, the difficulty of how to specify what we want AI systems to do (aka “outer alignment”). Progress on this research agenda therefore seems likely to be useful in a fairly broad range of scenarios, and applicable to a fairly broad range of AI systems (in contrast to research that is closely tied to specific alignment strategies, specific types of AI systems, or very specific assumptions).
Moreover, I want AI safety solutions that produce explicit, quantitative safety guarantees that are underpinned and motivated by explicit, auditable assumptions. I don’t think that purely empirical methods are adequate for producing safety assurances that are satisfactory or acceptable for very powerful AI systems. A very mature form of mechanistic interpretability could produce such quantitative safety guarantees, but at the moment, the field of interpretability does not appear to be making progress sufficiently quickly. An alternative is to turn to formal learning theory [LW · GW], which is what the theoretical reward learning research agenda does — such research can yield quantitative safety guarantees in the form of convergence guarantees, generalisation bounds, regret bounds, and the like, and if such results are formulated properly, then they can provide useful constraints on the behaviour of even superhuman AI systems. This general approach thus has some hope to provide useful quantitative safety guarantees (in addition to other benefits, which I discuss below).
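To give a sense of what such a guarantee could look like, here is a schematic sketch of the general form (my own, not a theorem from any particular paper): if a learnt reward function is close to the true one under a suitable pseudometric, then the regret of any policy optimised against the learnt reward should be controlled by that distance.

```latex
% Schematic form of the kind of quantitative guarantee described above
% (a sketch of the general shape, not a theorem from any specific paper).
% J_R(\pi) is the expected return of policy \pi under reward function R,
% \hat{R} is the learnt reward, R^\star is the true reward, d is a suitable
% pseudometric on reward functions, and C is a constant that may depend on
% the environment and on the choice of d.
\[
  d(\hat{R}, R^\star) \le \epsilon
  \quad \Longrightarrow \quad
  \max_{\pi} J_{R^\star}(\pi) - J_{R^\star}(\hat{\pi}) \le C \epsilon,
\]
% where \hat{\pi} is any policy that is optimal for the learnt reward \hat{R}.
```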
When I explain my research to people, I sometimes get the reaction, “Why work on reinforcement learning? Isn’t that outdated, compared to LLMs?” Let me therefore also preempt this question. In general, there are many ways to get an AI system to do what we want — for example, we can use supervised learning, imitation learning, prompting, or reward maximisation. In some cases we can also use more exotic methods, such as direct manipulation of latent activation vectors [? · GW] in trained models. Moreover, reward maximisation is not always the best (or most practical) method. For example, LLMs are primarily trained through supervised learning, and then fine-tuned through (regularised) reward optimisation. However, the reward optimisation step is not essential, and we can obtain similar results by only using supervised fine-tuning or prompting. So what makes reward functions important?
The fundamental reason I think reward optimisation is especially important (among all methods for specifying goals to AI systems) is that it is the only method that is not inherently limited by what we already know how to do. For example, suppose that we want to design a chess-playing AI. We can do this using only supervised learning and imitation learning (by training on chess games between strong human players). Moreover, we may even be able to get slightly superhuman performance this way, by combining the best play of many different human players. However, if we want to design a chess-playing AI that can invent completely new strategies and entirely outclass human chess players, then we must use something analogous to reward maximisation (together with a search algorithm, an RL algorithm, or some other alternative to these). The same principle holds for other domains; without reward optimisation, we are always anchored to current human performance. Right now, LLMs are mainly used to automate things that humans already know how to do. However, when we want to move beyond that, I expect that reward optimisation (or something closely analogous to it) will become important again.
I therefore still expect that we will, at some point, have to provide an AI system with a reward function (or something very similar), and tell it to optimise that reward function within some large and open-ended domain. If that is the case, then we will want to know whether the method that was used to create that reward function is reliable, and what assumptions are required to ensure that it is reliable. We will also want to know what kinds of failure modes we should expect, and whether there are any ways to mitigate those failure modes. Answering those questions in turn requires a better theoretical understanding of the dynamics of reward learning and reward optimisation, in line with the questions I outlined above.
It is also important to note that none of this relies on the assumption that the AI system is trained through reinforcement learning (with Bellman updates, etc). “Reinforcement learning” refers both to a problem setting and to a set of algorithms; the research agenda I have outlined is about the problem setting, not the algorithms. The problem setting is in turn highly general, and essentially covers the domain of all sequential decision problems.
Another reason I am excited about the theoretical reward learning research agenda is that it is (plausibly) important to have a deeper general understanding of the dynamics of pursuing goals, regardless of how exactly those goals are pursued. For example, game theory is highly relevant for understanding the dynamics of conflicts and strategic interactions in many different kinds of situations, even if the agents involved in those situations aren’t explicitly trying to compute Nash equilibria. Similarly, a better understanding of reward optimisation may be informative in many kinds of situations, even if those situations don’t involve agents that explicitly try to optimise reward functions.
Moreover, progress on this research agenda seems to be very tractable. I have personally made notable progress on several of these questions in a fairly short period of time, and I don’t anticipate that there would be any substantial difficulties in making even more progress. If this research agenda gets a bit more attention, then I think it would be entirely realistic to develop something like a “mathematical theory of outer alignment” on a timescale of maybe 1-3 years (and this is not accounting for the possibility of getting strong AI proof assistants within that time span).
This being said, I should also mention that there (of course) are important areas of AI safety that this research agenda doesn’t touch on at all. Most notably, it does not help with interpretability or inner alignment (except possibly that it may have something to say about what kinds of circuits we should look for in AI systems). Moreover, inner alignment may (at least plausibly) be a bigger problem than outer alignment (though this is also unclear).
Special Focus On IRL
Many of my recent papers have had a special focus on inverse reinforcement learning (IRL). In this section, I will briefly explain the unifying theme of these papers, and explain why I have focused on these problems in particular.
An IRL algorithm attempts to infer a person’s preferences based on their behaviour — this is usually formalised as the problem of inferring a reward function R from a policy π that has been computed from R. To do this, we need a model of how π relates to R (in other words, a model of how a person’s behaviour relates to their preferences). This is called a behavioural model. The behavioural model may in turn be specified manually, or learnt from data.
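As a concrete example, here is a minimal sketch of one standard behavioural model, Boltzmann rationality, under which the demonstrator picks actions with probability proportional to the exponential of their optimal Q-values (the function and parameter names below are my own):

```python
import numpy as np

def boltzmann_policy(q_values: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Boltzmann-rational behavioural model.

    q_values: array of shape (n_states, n_actions) containing the optimal
              Q-values computed from the (unobserved) reward function R.
    beta:     inverse temperature; beta -> infinity recovers a perfectly
              rational demonstrator, beta -> 0 a uniformly random one.
    Returns an array of action probabilities with the same shape.
    """
    logits = beta * q_values
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# An IRL algorithm built on this model observes (an estimate of) these action
# probabilities, and tries to invert the map R -> Q* -> policy to recover R.
```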
Of course, the relationship between a person’s preferences and their behaviour is in reality extremely complex. Indeed, a complete solution to this problem would also amount to a solution to the problem of how to create AGI. As such, while we can use various methods to create behavioural models that are more and more accurate, we should not expect to be able to create behavioural models that are perfectly accurate (regardless of how they are created). Therefore, we must expect the behavioural model to be misspecified in real-life cases.
This now raises the question of how robust IRL is to misspecification of the behavioural model. Is a mostly accurate behavioural model enough to ensure that the learnt reward function is mostly accurate, or can slight errors in the behavioural model lead to large errors in the inferred reward function? Stated differently, is the error in the inferred reward function somehow proportional to the error in the behavioural model? If it is, then IRL may be a feasible approach to learning representations of human preferences, whereas if it is not, then we should probably give up on IRL. I have dedicated a number of papers to investigating this question.
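To illustrate what a misspecified behavioural model can look like in practice, here is a small sketch (my own, with illustrative parameter values) comparing the exponential discounting that many IRL algorithms assume with the hyperbolic discounting that the demonstrator might actually use. The two weightings agree at the start of a trajectory but diverge substantially later on; the open question is how much such a discrepancy distorts the inferred reward function.

```python
import numpy as np

horizon = 30
t = np.arange(horizon)

gamma, k = 0.9, 0.5               # illustrative parameter values
exponential = gamma ** t          # discounting assumed by the IRL algorithm
hyperbolic = 1.0 / (1.0 + k * t)  # discounting actually used by the demonstrator

# Relative disagreement between the assumed and the actual time preferences:
relative_error = np.abs(exponential - hyperbolic) / hyperbolic
print(np.round(relative_error[:10], 3))
# Both weightings equal 1 at t = 0, but they diverge as t grows, so behaviour
# that is perfectly consistent with the demonstrator's true preferences can
# look systematically "irrational" under the assumed model.
```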
There are several reasons why I have focused on this question in particular, out of all the potential questions in the theoretical reward learning research agenda. The first reason is (very simply) that it would be very good if we could get IRL to work. IRL is the only reward learning method that is self-supervised, which means that we have far (far) more data for IRL algorithms than we do for other reward learning algorithms (such as RLHF). However, in order to know if IRL can be trusted, we must first understand how IRL is affected by misspecification. Is a roughly plausible model of bounded rationality enough to learn a pretty good reward function? If so, IRL will probably work! However, if IRL is highly sensitive to misspecification, then it will probably not work.
However, this is not my only reason for working on IRL. Another important reason is that I believe that in order to make deep progress in research, it is often necessary to tackle problems that are sufficiently hard. If a problem is too easy, it may be possible to solve it through shortcuts, whereas if it’s hard enough, then it cannot be solved without also making progress on deeper underlying issues. This “deeper” progress is in turn often useful independently of the particular problem that it was used to solve. The issue of how sensitive IRL is to misspecification is one of the (seemingly) hardest concrete questions I have identified within the theoretical reward learning research agenda, and so it is a good candidate for a problem that is “sufficiently hard”. In other words, I also see the problem of misspecification in IRL as a kind of “research whetstone” which may lead to further and more general progress in the theory of reward learning.
Another reason I think is worth mentioning is that there is a kind of duality between IRL and reward optimisation. In IRL, we go from a policy to a reward function, whereas in reward optimisation, we go from a reward function to a policy (this is, after all, what the word “inverse” refers to in “inverse reinforcement learning”). This means that in order to reason about the properties of IRL, we need to reason about the relationship between reward functions and (different types of) policies. Such reasoning may in turn also be useful for reasoning about the process of producing policies from reward functions. In other words, the specific subproblems involved in determining how sensitive IRL is to misspecified behavioural models overlap heavily with the specific subproblems involved in determining the consequences of incorrectly specified reward functions (and this problem, in turn, relates to almost all other questions in the theory of reward learning).
A final reason is that some people in the AI safety research community have a strategy for AI safety that uses IRL as a cornerstone (usually in the form of cooperative inverse reinforcement learning, or CIRL). This adds a further reason to scrutinise the overall theoretical feasibility of IRL.
Papers
So far, these are the papers I have written or contributed to that I consider to be part of the theoretical reward learning research agenda. This includes papers written mostly by me, papers written in collaboration with other people, and papers written by other people under my supervision:
- Partial Identifiability and Misspecification in Inverse Reinforcement Learning
- Defining and Characterizing Reward Gaming
- Invariance in Policy Optimisation and Partial Identifiability in Reward Learning
- Misspecification in Inverse Reinforcement Learning
- STARC: A General Framework For Quantifying Differences Between Reward Functions
- Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification
- Goodhart's Law in Reinforcement Learning
- The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret
- On the Limitations of Markovian Rewards to Express Multi-Objective, Risk-Sensitive, and Modal Tasks
- On the Expressivity of Objective-Specification Formalisms in Reinforcement Learning
- Characterising Partial Identifiability in Inverse Reinforcement Learning For Agents With Non-Exponential Discounting
Here is also a selection of papers I had no involvement with, but which I consider to be contributing towards the broader aims of the theoretical reward learning research agenda, and which I want to especially highlight:
- Occam’s Razor is Insufficient to Infer the Preferences of Irrational Agents
- Optimal Policies Tend to Seek Power
- Parametrically Retargetable Decision-Makers Tend to Seek Power
- Identifiability and Generalizability in Constrained Inverse Reinforcement Learning
- On the Expressivity of Markov Reward
- Settling the Reward Hypothesis
- Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach
- Utility Theory for Sequential Decision Making
- Rational Multi-Objective Agents Must Admit Non-Markov Reward Representations
- Splitting Randomized Stationary Policies in Total-Reward Markov Decision Processes
- Benefits of Assistance Over Reward Learning
(This list is not exhaustive — more relevant papers may be found here, and in the references of the papers linked above.)
It is worth noting that Partial Identifiability and Misspecification in Inverse Reinforcement Learning is a long paper that combines the results from Invariance in Policy Optimisation and Partial Identifiability in Reward Learning, Misspecification in Inverse Reinforcement Learning, Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification, and STARC: A General Framework For Quantifying Differences Between Reward Functions. If you plan on reading all of these papers in detail, then it may be a good idea to instead just read this paper — it combines all the results in one place, with a unified framing and terminology, has streamlined some things compared to the earlier papers, and also contains a small handful of new results.
The Rest of this Sequence
In the coming entries of this sequence, I will provide in-depth summaries of some of my papers, and explain their setup and results in more detail (but less detail than what is provided in the papers themselves). Specifically, the papers I will cover are:
- Invariance in Policy Optimisation and Partial Identifiability in Reward Learning
- Misspecification in Inverse Reinforcement Learning
- STARC: A General Framework For Quantifying Differences Between Reward Functions
- Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification
- Defining and Characterizing Reward Gaming
In addition to this, there will also be a post giving shorter overviews of some other papers and results from this research agenda, but without going into as much depth as the posts above. Finally, in the last entry of this sequence, I will also attempt to provide some helpful starting points for anyone who may be interested in contributing to this research themselves, in the form of a list of a few concrete open problems and some tools for approaching them.
If you have any questions, then please ask them in the comments!
Comments
comment by Chris_Leong · 2025-03-01
> First of all, it tackles one of the main core difficulties of AI safety in a fairly direct way — namely, the difficulty of how to specify what we want AI systems to do (aka “outer alignment”)
I wouldn't quite go so far as to say it "tackles" the problem of outer alignment [? · GW], but it does tie into (pragmatic) attempts to solve the problem by identifying the ontology of realistically specifiable reward functions. However, maybe I'm misunderstanding you?