Reward is not Necessary: How to Create a Compositional Self-Preserving Agent for Life-Long Learning
post by Roman Leventov · 2023-01-12T16:43:42.357Z · LW · GW · 2 comments
This is a link post for https://arxiv.org/abs/2211.10851
There is an idea in the AI safety community that expected utility maximisation is the "limit" of agency [LW(p) · GW(p)].
I think the linked paper by Thomas J. Ringstrom demonstrates that this idea is wrong.
Explainer Twitter Thread by the author.
Abstract:
We introduce a physiological model-based agent as proof-of-principle that it is possible to define a flexible self-preserving system that does not use a reward signal or reward-maximization as an objective. We achieve this by introducing the Self-Preserving Agent (SPA) with a physiological structure where the system can get trapped in an absorbing state if the agent does not solve and execute goal-directed policies. Our agent is defined using a new class of Bellman equations called Operator Bellman Equations (OBEs), for encoding jointly non-stationary non-Markovian tasks formalized as a Temporal Goal Markov Decision Process (TGMDP). OBEs produce optimal goal-conditioned spatiotemporal transition operators that map an initial state-time to the final state-times of a policy used to complete a goal, and can also be used to forecast future states in multiple dynamic physiological state-spaces. SPA is equipped with an intrinsic motivation function called the valence function, which quantifies the changes in empowerment (the channel capacity of a transition operator) after following a policy. Because empowerment is a function of a transition operator, there is a natural synergism between empowerment and OBEs: the OBEs create hierarchical transition operators, and the valence function can evaluate hierarchical empowerment change defined on these operators. The valence function can then be used for goal selection, wherein the agent chooses a policy sequence that realizes goal states which produce maximum empowerment gain. In doing so, the agent will seek freedom and avoid internal death-states that undermine its ability to control both external and internal states in the future, thereby exhibiting the capacity of predictive and anticipatory self-preservation. We also compare SPA to Multi-objective RL, and discuss its capacity for symbolic reasoning and life-long learning.
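To make the empowerment/valence machinery a bit more concrete, here is a minimal toy sketch (my own, not the authors' code, and much simpler than the paper's hierarchical operators): empowerment of a state is computed as the channel capacity of its action → next-state transition operator via the Blahut-Arimoto algorithm, and "valence" is the empowerment change between the start and end states of a policy. The function names and the 3-state toy world are illustrative assumptions, not the paper's notation.

```python
# Toy sketch (assumed names, not the paper's code): empowerment as channel
# capacity of an action -> next-state transition operator, and valence as
# the empowerment change induced by following a policy.
import numpy as np

def channel_capacity(P, iters=200, tol=1e-9):
    """Capacity (in nats) of a channel P[a, s'] = Pr(next state s' | action a),
    via the Blahut-Arimoto algorithm."""
    n_actions, _ = P.shape
    r = np.full(n_actions, 1.0 / n_actions)          # action distribution
    for _ in range(iters):
        q = r[:, None] * P                            # joint r(a) P(s'|a)
        q /= q.sum(axis=0, keepdims=True) + 1e-12     # posterior q(a|s')
        log_r_new = np.sum(P * np.log(q + 1e-12), axis=1)
        r_new = np.exp(log_r_new - log_r_new.max())
        r_new /= r_new.sum()
        if np.max(np.abs(r_new - r)) < tol:
            r = r_new
            break
        r = r_new
    marginal = r @ P                                  # Pr(s') under optimal action mix
    return np.sum(r[:, None] * P * np.log((P + 1e-12) / (marginal + 1e-12)))

def empowerment(T, state):
    """Empowerment of a state: capacity of its action->next-state channel.
    T has shape (n_states, n_actions, n_states)."""
    return channel_capacity(T[state])

def valence(T, start_state, end_state):
    """Valence of a policy that takes the agent from start_state to end_state:
    the resulting change in empowerment."""
    return empowerment(T, end_state) - empowerment(T, start_state)

# Toy 3-state world where state 2 is an absorbing "death" state.
T = np.zeros((3, 2, 3))
T[0, 0, 1] = 1.0; T[0, 1, 2] = 1.0   # from state 0, actions lead to 1 or 2
T[1, 0, 0] = 1.0; T[1, 1, 1] = 1.0   # state 1 keeps both options open
T[2, :, 2] = 1.0                     # state 2 traps the agent
print(valence(T, 0, 1), valence(T, 0, 2))
```

In this toy world the absorbing state has zero empowerment, so any policy sequence ending there gets negative valence (here, 0.0 vs. roughly -0.69 nats), which is the qualitative behaviour the abstract describes: the agent selects goals that avoid death-states and preserve its future ability to control its states.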
Relevant paragraph:
An important difference between MORL (Multi-Objective Reinforcement Learning) and HTGMDP (Hierarchical Temporal Goal Markov Decision Process) agents is that MORL agents represent the alternative objectives as primal value functions for policies on X, whereas HTGMDP agents represent these lifted objectives as state-spaces in R = W × Y × ... × Z controlled by policies on X and framed as a reachability problem. This difference between primal objective (value) and lifted objective (state-space) is significant because in MORL, value functions are derived from Bellman equations which have an additive form that results in the accumulation of a quantity over time without any imposed structure on the accumulation (other than discount factors). Alternatively, the HTGMDP represents the effect of following some policy in other state-spaces with their own transition dynamics, which are not necessarily required to have a chain-like structure. This means that, while an infinite chain-like state-space could record unbounded reward accumulation just like a value function, such state-spaces can also record any other kind of process that cannot be modeled as reward accumulation. For example, in some video games, an agent may have a cap (such as 999) on how much money it can carry. The designer of a MORL agent may model wealth as one of many possible objectives using a standard reward approach, representing the accumulation of wealth independent of limits to the amount of wealth that the agent can effectively possess (or physically carry and transport). Alternatively, the HTGMDP can represent processes such as (but not limited to) wealth or energy dynamics with floor or ceiling limits, or bit-vector dynamics for recording object possession, making the HTGMDP qualitatively different and more expressive. Also, because alternative objectives are represented as state-spaces, all high-level states of an HTGMDP agent can condition its policy, and they can also be used for new tasks (spending money from a previous task to obtain an item in a new task), whereas the value functions of a MORL agent do not serve this role. This is an important property for open-ended life-long agents which are constantly updating the states of many state-spaces.
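As a deliberately small illustration of the quoted distinction (my own toy, not the paper's formalism), here is the same stream of "earn money" events recorded two ways: an additive, reward-style running total, and a lifted wealth state with its own transition rule that enforces the 999 cap from the video-game example. The function names and the cap constant are assumptions for the sketch.

```python
# Toy sketch: additive reward accumulation vs. a lifted state-space with
# structured (capped) dynamics, per the video-game wealth example.
WEALTH_CAP = 999  # hypothetical ceiling on how much the agent can carry

def accumulate_reward(total, reward):
    """Additive Bellman-style bookkeeping: just keeps summing."""
    return total + reward

def wealth_transition(wealth, earned):
    """Lifted state-space dynamics: wealth is a state with its own
    transition rule, here a hard ceiling at WEALTH_CAP."""
    return min(wealth + earned, WEALTH_CAP)

earnings = [400, 400, 400]
total, wealth = 0, 0
for e in earnings:
    total = accumulate_reward(total, e)
    wealth = wealth_transition(wealth, e)

print(total)   # 1200 -- unbounded accumulation, no imposed structure
print(wealth)  # 999  -- the state-space itself encodes the carrying limit
```

The point of the contrast is that the capped wealth variable is a state the agent's policy can condition on and reuse in later tasks, whereas the accumulated total is only a scalar summary of past reward.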
2 comments
comment by particlemania · 2023-01-12T19:08:41.712Z · LW(p) · GW(p)
Rewards and Utilities are different concepts. To reject that reward is necessary to get/build agency is not the same thing as rejecting EU maximization as a basin of idealized agency.
comment by Roman Leventov · 2023-01-12T23:25:49.054Z · LW(p) · GW(p)
The relevant paragraph that I quoted refutes exactly this. In the bolded sentence, "value function" is used as a synonym for "utility function". You simply cannot represent an agent that always seeks to maximise "empowerment" (as defined in the paper for self-preserving agents), for example, or always seeks to minimise free energy (as in Active Inference agents), as maximising some quantity over its lifetime: if you integrate empowerment or free energy over time, you don't get a sensible information quantity that you can label as "utility".
This is an uncontroversial idea, and is not a contribution of the paper. The paper's contribution is a formal demonstration that such agents are "stable" and "self-preserving"; previously, this hadn't been formally shown for arbitrary Active Inference agents.
Note that the fact that these agents are not utility maximisers doesn't mean they don't instrumentally converge. Cf. https://www.lesswrong.com/posts/ostLZyhnBPndno2zP/active-inference-as-a-formalisation-of-instrumental [LW · GW]. I haven't read the full paper yet; maybe I will see how the framework in there could admit mild optimisation, but so far I don't see how.