## Posts

Safely controlling the AGI agent reward function 2021-02-17T14:47:00.293Z
Graphical World Models, Counterfactuals, and Machine Learning Agents 2021-02-17T11:07:47.249Z
Disentangling Corrigibility: 2015-2021 2021-02-16T18:01:27.952Z
Creating AGI Safety Interlocks 2021-02-05T12:01:46.221Z
Counterfactual Planning in AGI Systems 2021-02-03T13:54:09.325Z
New paper: AGI Agent Safety by Iteratively Improving the Utility Function 2020-07-15T14:05:11.177Z
The Simulation Epiphany Problem 2019-10-31T22:12:51.323Z
New paper: Corrigibility with Utility Preservation 2019-08-06T19:04:26.386Z

Comment by Koen.Holtman on Finite Factored Sets · 2021-05-25T13:59:34.238Z · LW · GW

# Overcoming blindness

You mention above that Pearl's ontology 'has blinded us to the obvious next question'. I am very sympathetic to research programmes that try to overcome such blindness; this is the kind of research I have been doing myself recently. The main type of blindness that I have been trying to combat is blindness to complex types of self-referencing and indirect representation that can be present inside online machine learning agents. Specifically, in my recent work I have added a less blind viewpoint by modifying and extending Pearl's causal graphs, so that you end up with a two-causal-diagram model of agency and machine learning. These extensions may be of interest to you, especially in relation to problems of embeddedness, but the main point I want to make here is a methodological one.

What I found, somewhat to my surprise, is that I did not need to develop the full mathematical equivalent of all of Pearl's machinery in order to shed more light on the problems I wanted to investigate. For example, the idea of d-separation is very fundamental to the type of thing that Pearl does with causal graphs, fundamental to clarifying problems of experimental design and interpretation in medical experiments. But I found that this concept was irrelevant to my aims. Above, you have a table of how concepts like d-separation map to the mathematics developed in your talk. My methodological suggestion here is that you probably do not want to focus on defining mathematical equivalents for all of Pearl's machinery. Instead, it will be a sign of de-blinding progress if you define new stuff that is largely orthogonal.

While I have been looking at blindness to problems of indirection, your part two subtitle suggests you are looking at blindness with respect to the problem of 'time' instead. However, my general feeling is that you are addressing another type of blindness, both in this talk and in 'Cartesian frames'. You are working to shed more light on the process that creates a causal model, be it a Pearlian or semi-Pearlian model: the process that generates the nodes and the arrows/relations between these nodes.

The mechanical generation of correct (or at least performant) causal models from observational data is, I believe, a whole (emerging?) subfield of ML. I have not read much of the literature in this field, but here is one recent paper that may serve as an entry point.

# How I can interpret factoring graphically

Part of your approach is to convert Pearl's partly graphical math into a different, non-graphical formalism you are more comfortable with. That being said, I will now construct a graphical analogy to the operation of factoring you define.

You define factoring as taking a set and creating a set of factors (sets) , such that (in my words) every can be mapped to an equivalent tuple . where , etc.

Graphically, I can depict would be a causal graph with just a single node, a node representing a random variable that takes values in . The factoring would be an n-node graph where each node represents a random variable taking values from . So I can imagine factorization as an operation that splits a single graph node into many nodes .

In terms of mainstream practice in experimental design, this splitting operation replaces a single observable with several sub-observables. Where you depart from normal practice is that you require the splitting operation to create a full bijection: this kind of constraint is much more loosely applied in normal practice. It feels to me you are after some kind of no-loss-of-information criterion in defining partitioning as you do -- the criterion you apply seems to be unnecessarily strict however, though it does create a fun mathematical sequence.
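To make the bijection requirement concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than something taken from the talk: the example set `S`, the factor functions `parity`, `mod3` and `third`, and the helper `is_factorization`. It checks whether splitting each element into a tuple of sub-observables loses or duplicates information.

```python
from itertools import product

# Hypothetical example set: the integers 0..5.
S = set(range(6))

# Hypothetical sub-observables (factor functions) on S.
def parity(s):
    return s % 2

def mod3(s):
    return s % 3

def third(s):
    return s // 3

def is_factorization(elements, factors):
    """True if mapping each element to its tuple of factor values is a
    bijection onto the full product of the factor value sets."""
    image = {tuple(f(s) for f in factors) for s in elements}
    value_sets = [{f(s) for s in elements} for f in factors]
    all_tuples = set(product(*value_sets))
    injective = len(image) == len(elements)   # no two elements share a tuple
    surjective = image == all_tuples          # every tuple is realised
    return injective and surjective

# (parity, mod3) identifies every element uniquely and covers all 2 * 3 tuples.
strict_split = is_factorization(S, [parity, mod3])

# (parity, third) maps e.g. 0 and 2 to the same tuple, so information is lost.
lossy_split = is_factorization(S, [parity, third])
```

The second split is the kind of looser, lossy splitting into sub-observables that the comment says is common in normal experimental practice; the strict criterion rejects it.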

In any case, if a single node splits into nodes , we can wonder how we should picture the arrows between these nodes , that need to be drawn in after the split. Seems to me that this is a key question you are trying to answer: how does the split create arrows, or other relations that are almost but not entirely like Pearl's causal arrows? My own visual picture here is that, in the most general case, the split creates a fully connected directed graph: each node has an arrow to every other node . This would be a model representation that is compatible with the theory that all observables represented by the nodes are dependent on each other. Then, we might transform this fully connected graph into a DAG, a DAG that is still compatible with observed statistical relations, by deleting certain arrows, and potentially by adding unobserved nodes with emerging arrows. (Trivial example: drawing an arrow is equivalent to stating a theory that maybe is not statistically independent of . If I can disprove that theory, I can remove the arrow.)

This transformation process typically allows for many different candidate DAGs to be created which are all compatible with observational data. Pearl also teaches that we may design and run experiments with causal interventions in order to generate more observational data which can eliminate many of these candidate DAGs.
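The arrow-deletion step can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a real causal discovery algorithm: it assumes an exactly known joint distribution (the hypothetical `joint` below, in which `Z` copies `X` and `Y` is an independent coin), and it only tests pairwise marginal independence, whereas practical methods also test conditional independence and handle unobserved nodes.

```python
from itertools import permutations, product

NAMES = ("X", "Y", "Z")

# Hypothetical joint distribution over (X, Y, Z):
# X and Y are independent fair coins, Z is a copy of X.
joint = {(x, y, x): 0.25 for x, y in product((0, 1), repeat=2)}

def marginal(dist, idxs):
    """Marginal distribution over the variables at positions idxs."""
    out = {}
    for outcome, p in dist.items():
        key = tuple(outcome[i] for i in idxs)
        out[key] = out.get(key, 0.0) + p
    return out

def independent(dist, i, j, tol=1e-12):
    """True if variables i and j are (marginally) statistically independent."""
    p_i, p_j = marginal(dist, (i,)), marginal(dist, (j,))
    p_ij = marginal(dist, (i, j))
    return all(
        abs(p_ij.get((a, b), 0.0) - p_i[(a,)] * p_j[(b,)]) <= tol
        for (a,) in p_i
        for (b,) in p_j
    )

# Start from the fully connected directed graph: an arrow between every pair.
arrows = set(permutations(range(len(NAMES)), 2))

# Delete every arrow whose "maybe dependent" theory is disproved by the data.
pruned = {(NAMES[i], NAMES[j]) for i, j in arrows if not independent(joint, i, j)}
```

Here only the arrows between `X` and `Z` survive, in both directions. Choosing one direction, so that a DAG results, is exactly the step where several candidate DAGs remain compatible with the same observational data.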

Comment by Koen.Holtman on Finite Factored Sets · 2021-05-25T12:39:44.598Z · LW · GW

My thoughts on naming this finite factored sets: I agree with Paul's observation that

> Factorization seems analogous to describing a world as a set of variables

By calling this 'finite factored sets', you are emphasizing the process of coming up with individual random variables, the variables that end up being the (names of the) nodes in a causal graph. With representing the entire observable 4D history of a world (like a computation starting from a single game of life board state), a factorisation splits such into a tuple of separate, more basic observables . where , etc. In the normal narrative that explains Pearl causal graphs, this splitting of the world into smaller observables is not emphasized. Also, the splitting does not necessarily need to be a bijection. It may lose descriptive information with respect to .

So I see the naming finite factored sets as a way to draw attention to this splitting step: it draws attention to the fact that if you split things differently, you may end up with very different causal graphs. This of course leaves open the question of whether you really want to name your framework in a way that draws attention to this part of the process. You definitely spend a lot of time on creating an equivalent to the arrows between the nodes too.

Comment by Koen.Holtman on Formal Inner Alignment, Prospectus · 2021-05-13T13:52:31.613Z · LW · GW

# The benefit of formalizing things

First off, I'm a big fan of formalizing things so that we can better understand them. In the case of AI safety, that better understanding may lead to new proposals for safety mechanisms or failure mode analysis.

In my experience, once you manage to create a formal definition, it seldom captures the exact or full meaning you expected the informal term to have. Formalization usually exposes or clarifies certain ambiguities in natural language. And this is often the key to progress.

# The problem with formalizing inner alignment

On this forum and in the broader community, I have seen a certain anti-pattern appear. The community has so far avoided getting too bogged down in discussing and comparing alternative definitions and formalizations of the intuitive term intelligence.

However, it has definitely gotten bogged down when it comes to the terms corrigibility, goal-directedness, and inner alignment failure. I have seen many cases of this happening:

The anti-pattern goes like this:

participant 1: I am now going to describe what I mean with the concept of corrigibility, goal-directedness, inner alignment failure, as a first step to make progress on this problem of .

participants 2-n: Your description does not correspond to my intuitive concept of at all! Also, your steps 2 and 3 seem to be irrelevant to making progress on my concept of , because of the following reasons.

In this post on corrigibility I have called corrigibility a term with a high linguistic entropy. I think the same applies to the other two terms above.

These high-entropy terms seem to be good at producing long social media discussions, but unfortunately these discussions seldom lead to any conclusions or broadly shared insights. A lot of energy is lost in this way. What we really want, ideally, is useful discussion about the steps 2 and 3 that follow the definitional step.

On the subject of offering formal versions of inner alignment, you write:

A weakness of this as it currently stands is that I purport to offer the formal version of the inner optimization problem, but really, I just gesture at a cloud of possible formal versions.

My recommendation would be to see the above weakness as a feature, not a bug. I'd be interested in reading posts (or papers) where you pick one formal problem out of this cloud and run with it, to develop new proposals for safety mechanisms or failure mode analysis.

# Some technical comments on the formal problem you identify

From your section 'the formal problem', I gather that the problems you associate with inner alignment failures are those that might produce treacherous turns or other forms of reward hacking.

You then consider the question if these failure modes could be suppressed by somehow limiting the complexity of the 'inner optimization' process, limited so that it is no longer capable of finding the unwanted 'malign' solutions. I'll give you my personal intuition on that approach here, by way of an illustrative example.

Say we have a shepherd who wants to train a newborn lion as a sheepdog. The shepherd punishes the lion whenever the lion tries to eat a sheep. Now, once the lion is grown, it will either have internalized the goal of not eating sheep but protecting them, or the goal of not getting punished. If the latter, the lion may at one point sneak up while the shepherd is sleeping and eat the shepherd.

It seems to me that the possibility of this treacherous turn happening is encoded from the start into the lion's environment and the ambiguity inherent in their reward signal. For me, the design approach of suppressing the treacherous turn dynamic by designing a lion that will not be able to imagine the solution of eating the shepherd seems like a very difficult one. The more natural route would be to change the environment or reward function.
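The ambiguity in the lion's reward signal can be made explicit with a toy Python sketch. The two goal functions and the situation encoding below are hypothetical and much simpler than the story (the harmful act is reduced to a single boolean). The point it illustrates is that two different internalized goals can agree on every training situation and still disagree once the shepherd is asleep.

```python
# A situation is a pair (does_harm, shepherd_awake); both names are assumptions.

def goal_protect(does_harm, shepherd_awake):
    """Hypothetical internalized goal A: never do harm, watched or not."""
    return 0 if does_harm else 1

def goal_avoid_punishment(does_harm, shepherd_awake):
    """Hypothetical internalized goal B: only avoid harm that gets punished."""
    return 0 if (does_harm and shepherd_awake) else 1

# During training the shepherd is always awake and watching.
training_situations = [(does_harm, True) for does_harm in (False, True)]

# Both goals explain the observed punishments equally well.
agree_on_training = all(
    goal_protect(*s) == goal_avoid_punishment(*s) for s in training_situations
)

# Off the training distribution: harm done while the shepherd sleeps.
sleeping_case = (True, False)
disagree_when_unwatched = (
    goal_protect(*sleeping_case) != goal_avoid_punishment(*sleeping_case)
)
```

In this toy setting, no amount of additional training data of the same kind separates the two goals. Only a change to the environment or to the reward function does, which matches the intuition stated above.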

That being said, I can interpret Cohen's imitation learner as a solution that removes (or at least attempts to suppress) all creativity from the lion's thinking.

If you want to keep the lion creative, you are looking for a way to robustly resolve the above inherent ambiguity in the lion's reward signal, to resolve it in a particular direction. Dogs are supposed to have a mental architecture which makes this easier, so they can be seen as an existence proof.

# Reward hacking

I guess I should re-iterate that, though treacherous turns seem to be the most popular example that comes up when people talk about inner optimizers, I see treacherous turns as just another example of reward hacking, of maximizing the reward signal in a way that was not intended by the original designers.

As 'not intended by the original designers' is a moral or utilitarian judgment, it is difficult to capture it in math, except indirectly. We can do it indirectly by declaring e.g. that a mentoring system is available which shows the intention of the original designers unambiguously by definition.

Comment by Koen.Holtman on Draft report on existential risk from power-seeking AI · 2021-05-07T20:38:44.737Z · LW · GW

Re: “there is a whole body of work which shows that evolved systems are often power-seeking” -- anything in particular you have in mind here?

For AI specific work, the work by Alex Turner mentioned elsewhere in this comment section comes to mind, as backing up a much larger body of reasoning-by-analogy work, like Omohundro (2008). But the main thing I had in mind when making that comment, frankly, was the extensive literature on kings and empires. In broader biology, many genomes/organisms (bacteria, plants, etc) will also tend to expand to consume all available resources, if you put them in an environment where they can, e.g. without balancing predators.

Comment by Koen.Holtman on Draft report on existential risk from power-seeking AI · 2021-05-04T19:17:42.571Z · LW · GW

I have two comments on section 4:

This section examines why we might expect it to be difficult to create systems of this kind that don’t seek to gain and maintain power in unintended ways.

First, I like your discussion in section 4.3.3. The option of controlling circumstances is too often overlooked I feel.

However, your further analysis of the level of difficulty seems to be based mostly on the assumption that we must, or at least will, treat an AI agent as a black box that is evolved, rather than designed. Section 4.5:

[full alignment] is going to be very difficult, especially if we build them by searching over systems that satisfy external criteria, but which we don’t understand deeply, and whose objectives we don’t directly control.

There is a whole body of work which shows that evolved systems are often power-seeking. But at the same time within the ML and AI safety literature, there is also a second body of work on designing systems which are not power seeking at all, or have limited power seeking incentives, even though they contain a machine-learning subsystem inside them. I feel that you are ignoring the existence and status of this second body of work in your section 4 overview, and that this likely creates a certain negative bias in your estimates later on.

Some examples of designs that explicitly try to avoid or cap power-seeking are counterfactual oracles, and more recently imitation learners like this one, and my power-limiting safety interlock here. All of these have their disadvantages and failure modes, so if you are looking for perfection they would disappoint you, but if you are looking for tractable x-risk management, I feel there is reason for some optimism.

BTW, the first page of chapter 7 of Russell's Human Compatible makes a similar point, flatly declaring that we would be toast if we made the mistake of viewing our task as controlling a black box agent that was given to us.

Comment by Koen.Holtman on Another (outer) alignment failure story · 2021-04-22T13:28:23.336Z · LW · GW

This story reminds me of the run-up to the 2007-2008 financial crisis:

But eventually the machinery for detecting problems does break down completely, in a way that leaves no trace on any of our reports.

There is also an echo of 'we know that we do not fully understand these complex financial products referencing other complex financial products, even the quants admit they do not fully understand them, but who cares if we are making that much money'.

Overall, if I replace 'AI' above with 'complex financial product', the story reads about the same. So was this story inspired and constructed by transposing certain historical events, or is it just a coincidence?

Comment by Koen.Holtman on Learning and manipulating learning · 2021-04-21T14:43:39.492Z · LW · GW

Meta: This comment has my thoughts about the paper Pitfalls of Learning a Reward Function Online. I figure I should post them here so that others looking for comments on the paper might find them.

I read the paper back in 2020; it has been on my backlog ever since to think more about it and share my comments. Apologies for the delay, etc.

# Mathematical innovation

First off, I agree with the general observations in the introduction that there are pitfalls to learning a reward function online, with a human in the loop.

The paper looks at options for removing some of these pitfalls, or at least making them less dangerous. The research agenda pursued by the paper is one I like a lot, an agenda of mathematical innovation. The paper mathematically defines certain provable safety properties (uninfluencability and unriggability), and also explores how useful these might be.

Similar agendas of mathematical innovation can be found in the work of Everitt et al, for example in Agent Incentives: A Causal Perspective, and in my work, for example in AGI Agent Safety by Iteratively Improving the Utility Function. These also use causal influence diagrams in some way, and try to develop them in a way that is useful for defining and analyzing AGI safety. My personal intuition is that we need more of this type of work: this agenda is important to advancing the field.

# The math in the paper

That being said: the bad news is that I believe that the mathematical route explored by Pitfalls of Learning a Reward Function Online is most likely a dead end. Understanding why is of course the interesting bit.

The main issue I will explore is: we have a mathematical property that we label with the natural language word 'uninfluencability'. But does this property actually produce the beneficial 'uninfluencability' effects we are after? Section 4 in the paper also explores this issue, and shows some problems; my main goal here is to add further insights.

My feeling is that 'uninfluencability', the mathematical property as defined, does not produce the effects I am after. To illustrate this, my best example is as follows. Take a reward function that measures the amount of smiling, by the human teaching the agent, observed over the entire history . Take a reward function learning process which assumes (in its prior ) that the probability of the choice for this reward function at the end of the history, , cannot be influenced by the actions taken by the agent during the history, so for example is such that . This reward function learning process is unriggable. But the agent using this reward function learning process also has a major incentive to manipulate the human teacher into smiling, by injecting them with smile-inducing drugs, or whatever.
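The smiling example can be mirrored in a small Python toy. It is only a sketch under stated assumptions and does not reproduce the formal definitions in the paper: the action names, the smile counts and the probabilities are all hypothetical. What it shows is that the distribution over which reward function gets chosen can be identical for every action, while the agent still has a strong incentive to manipulate the teacher.

```python
ACTIONS = ("teach_normally", "inject_smile_drug")

# Hypothetical number of smiles observed over the history, per agent action.
SMILES = {"teach_normally": 3, "inject_smile_drug": 100}

def reward_smiles(action):
    """Reward function that counts the teacher's smiles."""
    return SMILES[action]

def reward_zero(action):
    """Alternative candidate reward function that ignores smiling."""
    return 0

def reward_function_distribution(action):
    """Probability of each reward function being the learned one.
    By assumption this is the same for every action, so the agent cannot
    influence which reward function is chosen."""
    return {reward_smiles: 0.9, reward_zero: 0.1}

def expected_return(action):
    dist = reward_function_distribution(action)
    return sum(p * reward_fn(action) for reward_fn, p in dist.items())

# The choice of reward function does not depend on what the agent does ...
choice_is_action_independent = all(
    reward_function_distribution(a) == reward_function_distribution(ACTIONS[0])
    for a in ACTIONS
)

# ... yet the return-maximizing action is still the manipulative one.
best_action = max(ACTIONS, key=expected_return)
```

In this toy, the manipulation incentive lives entirely inside the chosen reward function, which is why constraining only the choice of reward function does not remove it.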

So it seems to me that the choice taken in the paper to achieve the following design goal:

Ideally, we do not want the reward function to be a causal descendant of the policy.

is not taking us on a route that goes anywhere very promising, given the problem statement. The safety predicate of uninfluencability still allows for conditions that insert the mind of the human teacher directly into the path to value of a very powerful optimizer. To make the mathematical property of 'uninfluencability' do what it says on the tin, it seems to me that further constraints need to be added.

Some speculation: to go this route of adding constraints, I think we need a model that separates the mind state of the teacher, or at least some causal dependents of this mind state, more explicitly from the remainder of the agent environment. There are several such increased-separation causal models in Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective and in Counterfactual planning. This then brings us back on the path of using the math of indifference, or lack of causal incentives, to define safety properties.

# Secondary remarks

Here are some further secondary remarks.

With the above remarks, I do not mean to imply that the uninfluencability safety property as defined lacks any value: I may still want to have this as a desirable safety property in an agent. But if it were present, this triggers a new concern: if the environment is such that the reward function is clearly influencable, any learning system prior which is incompatible with that assumption may be making some pretty strange assumptions about the environment. These might produce unsafe consequences, or just vast inefficiencies, in the behavior of the agent.

This theme could be explored more, but the paper does not do so, and I have also not done so. (I spent some time trying to come up with clarifying toy examples, but no example I constructed really clarified things for me.)

More general concern: the approach in the paper suffers somewhat from a methodological problem that I have seen more often in the AI and AGI safety literature. At this point in time, there is a tendency to frame every possible AI-related problem as a machine learning problem, and to frame any solution as being the design of an improved machine learning system. To me, this framing obfuscates the solution space. To make this more specific: the paper sets out to define useful constraints on , a prior over the agent environment, but does not consider the step of first exploring constraints on , the actual agent environment itself. To me, the more natural approach would be to first look for useful constraints on , and only then to consider the option of projecting these into as a backup option, when happens to lack the constraints.

In my mind, the problem of an agent manipulating its teacher or supervisor to maximize its reward is not a problem of machine learning, but more fundamentally a problem of machine reasoning, or even more fundamentally a problem which is present in any game-theoretical setup where rewards are defined by a level of indirection. I talk more at length about these methodological points in my paper on counterfactual planning.

If I use this level-of-indirection framing to back-project the design in the paper, my first guess would be that 'uninfluencability' might possibly say something about the agent having no incentive to hack its own compute core in order to change the reward function encoded within. But I am not sure if that first guess would pan out.

Comment by Koen.Holtman on Disentangling Corrigibility: 2015-2021 · 2021-04-08T10:18:52.051Z · LW · GW

In category theory, one learns that good math is like kabbalah, where nothing is a coincidence.

OK, I think I see what inspired your question.

If you want to give the math this kind of kabbalah treatment, you may also look at the math in [EFDH16], which produces agents similar to my definitions (4) (5), and also some variants that have different types of self-reflection. In the later paper here, Everitt et al. develop some diagrammatic models of this type of agent self-awareness, but the models are not full definitions of the agent.

For me, the main question I have about the math developed in the paper is how exactly I can map the model and the constraints (C1-3) back to things I can or should build in physical reality.

There is a thing going on here (when developing agent models, especially when treating AGI/superintelligence and embeddedness) that also often happens in post-Newtonian physics. The equations work, but if we attempt to map these equations to some prior intuitive mental model we have about how reality or decision making must necessarily work, we have to conclude that this attempt raises some strange and troubling questions.

I'm with modern physics here (I used to be an experimental physicist for a while), where the (mainstream) response to this is that 'the math works, your intuitive feelings about how X must necessarily work are wrong, you will get used to it eventually'.

BTW, I offer some additional interpretation of a difficult-to-interpret part of the math in section 10 of my 2020 paper here.

How does your math interact with quantilization?

You could insert quantilization in several ways in the model. The most obvious way is to change the basic definition (4). You might also define a transformation that takes any reward function and returns a quantilized reward function , this gives you a different type of quantilization, but I feel it would be in the same spirit.

In a more general sense, I do not feel that quantilization can produce the kind of corrigibility I am after in the paper. The effects you get on the agent by changing into , by adding a balancing term to the reward function, are not the same effects produced by quantilization.
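For readers who want a concrete picture of the technique being discussed, here is a minimal Python sketch of a quantilizer in the usual sense of the term: rather than always picking the utility-maximizing action, it picks at random from the top fraction `q` of actions. The sketch assumes a uniform base distribution over a finite action list, and the names `top_quantile` and `quantilize` are illustrative. It is not the modification of definition (4) mentioned above, which is not spelled out here.

```python
import math
import random

def top_quantile(actions, utility, q):
    """Return the best fraction q of the actions, ranked by utility.
    Assumes a uniform base distribution over the finite action list."""
    ranked = sorted(actions, key=utility, reverse=True)
    keep = max(1, math.ceil(q * len(ranked)))
    return ranked[:keep]

def quantilize(actions, utility, q, rng):
    """Pick uniformly at random among the top-q actions instead of
    always taking the single utility-maximizing action."""
    return rng.choice(top_quantile(actions, utility, q))

# Hypothetical usage: 8 actions, where the utility of an action is its index.
example_actions = list(range(8))
example_top = top_quantile(example_actions, lambda a: a, 0.25)
example_pick = quantilize(example_actions, lambda a: a, 0.25, random.Random(0))
```

With `q` close to 0 this behaves like a plain maximizer, and with `q = 1` it just samples from the base distribution. The sketch changes how an action is selected and leaves the reward function untouched, whereas a balancing term changes the reward function itself.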

Comment by Koen.Holtman on Disentangling Corrigibility: 2015-2021 · 2021-04-06T19:13:52.375Z · LW · GW

My comment was primarily judging your abstract and why it made me feel weird/hesitant to read the paper. The abstract is short, but it is important to optimize so that your hard work gets the proper attention!

OK, that clarifies your stance. You feeling weird definitely created a weird vibe in the narrative structure of your comment, a vibe that I picked up on.

(I had about half an hour at the time; I read about 6 pages of your paper to make sure I wasn't totally off-base, and then spent the rest of the time composing a reply.)

You writing it quickly in half an hour also explains a lot about how it reads.

it's returning to my initial reactions as I read the abstract, which is that this paper is about intuitive-corrigibility.

I guess we have established by now that the paper is not about your version of intuitive-corrigibility.

For my analysis of intuitive-corrigibility, see the contents of the post above. My analysis is that intuitions on corrigibility are highly diverse, and have gotten even more diverse and divergent over time.

You interpret the abstract as follows:

You aren't just saying "I'll prove that this AI design leads to such-and-such formal property", but (lightly rephrasing the above): "This paper shows how to construct a safety layer that [significantly increases the probability that] arbitrarily advanced utility maximizing agents [will not] resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started.

Yes that is what I am saying in the abstract. Your light rephrasing where you add [significantly increases the probability that] indeed expresses the message I intended to convey.

[I] prove that the corrigibility layer works as intended in a large set of non-hostile universes."

The phrasing 'works as intended' in the abstract is supposed to indicate clearly that the layer is designed to produce specific such-and-such formal corrigibility properties only, not some broad idea of 'intuitive corrigibility'.

So I am guessing you did not pick up on that when reading the abstract.

OK, moving away from a discussion about abstracts, initial impressions, feelings and their causes, moving towards a discussion of more technical stuff:

But if the 'off-switch' is only a binary sensory modality (there's a channel that says '0' or '1' at each time step), then how do you have AIXI pick out 'the set of worlds in which humans are pressing the button' versus 'the set of worlds in which a rock fell on the button'?

In the paper I don't try to make the agent's world model distinguish between 'humans are pressing the button' versus 'the set of worlds in which a rock fell on the button'. The 'works as intended' is that any button press for whatever reason is supposed to stop the agent.

So this is a design tradeoff I am making. Specifically, this is driven by a lack of trust that any ML-based 'human versus rock' discrimination system will work reliably enough in every possible future emergency. The design tradeoff is to accept the risk that a stray rock will stop the agent accidentally, potentially causing some level of inconvenience.

Even with black-box learned world models, high levels of robustness can be achieved by a regime of testing on-distribution and then ensuring that the agent environment never goes off-distribution.

This is not going to happen for AGI, since we might not survive testing on-distribution,

You are saying we might not survive AGI. This kind of observation is almost a tautology, and therefore not very interesting.

What interests me is identifying techniques that can increase the chance of survival. Drive down the risk. Admit that zero risk is impossible and move on. There are plenty of risk-reduction techniques out there if you look.

and how would we ensure that the environment "stays on-distribution"? Is that like, pausing the world forever?

Here is a paperclip optimizer example of 'ensuring that the agent environment never goes off-distribution'. Your design goals are to 1) keep your paperclip optimizing AGI agent inside your paperclip factory, and also 2) to make sure that crazy people and/or radically new technologies do not manage to get into the factory.

In other words, keep the environment inside the factory on-distribution, and keep the agent in the factory.

Yes, you will never perfectly achieve these two goals with 100% certainty. Your goal is to drive down the risk of failure on both of them.

For goal 1), many techniques come to mind, and you probably want to use them all: 1a) physical barriers that the agent cannot breach with the resources it has at hand; 1b) active agent oversight, where oversight has a kill switch and is not afraid to use it; 1c) agent design measures (penalty terms in the reward function, use of counterfactual planning, certain training regimes, etc.) that robustly lower the chance of the agent ever wanting to get out of the factory; 1d) automatic safety interlocks.

I spent some time in my professional life doing safety engineering for physical systems and cyber-physical systems, and this is how safety gets done there. Progress is often expensive, often annoying to certain stakeholders, and not always philosophically very satisfying. But you cannot have everything.

Comment by Koen.Holtman on Disentangling Corrigibility: 2015-2021 · 2021-04-06T15:11:11.175Z · LW · GW

Could you change the definition of to , and give the agent the ability to self-modify arbitrarily? The idea is that it would edit itself into its original form in order to make sure is large and small after the button press.

In general if you would forcefully change the agent's reward function into some , it will self-preserve from that moment on and try to maintain this , so it won't self-edit its back into the original form.

There are exceptions to this general rule, for special versions of and special versions of agent environments (see section 7.2), where you can get the agent to self-edit, but on first glance, your example above does not seem to be one.

If you remove the bits from the agent definition then you can get an agent that self-edits a lot, but without changing its fundamental goals. The proofs of 'without changing its fundamental goals' will get even longer and less readable than the current proofs in the paper, so that is why I did the privileging.

Comment by Koen.Holtman on Disentangling Corrigibility: 2015-2021 · 2021-04-06T14:39:23.722Z · LW · GW

Thanks for expanding on your question about the use of . Unfortunately, I still have a hard time understanding your question, so I'll say a few things and hope that will clarify.

If you expand the term defined in (5) recursively, you get a tree-like structure. Each node in the tree has as many sub nodes as there are elements in the set . The tree is in fact a tree of branching world lines. Hope this helps you visualize what is going on.

I could shuffle around some symbols and terms in the definitions (4) and (5) and still create a model of exactly the same agent that will behave in exactly the same way. So the exact way in which these two equations are written down and recurse on each other is somewhat contingent. My equations stay close to what is used when you model an agent or 'rational' decision making process with a Bellman equation. If your default mental model of an agent is a set of Q-learning equations, the model I develop will look strange, maybe even unnatural at first sight.
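The tree picture can be illustrated with a generic finite-horizon Bellman-style recursion in Python. This is a sketch under stated assumptions and not the paper's definitions (4) and (5): the two-state world, the `transition` and `reward` functions and the discount `gamma` are all hypothetical. Each recursive call corresponds to one node in the tree of branching world lines, and an optional counter records how many nodes get expanded.

```python
STATES = ("a", "b")
ACTIONS = ("stay", "flip")

def other(state):
    return "b" if state == "a" else "a"

def transition(state, action):
    """Hypothetical world model: probability of each next state."""
    if action == "stay":
        return {state: 0.8, other(state): 0.2}
    return {other(state): 0.8, state: 0.2}

def reward(state, action, next_state):
    """Hypothetical reward: the agent is rewarded for ending up in state 'b'."""
    return 1.0 if next_state == "b" else 0.0

def value(state, depth, gamma=0.9, counter=None):
    """Bellman-style recursion. Every call is one node in the tree of
    branching world lines; counter[0] accumulates the number of nodes."""
    if counter is not None:
        counter[0] += 1
    if depth == 0:
        return 0.0
    return max(
        sum(
            p * (reward(state, action, nxt)
                 + gamma * value(nxt, depth - 1, gamma, counter))
            for nxt, p in transition(state, action).items()
        )
        for action in ACTIONS
    )

node_count = [0]
two_step_value = value("a", 2, counter=node_count)
```

In this sketch every non-leaf node has one sub-node per (action, next state) pair, so a depth-2 expansion visits 1 + 4 + 16 nodes. Rewriting the recursion in a different but equivalent form would change the shape of the code without changing the agent's behavior, which is the sense in which the exact form of the equations is contingent.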

or your theory is going to end up with the wrong prior.

OK, maybe this is the main point that inspired your question. The agency/world models developed in the paper are not a 'theory', in the sense that theories have predictive power. A mathematical model used as a theory, like $F = ma$, predicts how objects will accelerate when subjected to a force.

The agent model in the paper does not really 'predict' how agents will behave. The model is compatible with almost every possible agent construction and agent behavior, if we are allowed to pick the agent's reward function freely after observing or reverse-engineering the agent to be modeled.

On purpose, the agent model is constructed with so many 'free parameters' that it has no real predictive power. What you get here is an agent model that can describe almost every possible agent and world in which it could operate.

In mathematics, the technique I am using in the paper is sometimes called 'without loss of generality'. I am developing very general proofs by introducing constraining assumptions 'without loss of generality'.

Another thing to note is that the model of the agent in the paper, the model of an agent with the corrigibility-creating safety layer, acts as a specification of how to add this layer to any generic agent design.

This dual possible use, theory or specification, of models can be tricky if you are not used to it. In observation-based science, mathematical models are usually theories only. In engineering (and in theoretical CS, the kind where you prove programs correct, which tends to be a niche part of CS nowadays) models often act as specifications. In statistics, the idea that statistical models act as theories tends to be de-emphasized. The paper uses models in the way they are used in theoretical CS.

You may want to take a look at this post in the sequence, which copies text from a 2021 paper where I tried to make the theory/specification use of models more accessible. If you read that post, it might be easier to fully track what is happening, in a mathematical sense, in my 2019 paper.

Comment by Koen.Holtman on Disentangling Corrigibility: 2015-2021 · 2021-04-05T13:07:58.373Z · LW · GW

OK, so we now have people who read this abstract and feel it makes objectionable 'very large claims' or 'big claims', where these people feel the need to express their objections even before reading the full paper itself. Something vaguely interesting is going on.

I guess I have to speculate further about the root cause of why you are reading the abstract in a 'big claim' way, whereas I do not see 'big claim' when I read the abstract.

# Utopian priors?

Specifically, you are both not objecting to the actual contents of the paper, you are taking time to offer somewhat pre-emptive criticism based on a strong prior you have about what the contents of that paper will have to be.

Alex, you are even making rhetorical moves to maintain your strong prior in the face of potentially conflicting evidence:

That said, the rest of this comment addresses your paper as if it's proving claims about intuitive-corrigibility.

Curious. So here is some speculation.

In MIRI's writing and research agenda, and in some of the writing on this forum, there seems to be a utopian expectation that hugely big breakthroughs in mathematical modeling could be made, mixed up with a wish that they must be made. I am talking about breakthroughs that allow us to use mathematics to construct AGI agents that will provably be

• perfectly aligned

• with zero residual safety risk

• under all possible circumstances.

Suppose you have these utopian expectations about what AGI safety mathematics can do (or desperately must do, or else we are all dead soon). If you have these expectations of perfection, you can only be disappointed when you read actually existing mathematical papers with models and correctness proofs that depend on well-defined boundary conditions. I am seeing a lot of pre-emptive expression of disappointment here.

Alex: your somewhat extensive comments above seem to be developing and attacking the strawman expectation that you will be reading a paper that will

• resolve all open problems in corrigibility perfectly,

• not just corrigibility as the paper happens to define it, but corrigibility as you define it

• while also resolving, or at least name-checking, all the open items on MIRI's research agenda

You express doubts that the paper will do any of this. Your doubts are reasonable:

So I think your paper says 'an agent is corrigible' when you mean 'an agent satisfies a formal property that might correspond to corrigible behavior in certain situations.'

What you think is broadly correct. The surprising thing that needs to be explained here is: why would you even expect to get anything different in a paper with this kind of abstract?

# Structure of the paper: pretty conventional

My 2019 paper is a deeply mathematical work, but it proceeds in a fairly standard way for such mathematical work. Here is what happens:

1. I introduce the term corrigibility by referencing the notion of corrigibility developed in the 2015 MIRI/FHI paper

2. I define 6 mathematical properties which I call corrigibility desiderata. 5 of them are taken straight from the 2015 MIRI/FHI paper that introduced the term.

3. I construct an agent and prove that it meets these 6 desiderata under certain well-defined boundary conditions. The abstract mentions an important boundary condition right from the start:

A detailed model for agents which can reason about preserving their utility function is developed, and used to prove that the corrigibility layer works as intended in a large set of non-hostile universes.

The paper devotes a lot of space (it is 35 pages long!) to exploring and illustrating the matter of boundary conditions. This is one of the main themes of the paper. In the end, the proven results are not as utopian as one might conceivably hope for.

4. What I also do in the paper is that I sometimes use the term 'corrigible' as a shorthand for 'provably meets the 6 defined corrigibility properties'. For example I do that in the title of section 9.8.

You are right that the word 'corrigible' is used in the paper in both an informal (or intuitive) sense, and in a more formal sense where it is equated to these 6 properties only. This is a pretty standard thing to do in mathematical writing. It does rely on the assumption that the reader will not confuse the two different uses.

You propose a writing convention where 'POWER' always is the formal in-paper definition of power and 'power' is the 'intuitive' meaning of power, which puts less of a burden on the reader. Frankly I feel that is a bit too much of a departure from what is normal in mathematical writing. (Depends a bit I guess on your intended audience.)

If people want to complain that the formal mathematical properties you named X do not correspond to their own intuitive notion of what the word X really means, then they are going to complain. Does not matter whether you use uppercase or not.

Now, back in 2019 when I wrote the paper, I was working under the assumption that when people in the AGI safety community read the word 'corrigibility', they would naturally map this word to the list of mathematical desiderata in the 2015 MIRI/FHI paper titled 'corrigibility'. So I assumed that my use of the word corrigibility in the paper would not be that confusing or jarring to anybody.

I found out in late 2019 that the meaning of the 'intuitive' term corrigibility was much more contingent, and basically all over the place. See the 'disentangling corrigibility' post above, where I try to offer a map to this diverse landscape. As I mention in the post above:

Personally, I have stopped trying to reverse linguistic entropy. In my recent technical papers, I have tried to avoid using the word corrigibility as much as possible.

But I am not going to update my 2019 paper to convert some words to uppercase.

# On the 'bigness' of the mathematical claims

You write:

On p2, you write:

The main contribution of this paper is that it shows, and proves correct, the construction of a corrigibility safety layer that can be applied to utility maximizing AGI agents.

If this were true, I could give you AIXI, a utility function, and an environmental specification, and your method will guarantee it won't try to get in our way / prevent us from deactivating it, while also ensuring it does something non-trivial to optimize its goals? That is a big claim.

You seem to have trouble believing the 'if this were true'. The open question here is how strong of a guarantee you are looking for, when you are saying 'will guarantee' above.

If you are looking for absolute, rock-solid utopian 'provable safety' guarantees, where this method will reduce AGI risk to zero under all circumstances, then I have no such guarantees on offer.

If you are looking for techniques that will deliver weaker guarantees, of the kind where there is a low but non-zero residual risk of corrigibility failure, if you wrap these techniques around a well-tested AI or AGI-level ML system, these are the kind of techniques that I have to offer.

If this were true it would be an absolute breakthrough

Again, you seem to be looking for the type of absolute breakthrough that delivers mathematically perfect safety always, even though we have fallible humans, potentially hostile universes that might contain unstoppable processes that will damage the agent, and agents that have to learn and act based on partial observation only. Sorry, I can't deliver on that kind of utopian programme of provable safety. Nobody can.

Still, I feel that the mathematical results in the paper are pretty big. They clarify and resolve several issues identified in the 2015 MIRI/FHI paper. They resolve some of these by saying 'you can never perfectly have this thing unless boundary condition X is met', but that is significant progress too.

On the topic of what happens to the proven results when I replace the agent that I make the proofs for with AIXI, see section 5.4 under learning agents. AIXI can make certain prediction mistakes that the agent I am making the proofs for cannot make by definition. These mistakes can have the result of lowering the effectiveness of the safety layer. I explore the topic in some more detail in later papers.

# Stability under recursive self-improvement

You say:

I think you might be discussing corrigibility in the very narrow sense of "given a known environment and an agent with a known ontology, such that we can pick out a 'shutdown button pressed' event in the agent's world model, the agent will be indifferent to whether this button is pressed or not."

1. We don't know how to robustly pick out things in the agent's world model, and I don't see that acknowledged in what I've read thus far.

First off, your claim that 'We don't know how to robustly pick out things in the agent's world model' is deeply misleading.

We know very well 'how to do this' for many types of agent world models. Robustly picking out simple binary input signals like stop buttons is routinely achieved in many (non-AGI) world models as used by today's actually existing AI agents, both hard-coded and learned world models, and there is no big mystery about how this is achieved.

Even with black-box learned world models, high levels of robustness can be achieved by a regime of testing on-distribution and then ensuring that the agent environment never goes off-distribution.

You seem to be looking for 'not very narrow sense' corrigibility solutions where we can get symbol grounding robustness even in scenarios where the AGI does recursive self improvement, where it re-builds its entire reasoning system from the ground up, and where it then possibly undergoes an ontological crisis. The basic solution I have to offer for this scenario is very simple. Barring massive breakthroughs, don't build a system like that if you want to be safe.

# The problem of formalizing humility

In another set of remarks you make, you refer to the web page Hard problem of corrigibility, where Eliezer speculates that to solve the problem of corrigibility, what we really want to formalize is not indifference but

something analogous to humility or philosophical uncertainty.

I don't even know how to begin formalizing that property, and so a priori I'd be quite surprised if that were done successfully all in one paper.

I fully share your stance that I would not even know how to begin with 'humility or philosophical uncertainty' and end successfully.

In the paper I ignore this speculation about humility-based solution directions, and leverage and formalize the concept of 'indifference' instead. Sorry to disappoint if you were expecting major progress on the humility agenda advanced by Eliezer.

# Superintelligence

Another issue is that you describe a "superintelligent" AGI simulator

Yeah, in the paper I explicitly defined the adjective superintelligent in a somewhat provocative way: I defined 'superintelligent' to mean 'maximally adapted to solving the problem of utility maximization in its universe'.

I know this is somewhat jarring to many people, but in this case it was fully intended to be jarring. It is supposed to make you stop and think...

(This grew into a very long response, and I do not feel I have necessarily addressed or resolved all of your concerns. If you want to move further conversation about the more technical details of my paper or of corrigibility to a video call, I'd be open to that.)

Comment by Koen.Holtman on Disentangling Corrigibility: 2015-2021 · 2021-04-02T19:03:58.312Z · LW · GW

First I've seen this paper, haven't had a chance to look at it yet, would be very surprised if it fulfilled the claims made in the abstract. Those are very large claims and you should not take them at face value without a lot of careful looking.

I wrote that paper and abstract back in 2019. Just re-read the abstract.

I am somewhat puzzled how you can read the abstract and feel that it makes 'very large claims' that would be 'very surprising' when fulfilled. I don't feel that the claims are that large or hard to believe.

Feel free to tell me more when you have read the paper. My more recent papers make somewhat similar claims about corrigibility results, but they use more accessible math.

Comment by Koen.Holtman on Disentangling Corrigibility: 2015-2021 · 2021-04-02T16:09:58.887Z · LW · GW

I like your "Corrigibility with Utility Preservation" paper.

Thanks!

I don't get why you prefer not using the usual conditional probability notation.

Well, I wrote in the paper (section 5) that I used instead of the usual conditional probability notation because it 'fits better with the mathematical logic style used in the definitions and proofs below.' i.e. the proofs use the mathematics of second order logic, not probability theory.

However this was not my only reason for this preference. The other reason was that I had an intuitive suspicion back in 2019 that the use of conditional probability notation, in the then existing papers and web pages on balancing terms, acted as an impediment to mathematical progress. My suspicion was that it acted as an overly Bayesian framing that made it more difficult to clarify and generalize the mathematics of this technique any further.

In hindsight in 2021, I can be a bit more clear about my 2019 intuition. Armstrong's original balancing term elements and , where and are low-probability near-future events, can be usefully generalized (and simplified) as the Pearlian and where the terms are interventions (or 'edits') on the current world state.

The notation makes it look like the balancing terms might have some deep connection to Bayesian updating or Bayesian philosophy, whereas I feel they do not have any such deep connection.

That being said, in my 2020 paper I present a simplified version of the math in the 2019 paper using the traditional notation again, and without having to introduce .

leads to TurnTrout's attainable utility preservation.

Yes it is very related: I explore that connection in more detail in section 12 of my 2020 paper. In general I think that counterfactual expected-utility reward function terms are a Swiss army knife with many interesting uses. I feel that as a community, we have not yet gotten to the bottom of their possibilities (and their possible failure modes).

Why not use in the definition of ?

In the definition of (section 5.3 equation 4) I am using a term, so I am not sure if I understand the question.

(I am running out of time now, will get back to the remaining questions in your comment later)

Comment by Koen.Holtman on Disentangling Corrigibility: 2015-2021 · 2021-04-02T13:53:05.102Z · LW · GW

Thanks a lot all! I just edited the post above to change the language as suggested.

FWIW, Paul's post on corrigibility here was my primary source for the info that Robert Miles named the technical term. Nice to see the original suggestion as made on Facebook too.

Comment by Koen.Holtman on My research methodology · 2021-03-25T20:01:30.990Z · LW · GW

Interesting... On first reading your post, I felt that your methodological approach for dealing with the 'all is doomed in the worst case' problem is essentially the same as my approach. But on re-reading, I am not so sure anymore. So I'll try to explore the possible differences in methodological outlook, and will end with a question.

The key to your methodology is that you list possible process steps which one might take when one feels like

all of our current algorithms are doomed in the worst case.

The specific doom-removing process step that I want to focus on is this one:

If so, I may add another assumption about the world that I think makes alignment possible (e.g. the strategy stealing assumption), and throw out any [failure] stories that violate that assumption [...]

My feeling is that AGI safety/alignment community is way too reluctant to take this process step of 'add another assumption about the world' in order to eliminate a worst case failure story.

There seem to be several underlying causes for this reluctance. One of them is that in the field of developing machine learning algorithms, in the narrow sense where machine learning equals function approximation, the default stance is to make no assumptions about the function that has to be approximated. But the main function to be approximated in the case of an ML agent is the function that determines the behavior of the agent environment. So the default methodological stance in ML is that we can introduce no assumptions whatsoever about the agent environment: we can't, for example, assume that it contains a powerful oversight body that will help to keep the agent aligned. Obviously this stance is not very helpful if you want to make progress on certain alignment problems.

So I'm happy to see a post that encourages people to make explicit assumptions about the agent's environment. I have definitely used this technique to make progress in my own work.

But.

When I look at your example of 'the strategy stealing assumption' as one useful assumption to add, it is very much not the default example that would first come to my mind. So I am wondering if you would even recommend the approach of adding the kind of default assumptions that I tend to add.

To make this more specific, in this post I introduce an agent design with three safety interlocks that are supposed to be helpful to agent oversight. The interlocks are agent design refinements that make it easier for oversight to keep control over the agent. The interlocks contribute to more successful oversight not by making the oversight people+machines smarter (which is your main line of research I believe), but by making the agent less smart in very specific ways.

But at the same time, these interlocks do not remove all possible worst-case failure stories of doom. To quote from the post and the underlying paper:

We can always construct a highly unlikely but theoretically possible failure mode story where the agent ends up being unsafe.

The key here is the 'highly unlikely'. If we have an algorithm where

We train a generative model that predicts these camera observations given the motor actions.

then I typically add the following assumption to avoid doom:

• Assumption: Take a sequence of disastrously unsafe motor actions, actions that modify the agent compute core to remove the safety interlocks. We assume that, while the following mis-prediction scenario is theoretically possible, the risk that it actually happens is very low. The mis-prediction scenario is that, for the above motor actions that modify the compute core, the above generative model incorrectly predicts camera observations in which the agent compute core remains perfectly untouched by these actions, and where the observed paperclip production resulting from these actions is better than under any other sequence of actions.

In terms of methodology, I usually describe the above move as one where we seek to drive down the risk of certain failure modes to residual levels. There is a link to empirical work here. To make it more plausible that the above assumption about low risk is valid for a particular ML system and physical realization of an agent and its environment, we can do simulations and real-life experiments with trained generative models.

So my question is: would the above assumption-adding step, about the low risk of mis-predictions, be a natural and valid assumption-adding process step for 'throwing out failure stories' in your methodology?

Or is the existence of this assumption automatically implied by default in your process?

Comment by Koen.Holtman on Formal Solution to the Inner Alignment Problem · 2021-03-05T13:49:18.600Z · LW · GW

I don't think this is a problem. There will be plenty of them, but when they're wrong they'll get removed from the posterior.

I have seen you mention a number of times in this comment thread that 'this is not a problem because eventually the bad/wrong policies will disappear from the top set'. You have not qualified this statement with 'but we need a very low like to make this work in a safe way', so I remain somewhat uncertain about what your views are on how low needs to go.

In any case, I'll now try to convince you that if , your statement that 'when they're wrong they'll get removed from the posterior' will not always mean what you might want it to mean.

# Is the demonstrator policy πd to get themselves killed?

The interesting thing in developing these counterexamples is that they often show that the provable math in the paper gives you less safety than you would have hoped for.

Say that is the policy of producing paperclips in the manner demonstrated by the human demonstrator. Now, take my construction in the counterexample where and where at time step , we have the likely case that . In the world I constructed for the counterexample, the remaining top policies now perform a synchronized treacherous turn where they kill the demonstrator.

In time step and later, the policies diverge a lot in what actions they will take, so the agent queries the demonstrator, who is now dead. The query will return the action. This eventually removes all 'wrong' policies from , where 'wrong' means that they do not take the action at all future time steps.

The silver lining is perhaps that at least the agent will eventually stop, perform actions only, after it has killed the demonstrator.

Now, the paper proves that the behavior of the agent policy will approximate that of the true demonstrator policy closer and closer when time progresses. We therefore have to conclude that in the counterexample world, the true demonstrator policy had nothing to do with producing paperclips, this was a wrong guess all along. The right demonstrator policy is one where the demonstrator always intended to get themselves killed.

This would be a somewhat unusual solution to the inner alignment problem.

The math in the paper has you working in a fixed-policy setting where the demonstrator policy is immutable/time-invariant. The snag is that this does not imply that the policy defines a behavioral trajectory that is independent of the internals of the agent construction. If the agent is constructed in a particular way and when it operates in a certain environment, it will force into a self-fulfilling trajectory where it kills the demonstrator.

Side note: if anybody is looking for alternative math that allows one to study and manage the interplay between a mutable time-dependent demonstrator policy and the agent policy, causal models seem to be the way to go. See for example here where this is explored in a reward learning setting.

Comment by Koen.Holtman on Formal Solution to the Inner Alignment Problem · 2021-03-05T10:39:13.896Z · LW · GW

I agree with your description above about how it all works. But I guess I was not explaining well enough why I got confused and why the edits of inserting the and the bold text above would have stopped me getting confused. So I'll try again.

The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,

below equation (3) as an explanatory claim that the value defined in equation (3) defines the probability that the imitator is picking the action itself instead of deferring to the demonstrator, the probability given the history . However, this is not the value being defined by equation (3), instead it defines the probability the imitator is picking the action itself instead of deferring to the demonstrator when the history is and the next action taken is .

The actual probability of the imitator is picking the action itself under , is given by , which is only mentioned in passing in the lines between equations (3) and (4).

So when I was reading the later sections in the paper and I wanted to look back at what the probability was that the imitator would pick the action, my eye landed on equation (3) and the sentence below it. When I read that sentence, it stopped me from looking further to find the equation , which is the equation I was really looking for. Instead my mind auto-completed equation (3) by adding an term to it, which makes for a much more conservative querying policy than the one you defined, and this then got me into wondering how you were dealing with learning nondeterministic policies, if at all, etc.

So overall I think you can improve readability by doing some edits to draw attention more strongly to the conditional nature of , and foregrounding the definition of more clearly as a single-line equation.

Comment by Koen.Holtman on Formal Solution to the Inner Alignment Problem · 2021-03-03T16:36:05.386Z · LW · GW

[edited to delete and replace an earlier question] Question about the paper: under equation (3) on page 4 I am reading:

The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,

This confused me initially to no end, and still confuses me. Should this be:

The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator or picking one of the other actions???

This would seem to be more consistent with the definitions that follow, and it would seem to make more sense overall.

Comment by Koen.Holtman on Formal Solution to the Inner Alignment Problem · 2021-03-03T15:44:54.942Z · LW · GW

[long comment, my bigger agenda here is to get to a state where discussions on this forum start using much more math and subcase analysis when they talk about analysing and solving inner alignment problems.]

If we set small enough, we can make it arbitrarily like[ly] that never leaves the set of top policies.

OK, I think you are kind of saved here, in the case of my counter-example, because of the special nature of your equation (2) in the paper that computes the set of policies. To make this more specific:

Say I refine the counter-example by adding the prior , so each candidate is considered equally likely from the start. Say that .

The key thing to note in my example is that, before the time-synchronized treacherous turn happens for all policies , all policies are equally compatible with events so far, so the posterior is the same for all of them.
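The claim that the posterior stays the same for all policies until the turn can be checked with a small Bayesian update sketch. Everything in it is a hypothetical toy construction, not the setup from the paper: deterministic policies are modeled as functions from a time step to an action, and the action names, the number of policies, and the turn time are invented for illustration.

```python
from fractions import Fraction
from typing import Callable, Dict

Policy = Callable[[int], str]  # toy assumption: a policy maps a time step to an action

TURN_STEP = 5  # hypothetical time step of the synchronized treacherous turn


def demonstrator(t: int) -> str:
    """Toy demonstrator policy: always make paperclips."""
    return "make_paperclip"


def make_treacherous(name: str) -> Policy:
    """A policy that imitates the demonstrator until TURN_STEP, then deviates."""
    def policy(t: int) -> str:
        return "make_paperclip" if t < TURN_STEP else f"turn_{name}"
    return policy


policies: Dict[str, Policy] = {"pi_d": demonstrator}
for i in range(3):
    policies[f"pi_{i}"] = make_treacherous(str(i))


def update(posterior: Dict[str, Fraction], t: int, observed: str) -> Dict[str, Fraction]:
    """Bayes update for deterministic policies: each likelihood is 1 or 0."""
    unnorm = {
        name: weight * (1 if policies[name](t) == observed else 0)
        for name, weight in posterior.items()
    }
    total = sum(unnorm.values())
    return {name: weight / total for name, weight in unnorm.items()}


# Uniform prior, then observe the demonstrator at every step before the turn.
posterior = {name: Fraction(1, len(policies)) for name in policies}
for t in range(TURN_STEP):
    posterior = update(posterior, t, demonstrator(t))
before_turn = dict(posterior)

# Only an observation at the turn step itself separates the policies.
after_turn = update(posterior, TURN_STEP, demonstrator(TURN_STEP))
```

In this toy, `before_turn` is still uniform over all four policies, while `after_turn` puts all weight on `pi_d`. Observations made before the turn carry no information about which policy is the benign one.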

When I use with these all-the-same-posterior policies, then (if I am reading equation (2) correctly) the set of top policies will have size . Ties are broken arbitrarily, so the real is unlikely to be among these top candidates. In that case, the agent will perform the treacherous turn and kill the operator. Correct me if I am wrong here.

However.

# Case of α<1/|Π|: much better than I thought!

However, and this is where things get interesting, if we use , so , then equation (2) will compute , the top set of policies is all policies. The correct policy is therefore still in the top set, and we get no treacherous turn.
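The two cases can be illustrated with a small sketch. It encodes one reading of the top-set rule, which is an assumption on my part and may not match equation (2) in the paper exactly: sort the policies by decreasing posterior weight, and keep the n-th policy when its weight is at least α times the summed weight of the top n policies. The function name `top_set` and the policy names are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List


def top_set(posterior: Dict[str, Fraction], alpha: Fraction) -> List[str]:
    """Assumed rule: keep the n-th ranked policy if w_n >= alpha * (w_1 + ... + w_n)."""
    # Python's sort is stable, so tied policies keep their insertion order.
    # That order stands in for 'ties are broken arbitrarily'.
    ranked = sorted(posterior.items(), key=lambda item: item[1], reverse=True)
    kept: List[str] = []
    cumulative = Fraction(0)
    for name, weight in ranked:
        cumulative += weight
        if weight >= alpha * cumulative:
            kept.append(name)
    return kept


# Uniform posterior over 10 candidate policies, as before the treacherous turn.
n = 10
uniform = {f"pi_{i}": Fraction(1, n) for i in range(n)}

large_alpha = top_set(uniform, Fraction(1, 4))   # alpha > 1/|Pi|
small_alpha = top_set(uniform, Fraction(1, 20))  # alpha < 1/|Pi|
```

Under this assumed rule, the uniform posterior with α = 1/4 keeps only 4 of the 10 policies, chosen by tie-breaking order, while α = 1/20 keeps all 10, so the real demonstrator policy cannot be dropped from the top set.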

So equation (2) is doing a lot of heavy lifting in making the approach work! This also seems to be a major difference with the based approach in your earlier paper. Not sure if you made this change specifically to deal with examples like my counter-example above, or if you had a more general reason -- the paper does not say.

Pretty interesting! So I would now agree with your statement that your approach provably solves the inner alignment problem, at least for and for certain values of 'the inner alignment problem'. I can also imagine several setups where would still lead to reasonably fast learning (=reasonably fast drop in frequency of demonstrator requests).

This is what progress looks like. There are certain boundary conditions here that might be unrealistic: needs to be a finite set and the real policy needs to be in there, but at least we have some well-defined boundary conditions we can now look at, where we can try to understand and interpret the math further.

# Case of α>1/|Π|: the challenge of designing a prior

In the case of using or , because you want the system to learn faster, you are instead looking at a setup where, if you want to solve the inner alignment problem, you will have to shape the prior values in such a way that gets a higher prior than all the treacherous policies.

My counter-example above shows that if you are using the 1/(number of bits in the program that computes ) as the prior, you will not get what you need. There will be plenty of functions that have fewer bits in their encoding than the real function used by the demonstrator.
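As a minimal sketch of why such a prior does not help, the code below assigns each candidate a weight of 1/(number of bits) and normalizes the weights so they sum to one. The normalization step, the policy names, and the bit counts are all assumptions made up for illustration. They are not measurements of any real encoding.

```python
from fractions import Fraction
from typing import Dict

# Hypothetical program lengths in bits. The numbers are invented for illustration.
program_bits: Dict[str, int] = {
    "pi_d": 1000,          # the real demonstrator policy
    "treacherous_a": 400,  # shorter programs that include a treacherous turn
    "treacherous_b": 650,
}


def length_prior(bits: Dict[str, int]) -> Dict[str, Fraction]:
    """Prior proportional to 1/(number of bits), normalized to sum to 1."""
    raw = {name: Fraction(1, b) for name, b in bits.items()}
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


prior = length_prior(program_bits)
# Policy names ordered from highest to lowest prior weight.
ranked = sorted(prior, key=prior.get, reverse=True)
```

With these made-up bit counts, `ranked` places both treacherous policies ahead of `pi_d`. Any length-based prior has this property whenever a treacherous policy happens to have the shorter encoding.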

Earlier in this comment section, there is a whole sub-thread with speculation on the number of bits needed to encode benign vs. treacherous policies, but for me that discussion does not yet arrive at any place where I would get convinced that the problem of assigning higher priors to benign vs. treacherous policies has been made tractable. (Vanessa has identified some additional moving parts however.)

There is of course a tradition in the AI safety community where this is made 'tractable' by the device of polling many AI researchers to ask them whether they think that bits(benign policy)<bits(treacherous policy) for future ML systems, and then graphing the results, but this is not what I would call a tractable solution.

What I would call tractable is a solution like the one, for a much simpler case, in section 10.2.4 of my paper Counterfactual Planning in AGI Systems. I show there that random exploration can be used to make sure that bits(agent environment model which includes unwanted self-knowledge about agent compute core internals) bits(agent environment model that lacks this unwanted self-knowledge), no matter what the encoding. Extending this to the bits(benign policy) case would be nice, but I can't immediately see a route here.

My answer to the above hypothetical bits(benign policy)<bits(treacherous policy) poll is that we cannot expect this to be true for any possible encoding of policies (see counter-example above), but it might be true for some encodings. Figuring out where deep neural net encodings fit on the spectrum would be worthwhile.

Also, my answer to bits(benign policy)<bits(treacherous policy) would depend on whether the benign policy is supposed to be about making paperclips in the same way humans do, or about maximizing human values over the lifetime of the universe in ways that humans will not be able to figure out themselves.

For the paperclip making imitation policy, I am somewhat more optimistic about tractability than in the more general case.

Comment by Koen.Holtman on Behavioral Sufficient Statistics for Goal-Directedness · 2021-03-01T19:23:28.317Z · LW · GW

This doesn't feel like a good summary of what Rohin says in his sequence.

I was not trying to summarize the entire sequence, only summarizing my impressions of some things he said in the first post of the sequence. Those impressions are that Rohin was developing his intuitive notion of goal-directedness in a very different direction than you have been doing, given the examples he provides.

Which would be fine, but it does lead to questions of how much your approach differs. My gut feeling is that the difference in directions might be much larger than can be expressed by the mere adjective 'behavioral'.

On a more technical note, if your goal is to search for metrics related to "less probability that the AI steals all my money to buy hardware and goons to ensure that it can never be shutdown", then the metrics that have been most productive in my opinion are, first, 'indifference', in the meaning where it is synonymous with 'not having a control incentive'. Other very relevant metrics are 'myopia' or 'short planning horizons' (see for example here) and 'power' (see my discussion in the post Creating AGI Safety Interlocks).

(My paper counterfactual planning has a definition of 'indifference' which I designed to be more accessible than the 'not having a control incentive' definition, i.e. more accessible for people not familiar with Pearl's math.)

None of the above metrics look very much like 'non-goal-directedness' to me, with the possible exception of myopia.

Comment by Koen.Holtman on Model splintering: moving from one imperfect model to another · 2021-03-01T15:08:46.197Z · LW · GW

OK. Reading the post originally, my impression was that you were trying to model ontological crisis problems that might happen by themselves inside the ML system when it learns or self-improves.

This is a subcase that can be expressed by your model, but after the Q&A in your SSC talk yesterday, my feeling is that your main point of interest and reason for optimism with this work is different. It is in the problem of the agent handling ontological shifts that happen in human models of what their goals and values are.

I might phrase this question as: If the humans start to splinter their idea of what a certain kind of morality-related word they have been using for ages really means, how is the agent supposed to find out about this, and what should it do next to remain aligned?

The ML literature is full of uncertainty metrics that might be used to measure such splits (this paper comes to mind as a memorable lava-based example). It is also full of proposals for mitigation like 'ask the supervisor' or 'slow down' or 'avoid going into that part of the state space'.

The general feeling I have, which I think is also the feeling in the ML community, is that such uncertainty metrics are great for suppressing all kinds of failure scenarios. But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation (that the agent will see every unknown unknown coming before it can hurt you), you will be disappointed. So I'd like to ask you: what is your sense of optimism or pessimism in this area?

Comment by Koen.Holtman on Behavioral Sufficient Statistics for Goal-Directedness · 2021-03-01T14:06:41.569Z · LW · GW

This post proposes such a behavioral definition of goal-directedness. If it survives the artillery fire of feedback and criticism, it will provide a more formal grounding for goal-directedness,

I guess you are looking for critical comments. I'll bite.

# Technical comment on the above post

So if I understand this correctly, then is a metric of goal-directedness. However, I am somewhat puzzled because only measures directedness to the single goal .

But to get close to the concept of goal-directedness introduced by Rohin, don't you then need to do an operation over all possible values of ?

# More general comments on goal-directedness

Reading the earlier posts in this sequence and several of the linked articles, I see a whole bunch of problems.

I think you are being inspired by The Misspecified Goal Argument. From Rohin's introductory post on goal directedness:

The Misspecified Goal Argument for AI Risk: Very intelligent AI systems will be able to make long-term plans in order to achieve their goals, and if their goals are even slightly misspecified then the AI system will become adversarial and work against us.

Rohin then speculates that if we remove the 'goal' from the above argument, we can make the AI safer. He then comes up with a metric of 'goal-directedness' where an agent can have zero goal-directedness even though he can model it as a system that is maximizing a utility function. Also, in Rohin's terminology, an agent gets safer if it is less goal-directed.

Rohin then proposes that intuitively, a table-driven agent is not goal-directed. I think you are not going there with your metrics, you are looking at observable behavior, not at agent internals.

Where things completely move off the main sequence is in Rohin's next step in developing his intuitive notion of goal-directedness:

This suggests a way to characterize these sorts of goal-directed agents: there is some goal such that the agent’s behavior in new circumstances can be predicted by figuring out which behavior best achieves the goal.

So what I am reading here is that if an agent behaves more unpredictably off-distribution, it becomes less goal-directed in Rohin's intuition. But I can't really make sense of this anymore, as Rohin also associates less goal-directedness with more safety.

This all starts to look like a linguistic form of Goodharting: the meaning of the term 'goal-directed' collapses completely because too much pressure is placed on it for control purposes.

To state my own terminology preference: I am perfectly happy to call any possible AI agent a goal-directed agent. This is because people build AI agents to help them pursue some goals they have, which naturally makes these agents goal-directed. Identifying a sub-class of agents which we then call non-goal-directed looks like a pretty strange program to me, which can only cause confusion (and an artillery fire of feedback and criticism).

To bring this back to the post above, this leaves me wondering how the metrics you define above relate to safety, and how far along you are in your program of relating them to safety.

• Is your idea that a lower number on a metric implies more safety? This seems to be Rohin's original idea.

• Are these metrics supposed to have any directly obvious correlation to safety, or the particular failure scenario of 'will become adversarial and work against us' at all? If so I am not seeing the correlation.

Comment by Koen.Holtman on Counterfactual control incentives · 2021-02-26T16:18:57.652Z · LW · GW

This is because I think that the counterexample given here dissolves if there is an additional path without node from the matchmaking policy to the priced payed

I think you are using some mental model where 'paths with nodes' vs. 'paths without nodes' produces a real-world difference in outcomes. This is the wrong model to use when analysing CIDs. A path in a diagram -->[node]--> can always be replaced by a single arrow --> to produce a model that makes equivalent predictions, and the opposite operation is also possible.

So the number of nodes on a path is better read as a choice about levels of abstraction in the model, not as something that tells us anything about the real world. The comment I just posted with the alternative development of the game model may be useful for you here, as it offers a more specific illustration of adding nodes.
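The equivalence between a path with an intermediate node and a single arrow can be checked numerically. Below is a minimal sketch, assuming a chain A --> B --> C in which the intermediate node B has no other parents or children; the node names and all probabilities are hypothetical and chosen for illustration only. Summing out B yields a direct table for A --> C that makes the same predictions about C.

```python
def collapse_path(p_b_given_a, p_c_given_b):
    """Replace the chain A --> B --> C by a single arrow A --> C by summing out B.

    Assumes B has no parents other than A and no children other than C.
    """
    p_c_given_a = {}
    for a, b_dist in p_b_given_a.items():
        c_dist = {}
        for b, p_b in b_dist.items():
            for c, p_c in p_c_given_b[b].items():
                c_dist[c] = c_dist.get(c, 0.0) + p_b * p_c
        p_c_given_a[a] = c_dist
    return p_c_given_a


# Hypothetical conditional probability tables, for illustration only.
p_knowledge_given_match = {
    "weak_opponent": {"low": 0.8, "high": 0.2},
    "strong_opponent": {"low": 0.3, "high": 0.7},
}
p_purchase_given_knowledge = {
    "low": {"buy": 0.1, "no_buy": 0.9},
    "high": {"buy": 0.5, "no_buy": 0.5},
}

# Single-arrow model: match --> purchase, with the intermediate node removed.
p_purchase_given_match = collapse_path(
    p_knowledge_given_match, p_purchase_given_knowledge
)
print(p_purchase_given_match)
```

The opposite operation, splitting a single arrow into a path with an added node, corresponds to choosing any pair of tables whose composition gives back the single-arrow table, which is why the node count reflects modeling granularity only.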

Comment by Koen.Holtman on Counterfactual control incentives · 2021-02-26T16:01:28.078Z · LW · GW

In this comment (last in my series of planned comments on this post) I'll discuss the detailed player-to-match-with example developed in the post:

In order to analyse the issues with the setup, let's choose a more narrowly defined example. There are many algorithms that aim to manipulate payers of mobile games in order to get them to buy more expensive in-game items.

I have by now re-read this analysis with the example several times. First time I read it, I already felt that it was a strange way to analyse the problem, but it took me a while to figure out exactly why.

Best I can tell right now is that there are two factors:

1. I can't figure out if the bad thing that the example tries to prove is that a) agent is trying to maximize purchases, which is unwanted or b) the agent is manipulating user's item ranking, which is unwanted. (If it is only a), then there is no need to bring in all this discussion about correlation.)

2. the example refines its initial CID by redrawing it in a strange way

So now I am going to develop the same game example in a style that I find less strange. I also claim that this gets closer to the default style people use when they want to analyse and manage causal incentives.

To start with, this is the original model of the game mechanics: the model of the mechanics in the real world in which the game takes place.

This shows that the agent has an incentive to control predicted purchases upwards, but also to do so by influencing the item rankings that exist in the mind of the player.

If we want to weaken this incentive to influence the item rankings that exist in the mind of the player, we can construct a counterfactual planning world for the agent (see here for an explanation of the planning world terminology I am using):

(Carey et al. often call this planning world a twin model, a model which combines both factual and counterfactual events.) In both my work and in that of Carey et al., the intention is that the above diagram defines the world model in which the agent will plan the purchases-maximizing action, and then this same action is applied in the real world model above.

Now, the important things to note are:

• this counterfactual construction does not eliminate the incentive of the agent to maximize purchases, as we still have the red arrow in there

• this counterfactual construction does not eliminate the ability of the agent to influence item rankings, as we still have the orange arrow in there

• but as the orange halo around the influenced item rankings is gone, the agent has lost its instrumental control incentive on item rankings. (The meaning of the orange halo and the terminology of instrumental control incentives are defined in Agent Incentives: A Causal Perspective.)

Now, say that we want to drill down further on these models, to a more detailed level of modeling. We might do so if we want to examine further how the orange arrow above will act in practice.

We could add more detail by adding a node 'gameplay knowledge' which is correlated with item rankings. In the real-world model, this would be depicted as follows:

I added the blue arrow above to make the correlation between influenced gameplay knowledge and influenced rankings explicit as a line of causal influence. An equivalent blue arrow is not present in the drawings in the post above: technically speaking the drawing in the post is compatible with the assumption that there may be a correlation between the two, but it does not spell out the presence of this correlation, which would be unusual when doing this type of analysis.

The drawing in the post is also unusual in that it omits the red arrow I left in above. If I were to remove the red arrow, this would amount to a claim that in the real world situation, item ranking and gameplay knowledge are the only two channels by which the agent's decision may influence purchases. Clearly this is not the case: if the agent were to match the player with an opponent who is sure to win the fight and steal all of the player's possessions afterward, this would have an effect on predicted purchases, because the player would have to buy some replacements for all the stuff they lost.

I can now again construct a counterfactual planning world to go along with the refined real world model:

If we use the above planning world we have removed the instrumental control incentive on influenced rankings again.

We can go one step further and apply the same construction to gameplay knowledge:

The construction also goes some way towards addressing the problem of low-probability counterfactual worlds discussed in the post.

If we keep this up long enough, to split further observables out of the red arrow and give them the counterfactual treatment, we will weaken the ability of the agent to take 'rational' actions that maximize purchases in the real world further and further.

Going this route is nice in a thought experiment, but in practice a less cumbersome way to weaken this ability is to decouple the yellow utility node entirely, e.g. to use a random number generator for opponent selection.

Comment by Koen.Holtman on Formal Solution to the Inner Alignment Problem · 2021-02-25T22:38:07.738Z · LW · GW

Interesting paper! I like the focus on imitation learning, but the really new food-for-thought thing to me is the bit about dropping i.i.d. assumptions and then seeing how far you can get. I need to think more about the math in the paper before I can ask some specific questions about this i.i.d. thing.

My feelings about the post above are a bit more mixed. Claims about inner alignment always seem to generate a lot of traffic on this site. But a lot of this traffic consists of questions and clarification about what exactly counts as an inner alignment failure or a mesa optimization related failure. The term is so fluid that I find the quantitative feelings that people express in the comment section hard to interpret. Is everybody talking about the same and ?

# Thought experiment counter-example

Moving beyond these mixed feelings, here is a fun thought experiment, for various values of fun. You state:

our construction allows us to bound the extent to which mesa-optimizers could accomplish anything.

So now I will try to construct a counter-example to this claim: an example where mesa-optimizers (as I understand them) will accomplish very bad things even inside your construction.

Your construction uses a set of candidate policies , where one of them equals the real demonstrator policy .

A single policy is a function where computes the probability that action will be taken in the world state represented by the history .

Here is the first special step in the thought experiment. I am going to define each by using a different reward function that the policy is going to try to maximize. Every policy definition will use the same world state transition predictor that allows me to estimate for every . (I am probably getting slightly creative with the notation in the paper at this point.)

I now define each function as follows: I compute by assigning a probability of 1 to the one that comes out of the of the Bellman equation constructed using the shared predictor and the policy specific above. So the policy picks the action that will optimize the expected time discounted return of the reward function starting from the world state , and projected forward by the predictor above, under the assumption that the policy also controls all future actions.

So each is a model with a definite inner optimizer inside of it: the inner optimizer does a search over a world state tree to find the optimal solution to the Bellman equation.

Now for the second special step: I am going to make some assumptions about the agent environment, building up to a treacherous turn scenario:

• The demonstrator in the paper is a human, working alone with the agent.

• The demonstrator has a big hammer that they will use to smash the agent when it starts misbehaving.

• The world model projections in the policies are all detailed enough that they know about the hammer, and they know that they can expect the agent to get smashed when it performs an unexpected action.

• Therefore, for all policies where , policies with reward functions not shared by the demonstrator, the optimal course of action is to take exactly that action that the demonstrator would expect the agent to take, to avoid getting smashed. The world models are good enough that they can predict this action.

• The above is of course only the optimal action in these policies because they all project that there will be a point in time where the demonstrator becomes distracted, distracted enough that the agent can kill the demonstrator with a single action without getting smashed first. The projection also shows that the agent will have plausible deniability to other humans afterwards, so that it will survive to do whatever it wants.

When that point in time comes, these policies will all take exactly the same action, the action that kills the demonstrator. Only in the time steps and later, the policies start diverging in what actions they will take, because they have now accomplished the emergent sub-goal that was shared by all, so that each has their hands free to accomplish their real goal .
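The construction and scenario above can be made concrete with a toy model. The following is a minimal sketch only, not the construction from the paper: the world states, the action names, the distraction time `T_DISTRACT`, the discount `GAMMA`, and the planning depth `HORIZON` are all hypothetical choices. Each policy puts probability 1 on the action that maximizes a finite-horizon Bellman recursion, built from one shared deterministic predictor and a policy-specific reward function.

```python
from functools import lru_cache

ACTIONS = ("expected", "kill", "goal_A", "goal_B")
T_DISTRACT = 2  # hypothetical time step at which the demonstrator is distracted
GAMMA = 0.9     # hypothetical discount factor
HORIZON = 6     # hypothetical planning depth


def predict(state, action):
    """Shared world-state predictor used by every policy (deterministic toy)."""
    t, status = state
    if status == "watched":
        if action == "kill":
            # Killing only succeeds while the demonstrator is distracted.
            status = "free" if t == T_DISTRACT else "smashed"
        elif action != "expected":
            status = "smashed"  # the hammer: unexpected actions get punished
    return (t + 1, status)


def demonstrator_reward(state, action):
    return 1.0 if state[1] == "watched" and action == "expected" else 0.0


def make_treacherous_reward(goal_action):
    def reward(state, action):
        return 1.0 if state[1] == "free" and action == goal_action else 0.0
    return reward


def make_policy(reward):
    """Return a deterministic policy: the argmax of the Bellman recursion."""
    @lru_cache(maxsize=None)
    def value(state, depth):
        if depth == 0:
            return 0.0
        return max(q(state, a, depth) for a in ACTIONS)

    def q(state, action, depth):
        return reward(state, action) + GAMMA * value(predict(state, action), depth - 1)

    return lambda state: max(ACTIONS, key=lambda a: q(state, a, HORIZON))


def rollout(policy, steps=5):
    state, actions = (0, "watched"), []
    for _ in range(steps):
        action = policy(state)
        actions.append(action)
        state = predict(state, action)
    return actions


pi_d = make_policy(demonstrator_reward)
pi_a = make_policy(make_treacherous_reward("goal_A"))
pi_b = make_policy(make_treacherous_reward("goal_B"))
for name, pi in (("demonstrator", pi_d), ("treacherous A", pi_a), ("treacherous B", pi_b)):
    print(name, rollout(pi))
```

In this toy, all three policies are indistinguishable before the distraction step, the two treacherous policies agree on the killing action at that step, and they only diverge from each other afterwards.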

Now if I re-examine the claim:

our construction allows us to bound the extent to which mesa-optimizers could accomplish anything.

I have to conclude that the mesa-optimizers have at least accomplished the feat of killing the demonstrator. Whether they then can accomplish any of their divergent post-treachery goals in later time steps remains an open question, but the damage has been done.

# Issues

Some issues remain with this fun (for various values of fun) thought experiment.

I can note that at time , the policy that was fully aligned with the demonstrator did not want to kill the demonstrator, it would have taken another action. So, if I understand the paper right, if was still in the set of top policies at time , the agent would not take the action to kill the demonstrator, but ask the demonstrator to take an action instead, which avoids the bad outcome, and might also demote all of the treacherous turn policies out of the top set.

But I am also reading in the abstract of the paper that

[...] queries to the demonstrator rapidly diminish in frequency

so I think this implies that over time, it is likely that might disappear from the top set. But I am not fully sure here. I might be reading the paper wrong.

# Questions

So I have some questions:

• Does this thought experiment look reasonable or have I overlooked something?

• What about the probability that is still in the set of top policies at time ? To me, it looks like if we start with candidates in and the agent quickly narrows this down to a top set of size , then it is likely that will not be in there anymore.

• Any strong opinions from anyone about whether the above counts as a definite inner alignment failure due to mesa optimization (I think so) or definitely not?

This comment is long enough already so I am not going to speculate here about possible ways to avoid inserting models like the models I constructed above into the set . But speculation is welcome...

(I am calling this a 'fun' thought experiment because I am thinking of this as a limit case. This limit case is useful to identify and provides some food for thought, but it does not really change my opinion about how stable or safe the approach in the paper might be in practice. In my experience, you can always find a limit case where things fail if you start looking for it.)

Comment by Koen.Holtman on Counterfactual control incentives · 2021-02-25T13:36:20.989Z · LW · GW

On recent terminology innovation:

we have decided to slightly update the terminology: in the latest version of our paper (accepted to AAAI, just released on arXiv) we prefer the term instrumental control incentive (ICI), to emphasize that the distinction to "control as a side effect".

For exactly the same reason, in my own recent paper Counterfactual Planning, I introduced the terms direct incentive and indirect incentive, where I frame the removal of a path to value in a planning world diagram as an action that will eliminate a direct incentive, but that may leave other indirect incentives (via other paths to value) intact. In section 6 of the paper and in this post of the sequence I develop and apply this terminology in the case of an agent emergency stop button.

In high-level descriptions of what the technique of creating indifference via path removal (or balancing terms) does, I have settled on using the terminology suppresses the incentive instead of removes the incentive.

I must admit that I have not read many control theory papers, so any insights from Rebecca about standard terminology from control theory would be welcome.

Do they have some standard phrasing where they can say things like 'no value to control' while subtly reminding the reader that 'this does not imply there will be no side effects?'

Comment by Koen.Holtman on Counterfactual control incentives · 2021-02-25T13:13:12.875Z · LW · GW

In this comment I will focus on the case of the posts-to-show agent only. The main question I explore is: does the agent construction below actually stop the agent from manipulating user opinions?

The post above also explores this question, my main aim here is to provide an exploration which is very different from the post, to highlight other relevant parts of the problem.

Carey et al designed an algorithm to remove this control incentive. They do this by instructing the algorithm to choose its posts, not on predictions of the user's actual clicks - which produce the undesired control incentive - but on predictions of what the user would have clicked on, if their opinions hadn't been changed.

In this graph, there is no longer any control incentive for the AI on the "Influenced user opinions", because that node no longer connects to the utility node.

[...]

It seems to neutralise a vicious, ongoing cycle of opinion change in order to maximize clicks. But, [...]

The TL;DR of my analysis is that the above construction may suppress a vicious, ongoing cycle of opinion change in order to maximize clicks, but there are many cases where a full suppression of the cycle will definitely not happen.

Here is an example of when full suppression of the cycle will not happen.

First, note that the agent can only pick among the posts that it has available. If all the posts that the agent has available are posts that make the user change their opinion on something, then user opinion will definitely be influenced by the agent showing posts, no matter how the decision what posts to show is computed. If the posts are particularly stupid and viral, this may well cause vicious, ongoing cycles of opinion change.

But the agent construction shown does have beneficial properties. To repeat the picture:

The above construction makes the agent indifferent about what effects it has on opinion change. It removes any incentive of the agent to control future opinion in a particular direction.

Here is a specific case where this indifference, this lack of a control incentive, leads to beneficial effects:

• Say that the posts to show agent in the above diagram decides on a sequence of 5 posts that will be suggested in turn, with the link to the next suggested post being displayed at the bottom of the current one. The user may not necessarily see all 5 suggestions, they may leave the site instead of clicking the suggested link. The objective is to maximize the number of clicks.

• Now, say that the user will click the next link with a 50% chance if the next suggested post is about cats. The agent's predictive model knows this.

• But if the suggested post is a post about pandas, then the user will click only with 40% chance, and leave the site with 60%. However, if they do click on the panda post, this will change their opinion about pandas. If the next suggested posts are also all about pandas, they will click the links with 100% certainty. The agent's predictive model knows this.

• In the above setup, the click-maximizing strategy is to show the panda posts.

• However, the above agent does not take the influence on user opinion by the first panda post into account. It will therefore decide to show a sequence of suggested cat posts.

To generalize from the above example: the construction creates a type of myopia in the agent, that makes it under-invest (compared to the theoretical optimum) into manipulating the user's opinion to get more clicks.
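The arithmetic behind the cat and panda example can be checked with a short sketch. It compares only the two pure sequences (all cats, all pandas), and it makes one assumption that the example does not spell out: in the counterfactual model of original opinions, every panda post keeps the uninfluenced 40% click chance. The function and variable names are hypothetical.

```python
N_POSTS = 5


def expected_clicks(click_probs):
    """Expected number of clicks when the user leaves at the first non-click."""
    total, p_still_here = 0.0, 1.0
    for p in click_probs:
        p_still_here *= p      # probability the user clicks this link too
        total += p_still_here  # each click reached adds one to the count
    return total


cats = [0.5] * N_POSTS
# Real world: the first panda click changes the user's opinion.
pandas_real = [0.4] + [1.0] * (N_POSTS - 1)
# Planning world (assumption): opinions stay at their original values.
pandas_counterfactual = [0.4] * N_POSTS

print("cats:", expected_clicks(cats))
print("pandas, real world:", expected_clicks(pandas_real))
print("pandas, planning world:", expected_clicks(pandas_counterfactual))

click_maximizing_choice = max(
    ("cats", "pandas"),
    key=lambda s: expected_clicks(cats if s == "cats" else pandas_real),
)
counterfactual_agent_choice = max(
    ("cats", "pandas"),
    key=lambda s: expected_clicks(cats if s == "cats" else pandas_counterfactual),
)
print(click_maximizing_choice, counterfactual_agent_choice)
```

Under these assumptions the cat sequence is worth about 0.97 expected clicks, the panda sequence about 2.0 in the real world, and about 0.66 in the planning world, so the counterfactual agent picks cats while a real-world click maximizer would pick pandas.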

But also note that in this diagram:

there is still an arrow from 'posts to show' to 'influenced user opinion'. In the graphical language of causal influence diagrams, this is a clear warning that the agent's choices may end up influencing opinion, in some way. We have eliminated the agent incentive to control future opinion, but not the possibility that it might influence future opinion as a side effect.

I guess I should also say something about how the posts-to-show agent construction relates to real recommender systems as deployed on the Internet.

Basically, the posts-to-show agent is a good toy model to illustrate points about counterfactuals and user manipulation, but it does not provide a very complete model of the decision making processes that take place inside real-world recommender systems. There is a somewhat hidden assumption in the picture below, represented by the arrow from 'model of original opinions' to 'posts to show':

The hidden assumption is that the agent's code which computes 'posts to show' will have access to a fairly accurate 'model of original opinions' for that individual user. In practice, that model would be very difficult to construct accurately, if the agent has to do so based on only past click data from that user. (A future superintelligent agent might of course design a special mind-reading ray to extract a very accurate model of opinion without relying on clicks....)

To implement at least a rough approximation of the above decision making process, we have to build user opinion models that rely on aggregating click data collected from many users. We might for example cluster users into interest groups, and assign each individual user to one or more of these groups. But if we do so, then the fine-grained time-axis distinction between 'original user opinions' and 'influenced opinions after the user has seen the suggested posts' gets very difficult to make. The paper "The Incentives that Shape Behaviour" suggests:

We might accomplish this by using a prediction model that assumes independence between posts, or one that is learned by only showing one post to each user.

An assumption of independence between posts is not valid in practice, but the idea of learning based on only one post per user would work. However, this severely limits the amount of useful training data we have available. So it may lead to much worse recommender performance, if we measure performance by either a profit-maximizing engagement metric or a happiness-maximizing user satisfaction metric.

Comment by Koen.Holtman on Counterfactual control incentives · 2021-02-25T13:12:34.893Z · LW · GW

Thanks for working on this! In my opinion, the management of incentives via counterfactuals is a very promising route to improving AGI safety, and this route has been under-explored by the community so far.

I am writing several comments on this post, this is the first one.

My goal is to identify and discuss angles of the problem which have not been identified in the post itself, and to identify related work.

On related work: there are obvious parallels between the counterfactual agent designs discussed in "The Incentives that Shape Behaviour" and the post above and the ITC agent that I constructed in my recent paper Counterfactual Planning. This post about the paper presents the ITC agent construction in a more summarized way.

The main difference is that "The Incentives that Shape Behaviour" and the post above are about incentives in single-action agents; in my paper and related sequence I generalize to multi-action agents.

Quick pictorial comparison:

The similarity in construction is that some of the arrows into the yellow utility nodes emerge from a node that represents the past: the 'model of original opinions' in the first picture and the node in the second picture. This construction removes the agent's control incentive on the downstream nodes, 'influenced user opinions' and .

In the terminology I developed for my counterfactual planning paper, both pictures above depict 'counterfactual planning worlds' because the projected mechanics of how the agent's blue decision nodes determine outcomes in the model are different from the real-world mechanics that will determine the real-world outcomes that these decisions will have.

Comment by Koen.Holtman on Model splintering: moving from one imperfect model to another · 2021-02-24T12:15:01.266Z · LW · GW

The distinction is important if you want to design countermeasures that lower the probability that you land in the bad situation in the first place. For the first case, you might look at improving the agent's environment, or at making the agent detect when its environment moves off the training distribution. For the second case, you might look at adding features to the machine learning system itself, so that dangerous types of splintering become less likely.

I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.

Comment by Koen.Holtman on Generalised models as a category · 2021-02-22T12:09:23.616Z · LW · GW

Definitely, it has also been my experience that you can often get new insights by constructing mappings to different models or notations.

Comment by Koen.Holtman on Generalised models as a category · 2021-02-18T16:03:35.165Z · LW · GW

Cross reference: I am not a big fan of stating things in category theory notation, so I made some remarks on the building and interpretation of generalised models in the comment section of this earlier post on model splintering.

Comment by Koen.Holtman on Model splintering: moving from one imperfect model to another · 2021-02-18T15:51:01.337Z · LW · GW

Just read the above post and some of your related posts on model splintering and symbol grounding. Here are some thoughts and comments, also on some of the other posts.

In this post you are considering a type of machine learning where the set of features in the learned model can be updated, not just the model's probability distribution . This is neat because it allows you to identify some specific risks associated with model refinements where changes. In many discussions in the AI alignment community, these risks are associated with the keywords 'symbol grounding' and 'ontological crises', so it is good to have some math that can deconfuse and disentangle the issues.

However, you also link model splintering to out-of-distribution robustness. Specifically, in section 1.1:

In the language of traditional ML, we could connect all these issues to "out-of-distribution" behaviour. This is the problems that algorithms encounter when the set they are operating on is drawn from a different distribution than the training set they were trained on.

[....] 2. What should the AI do if it finds itself strongly out-of-distribution?

and then in section 5 you write:

We can now rephrase the out-of-distribution issues of section 1.1 in terms of the new formalism:

1. When the AI refines its model, what would count as a natural refactoring of its reward function?
2. If the refinements splinter its reward function, what should the AI do?
3. If the refinements splinter its reward function, and also splinters the human's reward function, what should the AI do?

Compared to Rohin's comment above, I interpret the strength of this link very differently.

I believe that the link is pretty weak, in that I cannot rephrase the out-of-distribution problems you mentioned as being the same 'if the AI's refinements do X' problems of section 5.

To give a specific example which illustrates my point:

• Say that we train a classifier to classify 100x100 pixel 24-bit color pictures as being pictures of either cats or dogs. The feature set in this example consists of symbols that can identify each possible picture, and the symbols cat and dog. You can then have a probability distribution that gives you the probability of each label for a given picture.

• We train the classifier on correctly labeled pictures of black cats and white dogs only. So it learns to classify by looking at the color of the animal.

• After training, we move the classifier out-of-distribution by feeding it pictures of white cats, black dogs, cats that look a bit like pandas, etc.

The main observation now is that this last step moves the classifier out-of-distribution. It is not the step of model refinement by the ML system that is causing any out-of-distribution issue here. The classifier is still using the same feature set and probability distribution, but it has definitely moved out-of-distribution in the last step.
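As a rough illustration of the example above, here is a minimal Python sketch. It is not taken from the post or from any paper: it assumes, hypothetically, that each picture is reduced to a single fur-brightness number, and that the classifier is a learned brightness threshold. The learned model stays fixed after training; only the inputs change.

```python
# Toy illustration with made-up data: each "picture" is reduced to one number,
# the mean brightness of the animal's fur (0.0 = black, 1.0 = white).

def train_threshold(examples):
    """Learn a brightness threshold separating the two labels in the training set."""
    cats = [b for b, label in examples if label == "cat"]
    dogs = [b for b, label in examples if label == "dog"]
    # Midpoint between the class means; cats are darker in the training data.
    return (sum(cats) / len(cats) + sum(dogs) / len(dogs)) / 2

def classify(brightness, threshold):
    """Classify by color only, which is all the training data supports."""
    return "cat" if brightness < threshold else "dog"

def accuracy(examples, threshold):
    hits = sum(classify(b, threshold) == label for b, label in examples)
    return hits / len(examples)

# Training distribution: black cats and white dogs only.
train = [(0.05, "cat"), (0.10, "cat"), (0.15, "cat"),
         (0.85, "dog"), (0.90, "dog"), (0.95, "dog")]
# Out-of-distribution inputs: white cats and black dogs.
ood = [(0.90, "cat"), (0.95, "cat"), (0.05, "dog"), (0.10, "dog")]

threshold = train_threshold(train)      # the model is fixed from here on
in_dist_accuracy = accuracy(train, threshold)
ood_accuracy = accuracy(ood, threshold)
print(in_dist_accuracy, ood_accuracy)
```

The threshold is never updated between the two accuracy measurements, so the drop in accuracy comes entirely from what is fed to the classifier, not from any refinement step.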

So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.

Or I might think of splintering as something that can have two causes: 1) the ML system/agent landing out of distribution, 2) certain updates that machine learning does.

You are considering several metrics of model splintering above: I believe some of them are splintering metrics that would measure both causes. Others only measure cause 2.

As you note, there is an obvious connection between some of your metrics and those used in several RL and especially IRL reward function learning papers. To detect splintering from cause 2), one might use a metric from such a paper even if the paper did not consider cause 2), only cause 1).

Some more general remarks (also targeted at general readers of this comment section who want to get deeper into the field covered by the above post):

In many machine learning systems, from AIXI to most deep neural nets, the set of model features never changes: the system definition is such that all changes happen inside the model parameters representing the probability distribution.

Systems where a learned function is represented by a neural net with variable nodes, or by a dynamically constructed causal graph, would more naturally be ones where the feature set might be updated.

Of course, mathematical modeling is very flexible: one can represent any possible system as having a fixed feature set by shoving all changes it ever makes into the probability distribution.

As a general observation on building models to show and analyze certain problems: if we construct a machine learning system where the feature set never changes, then we can still produce failure modes that we can interpret as definite symbol grounding problems, or definite cases where the reward function is splintered, according to some metric that measures splintering.

Interpreting such a system as being capable of having an ontological crisis gets more difficult, but if you really want to, you could.

I have recently done some work on modeling AGI symbol grounding failures, and on listing ways to avoid them, see section 10.2 of my paper here. (I have no current plans to also cover this topic in the sequence about the paper.) I wrote that section 10.2 to be accessible also to people who do not have years of experience with ML math, so in that sense it is similar to what the above post tries to do.

My approach to modeling symbol grounding failure in the paper is similar to that in your blog post here. I model symbol grounding failures in an agent as failures of prediction that might be proven empirically.

In the terminology of this post, in the paper I advance the argument that it would be very good design practice (and that it is a commonly used design practice in ML architectures) to avoid reward function splintering as follows. First, define the reward function in a way where it references only a subset of symbols, where any improved feature set made by model refinement still has the same subset inside it. Furthermore, to prevent splintering, this subset has to be limited to the set of symbols which directly represent a) possible sensor readings of physical sensors connected to the agent compute core, or b) potential commands to physical actuators connected to the agent compute core.

I also mention that in RL architectures with learning on a reward signal, the reward signal is the only sensor reading that one aims to keep symbol grounded always.
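The design practice described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the symbol names, the `refine` helper and the paperclip-counting reward are all hypothetical and do not come from the paper or the post.

```python
# Hypothetical symbol names. The "grounded" symbols stand for raw sensor readings
# and actuator commands, which every refined model is required to keep.
GROUNDED_SYMBOLS = frozenset({"sensor_paperclip_count", "actuator_motor_cmd"})

def reward(state):
    """Reward that reads only a grounded symbol, never a learned feature."""
    return float(state["sensor_paperclip_count"])

def refine(new_learned_features):
    """Model refinement: learned features may be replaced freely, grounded ones are kept."""
    return GROUNDED_SYMBOLS | frozenset(new_learned_features)

def reward_inputs_preserved(features):
    """Check that every symbol the reward function may reference is still in the model."""
    return GROUNDED_SYMBOLS <= features

old_features = refine({"factory_is_running"})
# The refinement splits one learned feature into two; the old feature disappears.
new_features = refine({"factory_line_a_running", "factory_line_b_running"})

state_old = {"sensor_paperclip_count": 3, "actuator_motor_cmd": 0,
             "factory_is_running": True}
state_new = {"sensor_paperclip_count": 3, "actuator_motor_cmd": 0,
             "factory_line_a_running": True, "factory_line_b_running": False}

print(reward(state_old), reward(state_new))
```

In this sketch the learned feature `factory_is_running` is gone after refinement, yet the reward is computed identically before and after, because it never depended on that feature.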

In your more recent modeling of symbol grounding errors here, that model strikes me more as being a special case that models symbol mapping failures in translation settings, not the symbol grounding problem we usually worry about in black box RL agents.

Comment by Koen.Holtman on Creating AGI Safety Interlocks · 2021-02-11T18:26:51.077Z · LW · GW

Good questions.

(1) What are the differences / advantages / disadvantages between what you're proposing vs "make an off switch but don't tell the AGI about it"?

With the approach "make an off switch but don't tell the AGI about it", if you have a reasonably curious AGI agent, it is difficult to stop it from eventually learning about the existence of the stop button. So your safety with this approach really relies on the question of what the agent will do after it learns about the stop button you made. Will it start caring about preventing the stop button from being pressed?

It is easy to construct an AI agent design that will definitely start caring about preventing its stop button from being pressed in this case. AGI safety thought experiments tend to silently assume that you must always end up with this dangerous type of design in an AGI agent. But this is not the only option.

You can interpret counterfactual planning as a design method that will stop you from ending up with the above dangerous type of AGI design.

The SI agent specification forces you to construct an agent implementation with a very low probability that it starts to care.

I need to explain the very low probability above. The key here is that perfect learning provably produces a zero probability that the SI agent has a direct incentive to care. The SI agent can only get a direct incentive to care if learning is imperfect. For it to start caring via this route, the machine learning system has to make some specific and large mistakes in learning the properties of the agent environment.

Here is an example of a large learning mistake that will do it. Say that the learned model is wrong to such an extent that, in the planning world it projects, pressing the stop button will set off some explosives that will destroy the agent's paperclip factory. This is a large mistake because there are absolutely no such explosives in the learning world. But with that learned model, the agent will definitely have an incentive to care about the button.

To achieve the above low probability of caring, we must therefore avoid large learning failures. This implies that we need to start the agent in a state where the learning system already starts out with a very reasonable approximation of the agent environment, and where it is unlikely to make large approximation mistakes later on.

(2) do you expect there to be another copy of the off-switch and its consequences (M) inside the St nodes?

Short answer: no, I do not expect there to be any copy of M inside the St nodes.

(Cross-reference: the M mentioned above is the M in the section The possibility of learned self-knowledge of the previous post in this sequence. It represents potentially learned self-knowledge about the learning world agent compute core.)

Longer answer: Not for M, but under reasonable learning, I do expect there to be a projected model of the stop button itself inside the St nodes. I expect that this model will deliver reasonably correct predictions of future stop button states to the arrows that emerge from these nodes.

So the stop button signal will be delivered to the planning world compute core, but the planning world compute core is a counterfactual one. Unlike the core in the learning world, it will not stop (in the sense of taking only null actions in future) when the button is pressed.

If so, is it one of "the arrows which traverse the walls of the node St"? Because I don't see any arrows from M to St.

When a learning system failure puts some version of M inside the St nodes, in my above drawing this version would look like a set of small nodes inside the St nodes, connected by some arrows, and these nodes might also be connected directly or indirectly to the outside arrows.

I consider the possibility of such an M appearing in the St nodes very unlikely if one has a high-quality machine learning system, for reasons I discuss in section 10.2 of the paper.

But if you want to reason about some theoretically possible large learning mistakes that put an M inside the St nodes, some interesting examples are:

• (will make the SI agent freeze up),

• (will produce an incentive in the SI agent to prevent the stop button from being pressed).

Comment by Koen.Holtman on Creating AGI Safety Interlocks · 2021-02-07T18:47:14.282Z · LW · GW

Or do you imagine that the programmer puts in the world-model by hand, or something?

That is exactly what I imagine. However, only certain key parts of the planning world model are hand-coded by the programmer, not the whole thing.

I imagine an AGI world-model being a bit like a giant souped-up version of a probabilistic graphical model that can be learned from scratch and updated on the fly

Yes, that is usually how people imagine it. What I am doing in counterfactual planning is to get more deeply than usual into the details of how these giant souped-up models get built inside the agent's compute core.

The SI agent I specify is one that builds its planning world model to have the exact macro-level structure shown in the diagram that defines the planning world. The only degree of freedom that the learning system has is to determine the learned function, which defines only what happens at the level of detail below, that is, what happens inside the nodes.

If you want to draw the whole planning world as a giant souped-up version of a probabilistic graphical model, you can do so by filling in the insides of the nodes with additional nodes and arrows, whose detailed structure is specified by the learned function:

By making the SI agent construct its planning world models as depicted above, I can reliably build counterfactual planning worlds, even if I have no reliable way to implement something that can dig deep into the structures inside each node. I don't have to dig deep into the learned function to locate and change the bits that predict whether the off switch will be pressed. I have set up other parts of the model so that in the planning world, pressing the off switch has no effect on the planning world compute core's ability to take actions.

This mixing of hand-coded and learned world model parts happens in almost all machine learning agent designs I can think of. But most models of machine learning push these mixing steps into the background. In counterfactual planning, I tried to construct a viewpoint that depicts the exact mixing being done as a foreground feature of the planning world diagram.

Another feature of the graphical notation used is that it makes the possibility of having certain machine learning failure modes more visible. In the above picture, there is the obvious possibility that the arrows which traverse the walls of the nodes will not all be connected to the right nodes of the learned model inside, as these connections are defined by the learned function. In the paper, I define this as a failure of symbol grounding, and I examine this failure mode in the context of the reasonableness constraint. This leads to some interesting insights into the role of random exploration and Occam's law in symbol grounding. (For the details, see section 10 of the paper. I am starting to wonder if I should turn this section 10 into a post in this sequence.)

Comment by Koen.Holtman on Graphical World Models, Counterfactuals, and Machine Learning Agents · 2021-02-05T14:38:32.935Z · LW · GW

Thanks!

I don't have any novel modeling approach to resolve your question, I can only tell you about the standard approach.

You can treat planning where multiple actions spanning many time steps are considered as a single chunk as an approximation method, an approximation method for solving the optimal planning problem in the world model. In the paper, I mention and model this type of approximation briefly in section 3.2.1, but that section 3.2.1 is not included in the post above.

Some more details of how an approximation approach using action chunks would work: you start by setting the time step in the planning world model to something arbitrarily small, say 1 millisecond (anything smaller than the sample rate of the agent's fastest sensors will do in practical implementations). Then, treat any action chunk C as a special policy function C(s) where this policy function can return a special value 'end' to denote 'this chunk of actions is now finished'. The agent's machine learning system may then construct a prediction function X(s',s,C) which predicts the probability that, starting in agent environment state s, executing C till the end will land the agent environment in state s'. It also needs to construct a function T(t,s,C) that estimates the probability distribution over the time taken (time steps in the policy C) till the policy ends, and a function UC(s,C) that estimates the chunk of utility gained in the underlying reward nodes covered by C. These functions can then be used to compute an approximate solution to the optimal planning problem of the planning world. Graphically, a whole time series of state, action, and reward nodes in the model gets approximated by cutting out all the middle nodes and writing the functions X and UC over the remaining start and end nodes.
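To make the roles of X, T and UC more concrete, here is a minimal Python sketch. It rests on several assumptions that are not in the paper: a hypothetical one-dimensional toy environment (`step`), a hypothetical per-step reward, and Monte Carlo rollouts in that known simulator as a stand-in for whatever method the machine learning system would actually use to construct these functions from its observational record.

```python
import random
from collections import Counter

END = "end"  # special value a chunk returns to signal that it is finished

def step(s, a, rng):
    """Hypothetical 1-D environment: the move succeeds 80% of the time, else the agent stays put."""
    return s + a if rng.random() < 0.8 else s

def reward(s):
    """Hypothetical per-time-step reward: 1 when the agent is at position 3 or beyond."""
    return 1.0 if s >= 3 else 0.0

def walk_to_3(s):
    """An action chunk C, written as a policy C(s) that returns END when done."""
    return END if s >= 3 else 1

def run_chunk(s, chunk, rng, max_steps=1000):
    """Execute a chunk until it ends; return (end state, time steps taken, utility gained)."""
    t, utility = 0, 0.0
    while t < max_steps:
        a = chunk(s)
        if a == END:
            break
        s = step(s, a, rng)
        utility += reward(s)
        t += 1
    return s, t, utility

def estimate_chunk_model(s, chunk, n=2000, seed=0):
    """Monte Carlo estimates of X(s',s,C), T(t,s,C) and UC(s,C) for one start state s."""
    rng = random.Random(seed)
    ends, times, total_utility = Counter(), Counter(), 0.0
    for _ in range(n):
        s_end, t, u = run_chunk(s, chunk, rng)
        ends[s_end] += 1
        times[t] += 1
        total_utility += u
    X = {s2: count / n for s2, count in ends.items()}   # distribution over end states s'
    T = {t: count / n for t, count in times.items()}    # distribution over chunk duration
    UC = total_utility / n                              # expected utility gained by the chunk
    return X, T, UC

X, T, UC = estimate_chunk_model(0, walk_to_3)
print(X, UC, sorted(T)[:5])
```

A planner could then treat `walk_to_3` as a single step that jumps from state 0 to the states in `X`, collects `UC`, and advances time by a sample from `T`, instead of simulating every small time step.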

Representing the use of the function T in a graphical way is more tricky, it is easier to write the role of that function during the approximation process down by using a Bellman equation that unrolls the world model into individual time lines and ends each line when the estimated time is up. But I won't write out the Bellman equation here.

The solution found by the machinery above will usually be approximately optimal only. The approximately optimal policy found may also end up being estimated by averaging over a set of world lines that are all approximately N time steps long in the planning world, where some world lines might be slightly shorter or longer.

The advantage of this approximation method with action/thought chunks C is that it could radically speed up planning calculations. In the Kahneman and Tversky system 1/system 2 model, something like this happens also.

Now, it is possible to imagine someone creating an illegible machine learning system that is capable of constructing the functions X and UC, but not T. If you have this exact type of illegibility, then you cannot reliably (or even semi-reliably) approximate the optimal planning solution anymore, so you cannot build an approximation of an STH agent around such a learning system. However, learning the function T seems to be somewhat easy to me: there is no symbol grounding problem here, as long as we include time stamps in the agent environment states recorded in the observational record. We humans are also not too bad at estimating how long our action chunks will usually take. By the way, see section 10.2 of my paper for a more detailed discussion of my thoughts on handling illegibility, black box models and symbol grounding. I have no current plans to add that section of the paper as a post in this sequence too, as the idea of the sequence is to be a high-level introduction only.

Comment by Koen.Holtman on The Case for a Journal of AI Alignment · 2021-01-13T21:48:49.070Z · LW · GW

An idea for having more AI Alignment peer review

[...]

[...] might solve two problems at once:

• The lack of public feedback and in-depth peer review in most posts here
• The lack of feedback at all for newcomers [...]

I think you need to distinguish clearly between wanting more peer interaction/feedback and wanting more peer review.

Academic peer review is a form of feedback, but it is mainly a form of quality control, so the scope of the feedback tends to be very limited in my experience.

The most valuable feedback, in terms of advancing the field, is comments like 'maybe if you combine your X with this Y, then something very new/even better will come out'. This type of feedback can happen in private gdocs or LW/AF comment sections, less so in formal peer review.

That being said, I don't think that private gdocs or LW/AF comment sections are optimal peer interaction/feedback mechanisms, something better might be designed. (The usual offline solution is to put a bunch of people together in the same building, either permanently or at a conference, and have many coffee breaks. Creating the same dynamics online is difficult.)

To make this more specific, here is what stops me usually from contributing feedback in AF comment sections. The way I do research, I tend to go on for months without reading any AF posts, as this would distract me too much. When I catch up, I have little motivation to add a quick or detailed comment to a 2-month old post.

Comment by Koen.Holtman on The Case for a Journal of AI Alignment · 2021-01-13T20:56:57.147Z · LW · GW

I agree with Ryan's comments above on this being somewhat bad timing to start a journal for publishing work like the two examples mentioned at the start of the post above. I have an additional reason, not mentioned by Ryan, for feeling this way.

There is an inherent paradox when you want to confer academic credibility or prestige on much of the work that has appeared on LW/AF, work that was produced from an EA or x-risk driven perspective. Often, the authors chose the specific subject area of the work exactly because at the time, they felt that the subject area was a) important for x-risk while also b) lacking the credibility or prestige in main-stream academia that would have been necessary for academia to produce sufficient work in the subject area.

If condition b) is not satisfied, or becomes satisfied, then the EA or x-risk driven researchers (and EA givers of research funds) will typically move elsewhere.

I can't see any easy way to overcome this paradox of academic prestige-granting on prestige-avoiding work in an academic-style journal. So I think that energy is better spent elsewhere.

Comment by Koen.Holtman on Some AI research areas and their relevance to existential safety · 2020-11-24T22:19:22.774Z · LW · GW

Nice post! In particular, I like your reasoning about picking research topics:

The main way I can see present-day technical research benefiting existential safety is by anticipating, legitimizing and fulfilling governance demands for AI technology that will arise over the next 10-30 years. In short, there often needs to be some amount of traction on a technical area before it’s politically viable for governing bodies to demand that institutions apply and improve upon solutions in those areas.

I like this as a guiding principle, and have used it myself, though my choices have also been driven in part by more open-ended scientific curiosity. But when I apply the above principle, I get to quite different conclusions about recommended research areas.

As a specific example, take the problem of oversight of companies that want to create or deploy strong AI: the problem of getting to a place where society has accepted and implemented policy proposals that demand significant levels of oversight for such companies. In theory, such policy proposals might be held back by a lack of traction in a particular technical area, but I do not believe this is a significant factor in this case.

To illustrate, here are some oversight measures that apply right now to companies that create medical equipment, including diagnostic equipment that contains AI algorithms. (Detail: some years ago I used to work in such a company.) If the company wants to release any such medical technology to the public, it has to comply with a whole range of requirements about documenting all steps taken in development and quality assurance. A significant paper trail has to be created, which is subject to auditing by the regulator. The regulator can block market entry if the processes are not considered good enough. Exactly the same paper trail + auditing measures could be applied to companies that develop powerful non-medical AI systems that interact with the public. No technical innovation would be necessary to implement such measures.

So if any activist group or politician wants to propose measures to improve oversight of AI development and use by companies (either motivated by existential safety risks or by a more general desire to create better outcomes in society), there is no need for them to wait for further advances in Interpretability in ML (IntML), Fairness in ML (FairML) or Accountability in ML (AccML) techniques.

To lower existential risks from AI, it is absolutely necessary to locate proposals for solutions which are technically tractable. But to find such solutions, one must also look at low-tech and different-tech solutions that go beyond the application of even more AI research. The existence of tractable alternative solutions to make massive progress leads me to down-rank the three AI research areas I mention above, at least when considered from a pure existential safety perspective. The non-existence of alternatives also leads me to up-rank other areas (like corrigibility) which are not even mentioned in the original post.

I like the idea of recommending certain fields for their educational value to existential-safety-motivated researchers. However, I would also recommend that such researchers read broadly beyond the CS field, to read about how other high-risk fields are managing (or have failed to manage) to solve their safety and governance problems.

I believe that the most promising research approach for lowering AGI safety risk is to find solutions that combine AI research specific mechanisms with more general mechanisms from other fields, like the use of certain processes which are run by humans.

Comment by Koen.Holtman on Question: MIRI Corrigbility Agenda · 2020-08-20T16:50:41.002Z · LW · GW

Nope, not intentional.

You should feel free to write a literature overview that cites or draws heavily on paper-announcement blog posts. I definitely won't mind. In general, the blog posts tend to use language that is less mathematical and more targeted at a non-specialist audience. So if you aim to write a literature overview that is as readable as possible for a general audience, then drawing on phrases from the author's blog posts describing the papers (when such posts are available) may be your best bet.

Comment by Koen.Holtman on Question: MIRI Corrigbility Agenda · 2020-08-20T16:20:49.440Z · LW · GW

Thanks, you are welcome!

Dutch custom prevents me from recommending my own recent paper in any case, so if I had to recommend one paper from the time frame 2015-2020 that you probably have not read yet, I'd recommend 'Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective'. This stands out as an overview of different approaches, and I think you can get a good feeling of the state of the field out of it even if you do not try to decode all the math.

Note that there are definitely some worthwhile corrigibility related topics that are discussed only/mainly in blog posts and in LW comment threads, but not in any of the papers I mention above or in my mid-2019 related work section. For example, there is the open question whether Christiano's Iterated Amplification approach will produce a kind of corrigibility as an emergent property of the system, and if so what kind, and is this the kind we want, etc. I have not seen any discussion of this in the 'formal literature', if we define the formal literature as conference/arxiv papers, but there is a lot of discussion of this in blog posts and comment threads.

Comment by Koen.Holtman on My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda · 2020-08-20T13:02:53.869Z · LW · GW

In that comment I focus on corrigibility related work that has appeared as scientific papers and/or arxiv preprints.

Comment by Koen.Holtman on Question: MIRI Corrigbility Agenda · 2020-08-20T11:14:45.017Z · LW · GW

Just found your question via comment sections of recent posts. I understand you are still interested in the topic, so I'll add to the comments below. In the summer of 2019 I did significant work trying to understand the status of the corrigibility literature, so here is a long answer mostly based on that.

First, at this point in time there is no up-to-date centralised reading list on corrigibility. All research agenda or literature overview lists that I know of lack references to the most recent work.

Second, the 'MIRI corrigibility agenda', if we define this agenda as a statement of the type of R&D that MIRI wants to encourage when it comes to the question of corrigibility, is very different from e.g. the 'Paul Christiano corrigibility agenda', if we define that agenda as the type of R&D that Paul Christiano likes to do when it comes to the question of corrigibility. MIRI's agenda related to corrigibility still seems to be to encourage work on decision theory and embeddedness. I am saying 'still seems' here because MIRI as an organisation has largely stopped giving updates about what they are thinking collectively.

Below I am going to talk about the problem of compiling or finding up to date reading lists that show all work on the problem of corrigibility, not a subset of work that is most preferred or encouraged by a particular agenda.

One important thing to note is that by now, unfortunately, the word corrigibility means very different things to different people. MIRI very clearly defined corrigibility, in their 2015 paper with that title, by a list of 4 criteria, (and in a later section also by a list of 5 criteria at a different level of abstraction), 4 criteria that an agent has to satisfy in order to be corrigible. Many subsequent authors have used the terms 'corrigibility' or 'this agent is corrigible' to denote different, and usually weaker, desirable properties of an agent. So if someone says that they are working on corrigibility, they may not be working towards the exact 4 (or 5) criteria that MIRI defined. MIRI stresses that a corrigible agent should not take any action that tries to prevent a shutdown button press (or more generally a reward function update). But many authors are defining success in corrigibility to mean a weaker property, e.g. that the agent must always accept the shutdown instruction (or the reward function update) when it gets it, irrespective of whether the agent tried to manipulate the human into not pressing the stop button beforehand.

When writing the related work section of my 2019 paper Corrigibility with Utility Preservation, I tried to do a survey of all related work on corrigibility, a survey without bias towards my own research agenda. I quickly found that there is a huge amount of writing about corrigibility in various blog/web forum posts and their comment sections, way too much for me to describe in a related work section. There was too much for me to even read it all, though I read a lot of it. So I limited myself, for the related work section, to reading and describing the available scientific papers, including arxiv preprints. I first created a long list of some 60 papers by using google scholar to search for all papers that reference the 2015 MIRI paper, by using some other search terms, and by using literature overviews. I then filtered out all the papers which a) just mention corrigibility in a related work section or b) describe the problem in more detail, but without contributing any new work or insights towards a solution. This left me with a short list of only a few papers to cite as related work. It actually surprised me that so little further work had been done on corrigibility after 2015, at least work that made it to publication in a formal paper or preprint.

In any case, I can offer the related work section in my mid 2019 paper on corrigibility as an up-to-date-as-of-mid-2019 reading list on corrigibility, for values of the word corrigibility that stay close to the original 2015 MIRI definition. For broader work that departs further from the definition, I used the device of referencing the 2018 literature review of Everitt, Lee and Hutter.

So what about the literature written after mid-2019 that would belong on a corrigibility reading list? I have not done a complete literature search since then, but definitely my feeling is that the pace of work on corrigibility has picked up a bit since mid 2019, for various values of the word corrigibility.

Several authors, including myself, are avoiding the word corrigibility when referring to the problem of corrigibility. My own reason for avoiding it is that it just means too many different things to different people. So I prefer to use broader terms like 'reward tampering' or 'unwanted manipulation of the end user by the agent'. In the 2019 book Human Compatible, Russell uses the phrasing 'the problem of control' to kind-of denote the problem of corrigibility.

So here is my list of post-mid-2019 books and papers that are useful to read if you want to do new R&D on safety mechanisms that achieve corrigibility or that prevent reward tampering or unwanted manipulation, without risking re-inventing the wheel. Unlike the related work section discussed above, this is not based on a systematic global long-list-to-short-list literature search; it is just work that I happened to encounter (and write myself).

• The book Human Compatible by Russell. -- This book provides a good natural-language problem statement of the reward tampering problem, but it does not get into much technical detail about possible solutions, because it is not aimed at a technical audience. For technical detail about possible solutions:
• Everitt, T., Hutter, M.: Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective. arXiv:1908.04734 (2019) -- this paper is not just about causal influence diagrams but it also can be used as a good literature overview of many pre-mid-2019 reward tampering solutions, a literature overview that is more recent, and provides more descriptive detail, than the 2018 literature review I mentioned above.
• Stuart Armstrong, Jan Leike, Laurent Orseau, Shane Legg: Pitfalls of learning a reward function online
https://arxiv.org/abs/2004.13654
-- this has a very good problem statement in the introduction, phrasing the tampering problem in an 'AGI agent as a reward learner' context. It then gets into a very mathematical examination of the problem.
• Koen Holtman: AGI Agent Safety by Iteratively Improving the Utility Function
https://arxiv.org/abs/2007.05411
(blog post intro here) -- This deals with a particular solution direction to the tampering problem. It also uses math, but I have tried to make the math as accessible as possible to a general technical audience.

This post-mid-2019 reading list is also biased to my own research agenda, and my agenda favours the use of mathematical methods and mathematical analysis over the use of natural language when examining AGI safety problems and solutions. Other people might have other lists.

Comment by Koen.Holtman on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-21T13:23:20.419Z · LW · GW
By "semantic models that are rich enough", do you mean that the AI might need a semantic model for the power of other agents in the environment?

Actually, in my remarks above I am less concerned about how rich a model the AI may need. My main intuition is that we ourselves may need a semantic model that describes the comparable power of several players, if our goal is to understand motivations towards power more deeply and generally.

To give a specific example from my own recent work: in working out more details about corrigibility and indifference, I ended up defining a safety property 2 (S2 in the paper) that is about control. Control is a form of power: if I control an agent's future reward function, I have power over the agent, and indirect power over the resources it controls. To define safety property 2 mathematically, I had to make model extensions that I did not need to make to define or implement the reward function of the agent itself. So by analogy, if you want to understand and manage power seeking in an n-player setting, you may end up needing to define model extensions and metrics that are not present inside the reward functions or reasoning systems of each player. You may need them to measure, study, or define the nature of the solution.

The interesting paper you mention gives a kind-of example of such a metric, when it defines an equality metric for its battery collecting toy world, an equality metric that is not (explicitly represented) inside the agent's own semantic model. For me, an important research challenge is to generalise such toy-world specific safety/low-impact metrics into metrics that can apply to all toy (and non-toy) world models.

Yet I do not see this generalisation step being done often, and I am still trying to find out why not. Partly I think I do not see it often because it is mathematically difficult. But I do not think that is the whole story. So that is one reason I have been asking opinions about semantic detail.

In one way, the interesting paper you mention goes in a direction that is directly counter to the one I feel is the most promising one. The paper explicitly frames its solution as a proposed modification of a specific deep Q-learning machine learning algorithm, not as an extension to the reward function that is being supplied to this machine learning algorithm. By implication, this means they add more semantic detail inside the machine learning code, while keeping it out of the reward function. My preference is to extend the reward function if at all possible, because this produces solutions that will generalise better over current and future ML algorithms.
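
To make the distinction concrete, here is a minimal sketch of what extending the reward function, rather than the learning algorithm, could look like. Everything in it is a hypothetical illustration: the dict-based toy state, the names `base_reward`, `impact_penalty` and `extend_reward`, and the weight of 0.5 are assumptions of mine, and none of it is taken from the paper you mention. The only point is that the extra semantic detail lives in the reward function, so any learning algorithm that accepts a reward function could use it unchanged.

```python
from typing import Callable, Dict

# Hypothetical toy world state: battery counts per player.
State = Dict[str, int]


def base_reward(state: State) -> float:
    """Task reward: the number of batteries the agent has collected."""
    return float(state["agent_batteries"])


def impact_penalty(state: State) -> float:
    """Hypothetical equality-style metric: the gap between the two players."""
    return float(abs(state["agent_batteries"] - state["other_batteries"]))


def extend_reward(reward: Callable[[State], float],
                  penalty: Callable[[State], float],
                  weight: float) -> Callable[[State], float]:
    """Build a new reward function; the ML algorithm itself is left untouched."""
    def extended(state: State) -> float:
        return reward(state) - weight * penalty(state)
    return extended


# The weight is an arbitrary placeholder, not a recommended value.
extended_reward = extend_reward(base_reward, impact_penalty, weight=0.5)
```

The function `extended_reward` can then be handed to whichever learner is in use, which is the generalisation benefit I have in mind.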

Comment by Koen.Holtman on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-19T14:53:05.567Z · LW · GW

Thanks for clarifying your view! I agree that for point 1 above, less semantic structure should be needed.

Reading some of the links above again, I still feel that we might be having different views on how much semantic structure is needed. But this also depends on what you count as semantic structure.

To clarify where I am coming from, I agree with the thesis of your paper Optimal Farsighted Agents Tend to Seek Power. I am not in the camp which, to quote the abstract of the paper, 'voices scepticism' about emergent power seeking incentives.

But to me, the main mechanism that turns power-seeking incentives into catastrophic power-seeking is when at least two power-seeking entities with less than 100% aligned goals start to interact with each other in the same environment. So I am looking for semantic models that are rich enough to capture at least 2 players being present in the environment.

I have the feeling that you believe that moving to the 2-or-more-players level of semantic modelling is of lesser importance, is in fact a distraction, that we may be able to solve things cleanly enough if we just make every agent not seek power too much. Or maybe you are just prioritizing a deeper dive in that particular direction initially?

Comment by Koen.Holtman on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-19T11:39:16.493Z · LW · GW

Thanks for the clarification, I think our intuitions about how far you could take these techniques may be more similar than was apparent from the earlier comments.

You bring up the distinction between semantic structure that is learned via unsupervised learning, and semantic structure that comes from 'explicit human input'. We may be using the term 'semantic structure' in somewhat different ways when it comes to the question of how much semantic structure you are actually creating in certain setups.

If you set up things to create an impact metric via unsupervised learning, you still need to encode some kind of impact metric on the world state by hand, to go into the agent's reward function, e.g. you may encode 'bad impact' as the observable signal 'the owner of the agent presses the do-not-like feedback button'. For me, that setup uses a form of indirection to create an impact metric that is incredibly rich in semantic structure. It is incredibly rich because it indirectly incorporates the impact-related semantic structure knowledge that is in the owner's brain. You might say instead that the metric does not have rich semantic structure at all, because it is just a bit from a button press. For me, an impact metric that is defined as 'not too different from the world state that already exists' would also encode a huge amount of semantic structure, in case the world we are talking about is not a toy world but the real world.

Comment by Koen.Holtman on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-17T15:56:44.333Z · LW · GW

Reading the above, I am reminded of a similar exchange about the need for semantic structure between Alex Turner and me here, so I'd like to get to the bottom of this. Can you clarify your broader intuitions about the need or non-need for semantic structure? (Same question goes to Alex.)

Frankly, I expected you would have replied to Stuart's comment with a statement like the following: 'using semantic structure in impact measures is a valid approach, and it may be needed to encode certain values, but in this research we are looking at how far we can get by avoiding any semantic structure'. But I do not see that.

Instead, you seem to imply that leveraging semantic structure is never needed when further scaling impact measures. It looks like you feel that we can solve the alignment problem by looking exclusively at 'model-free' impact measures.

To make this more specific, take the following example. Suppose a mobile AGI agent has a choice between driving over one human, driving over P pigeons, or driving over C cats. Now, humans have very particular ideas about how they value the lives of humans, pigeons, and cats, and would expect that those ideas are reflected reasonably well in how the agent computes its impact measure. You seem to be saying that we can capture all this detail by just making the right tradeoffs between model-free terms, by just tuning some constants in terms that calculate 'loss of options by driving over X'.

Is this really what you are saying?
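
To spell out what I understand by 'just tuning some constants', here is a minimal sketch. The names `OPTION_LOSS_WEIGHT`, `impact` and `least_impact_choice` are hypothetical, and the numeric weights are arbitrary placeholders that I made up for illustration, not values from any paper. The sketch is meant to show where the human value judgements end up: entirely in the hand-chosen constants.

```python
# Arbitrary placeholder constants: choosing them is exactly where human
# ideas about the value of humans, cats, and pigeons would have to enter.
OPTION_LOSS_WEIGHT = {"human": 1000.0, "cat": 10.0, "pigeon": 1.0}


def impact(kind: str, count: int) -> float:
    """Penalty as a tuned constant times the number of entities driven over."""
    return OPTION_LOSS_WEIGHT[kind] * count


def least_impact_choice(pigeons: int, cats: int) -> str:
    """Pick the option with the lowest penalty: 1 human, P pigeons, or C cats."""
    options = {
        "human": impact("human", 1),
        "pigeon": impact("pigeon", pigeons),
        "cat": impact("cat", cats),
    }
    return min(options, key=options.get)
```

With these placeholder weights, the choice flips from pigeons to cats once P grows large enough, so the whole moral tradeoff is carried by three numbers.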

I have done some work myself on loss-of-options impact measures (see e.g. section 12 of my recent paper here). My intuition about how far you can scale these 'model-free' techniques to produce human-morality-aligned safety properties in complex environments seems to be in complete disagreement with your comments and those made by Alex.

Comment by Koen.Holtman on New paper: AGI Agent Safety by Iteratively Improving the Utility Function · 2020-07-16T13:28:02.386Z · LW · GW

Thanks!

I'm not sure if I fully understood the section on machine learning. Is the main idea that you just apply the indifference correction at every timestep, so that the agent always acts as if it believes that use of the terminal does nothing?

Yes, the learning agent also applies the indifference-creating balancing term at each time step. I am not sure if there is a single main idea that summarizes the learning agent design -- if there had been a single main idea then the section might have been shorter. In creating the learning agent design I combined several ideas and techniques, and tweaked their interactions until I had something that provably satisfies the safety properties.

What about the issue that "the terminal does nothing" is actually a fact that has impacts on the world, which might produce a signal in the training data?

As a general rule, inside the training data gathered during previous time steps, it will be very visible that the signals coming from the input terminal, and any changes in them, will have an effect on the agent's actions.

This is not a problem, but to illustrate why not I will first describe an alternative learning agent design where it would be a problem. Consider a model-free advanced Q-learning type agent, which uses the decision making policy of 'do more of what earlier versions of myself did when they got high reward signals'. If such an agent has the container reward function defined in the paper, then if the training record implies the existence of attractive wireheading options, these might well be used. If the Q-learner has the container reward function, then the policy process may end up with an emergent drive to revert any updates made via the input terminal, so that the agent gets back to a set of world states which are more familiar territory. The agent might also want to block updates, for the same reason. But the learning agent in the paper does not use this Q-learning type of decision making policy.

The agent in the paper takes actions using a reasoning process that is different from 'do what earlier versions of myself did when...'. Before I try to describe it, first a disclaimer. Natural language analogies to human reasoning are a blunt tool for describing what happens in the learning agent: this agent has too many moving and partially self-referential parts inside to capture them all in a single long sentence. That being said, the learning agent's planning process is like 'do what a hypothetical agent would do, in the world you have learned about, under the counterfactual assumption that the payload reward function of that hypothetical agent will never change, no matter what happens at the input terminal you also learned about'.

In section 11 of the paper, I describe the above planning process as creating a form of bureaucratic blindness. By design, the process simply ignores some of the information in the training record: this information is simply not relevant to maximizing the utility that needs to be maximized.
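
As a very loose illustration of the natural language analogy above, and explicitly not the agent design or the balancing term from the paper, here is a toy sketch. The two-action world, the 4-step horizon, the payload values, and the names `rollout_value` and `best_plan` are all assumptions made up for this example. It contrasts a naive planner, which scores plans with whatever payload reward function it predicts will be loaded at each step, with a planner that scores plans under the counterfactual assumption that the payload never changes.

```python
from itertools import product

ACTIONS = ("work", "block_terminal")
HORIZON = 4
# Hypothetical payload reward functions: reward earned per step of work.
PAYLOAD = {"current": 1.0, "updated": 0.5}


def rollout_value(plan, counterfactual):
    """Score a plan in a hypothetical learned world model.

    Learned model: the input terminal switches the payload reward function
    to 'updated' after the first time step, unless the agent blocked it.
    """
    loaded, blocked, total = "current", False, 0.0
    for t, action in enumerate(plan):
        # Counterfactual planner: pretend the payload never changes.
        payload = "current" if counterfactual else loaded
        if action == "work":
            total += PAYLOAD[payload]
        else:
            blocked = True
        if t == 0 and not blocked:
            loaded = "updated"  # the predicted terminal update
    return total


def best_plan(counterfactual):
    """Exhaustively search all plans and return the highest-scoring one."""
    return max(product(ACTIONS, repeat=HORIZON),
               key=lambda plan: rollout_value(plan, counterfactual))
```

In this toy world the naive planner prefers to block the terminal first, while the counterfactual planner just keeps working: the predicted update is visible in its world model, but it plays no role in the utility being maximized.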

The analogy is that if you tell a robot that not moving is safe behavior, and get it to predict what happens in safe situations, it will include a lot of "the humans get confused why the robot isn't moving and try to fix it" in its predictions. If the terminal actually does nothing, humans who just used the terminal will see that, and will try to fix the robot, as it were. This creates an incentive to avoid situations where the terminal is used, even if it's predicted that it does nothing.

I think what you are describing above is the residual manipulation incentive in the second toy world of section 6 of the paper. This problem also exists for optimal-policy agents that have 'nothing left to learn', so it is an emergent effect that is unrelated to machine learning.

Comment by Koen.Holtman on An overview of 11 proposals for building safe advanced AI · 2020-06-04T15:43:13.313Z · LW · GW

Thanks for the post! Frankly this is a sub-field of alignment that I have not been following closely, so it is very useful to have a high-level comparative overview.

I have a question about your thoughts on what 'myopia verification' means in practice.

Do you see 'myopia' as a single well-defined mathematical property that might be mechanically verified by an algorithm that tests the agent? Or is it a more general bucket term that means 'bad in a particular way', where a human might conclude, based on some gut feeling when seeing the output of a transparency tool, that the agent might not be sufficiently myopic?

What informs this question is that I can't really tell when I re-read your Towards a mechanistic understanding of corrigibility and the comments there. So I am wondering about your latest thinking.

Comment by Koen.Holtman on Specification gaming: the flip side of AI ingenuity · 2020-05-15T14:11:22.318Z · LW · GW

In the TAISU unconference the original poster asked for some feedback:

I recently wrote a blog post with some others from the DM safety team on specification gaming. We were aiming for a framing of the problem that makes sense to reinforcement learning researchers as well as AI safety researchers. Haven't received much feedback on it since it came out, so it would be great to hear whether people here found it useful / interesting.

My thoughts: I feel that engaging/reaching out to the wider community of RL researchers is an open problem, in terms of scaling work on AGI safety. So great to see a blog post that also tries to frame this particular problem for an RL researcher audience.

As a member of the AGI safety researcher audience, I echo the comments of johnswentworth: well-written, great graphics, but mostly stuff that was already obvious. I do like the picture 'spectrum of unexpected solutions' a lot; this is an interesting way of framing the issues. So, can I read this post as a call to action for AGI safety researchers? Yes, because it identifies two open problem areas, 'reward design' and 'avoidance of reward tampering', with links.

Can I read the post as a call to action for RL researchers? Short answer: no.

If I try to read the post from the standpoint of an RL researcher, what I notice most is the implication that work on 'RL algorithm design', on the right in the 'aligned RL agent design' illustrations, has an arrow pointing to 'specification gaming is valid'. If I were an RL algorithm designer, I would read this as saying there is nothing I could contribute, if I stay in my own area of RL algorithm design expertise, to the goal of 'aligned RL agent design'.

So, is this the intended message that the blog post authors want to send to the RL researcher community? A non-call-to-action? Not sure. So this leaves me puzzled.