Posts

Reward Hacking from a Causal Perspective 2023-07-21T18:27:39.759Z
Incentives from a causal perspective 2023-07-10T17:16:28.373Z
Agency from a causal perspective 2023-06-30T17:37:58.376Z
Causality: A Brief Introduction 2023-06-20T15:01:39.377Z
Introduction to Towards Causal Foundations of Safe AGI 2023-06-12T17:55:24.406Z
Progress on Causal Influence Diagrams 2021-06-30T15:34:33.381Z
Specification gaming: the flip side of AI ingenuity 2020-05-06T23:51:58.171Z
CIRL Wireheading 2017-08-08T06:33:57.000Z
Sequential Extensions of Causal and Evidential Decision Theory 2015-10-15T23:45:48.000Z

Comments

Comment by tom4everitt on A Shutdown Problem Proposal · 2024-01-23T13:11:45.317Z · LW · GW

The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:

  • Does not want to manipulate the shutdown button
  • Does respond to the shutdown button
  • Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)

If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.

 

From a quick read, your proposal seems closely related to Jessica Taylor's causal-counterfactual utility indifference. Ryan Carey and I also recently had a paper formalising some similar ideas, with some further literature review: https://arxiv.org/abs/2305.19861

Comment by tom4everitt on 3. Premise three & Conclusion: AI systems can affect value change trajectories & the Value Change Problem · 2023-11-02T16:47:27.873Z · LW · GW

I really like this articulation of the problem!

To me, a way to point to something similar is to say that preservation (and enhancement) of human agency is important (value change being one important way that human agency can be reduced). https://www.alignmentforum.org/s/pcdHisDEGLbxrbSHD/p/Qi77Tu3ehdacAbBBe

One thing I've been trying to argue for is that we might try to pivot agent foundations research to focus more on human agency instead of artificial agency. For example, I think value change is an example of self-modification, which has been studied a fair bit for artificial agents.

Comment by tom4everitt on Reward Hacking from a Causal Perspective · 2023-08-15T09:58:41.453Z · LW · GW

I see, thanks for the careful explanation.

I think the kind of manipulation you have in mind is bypassing the human's rational deliberation, which is an important one. This is roughly what I have in mind when I say "covert influence". 

So in response to your first comment: given that the above can be properly defined, there should also be a distinction between using and not using covert influence?

As for whether manipulation can be defined as penetration of a Markov blanket: it's possible. My main question is how much characterising it in terms of a Markov blanket adds to the analysis, because it's non-trivial to define the membrane variable in a way that information which "covertly" passes through my eyes and ears bypasses the membrane, while other information is mediated by it.

The SEP article does a pretty good job at spelling out the many different forms manipulation can take https://plato.stanford.edu/entries/ethics-manipulation/

Comment by tom4everitt on Reward Hacking from a Causal Perspective · 2023-08-14T17:06:07.917Z · LW · GW

The point here isn't that the content recommender is optimised to use covert means in particular, but that it is not optimised to avoid them. Therefore it may well end up using them, as they might be the easiest path to reward.

Re Markov blankets, won't any kind of information penetrate a human's Markov blanket, as any information received will alter the human's brain state?

Comment by tom4everitt on Agency from a causal perspective · 2023-07-07T17:28:46.623Z · LW · GW

Thanks, that's a nice compilation; I added the link to the post. Let me check with some of the others in the group, who might be interested in chatting further about this.

Comment by tom4everitt on Agency from a causal perspective · 2023-07-07T17:24:15.892Z · LW · GW

fixed now, thanks! (somehow it added https:// automatically)

Comment by tom4everitt on Causality: A Brief Introduction · 2023-07-07T17:21:54.893Z · LW · GW

Sure, I think we're saying the same thing: causality is frame dependent, and the variables define the frame (in your example, you and the sensor have different measurement procedures for detecting the purple cube, so you don't actually talk about the same random variable).

How big a problem is it? In practice it usually seems fine, if we're careful to test our sensors / double-check that we're using language in the same way. In theory, scaled up to superintelligence, it's not impossible that it would be a problem.

But I would also like to emphasize that the problem you're pointing to isn't restricted to causality, it goes for all kinds of linguistic reference. So to the extent we like to talk about AI systems doing things at all, causality is no worse than natural language, or other formal languages.

I think people sometimes hold it to a higher bar than natural language, because it feels like a formal language could somehow naturally intersect with a programmed AI. But of course causality doesn't solve the reference problem in general. Partly for this reason, we're mostly using causality as a descriptive language to talk clearly and precisely (relative to human terms) about AI systems and their properties.

Comment by tom4everitt on Causality: A Brief Introduction · 2023-07-04T10:45:25.136Z · LW · GW

The way I think about this is that the variables constitute a reference frame. They define particular well-defined measurements that can be done, which all observers would agree about. In order to talk about interventions, there must also be a well-defined "set" operation associated with each variable, so that the effect of interventions is well-defined.

Once we have the variables, and a "set" and "get" operation for each (i.e. intervene and observe operations), then causality is an objective property of the universe. Regardless of who does the experiment (i.e. sets a few variables) and does the measurement (i.e. observes some variables), the outcome will follow the same distribution.

So in short, I don't think we need to talk about an agent observer beyond what we already say about the variables.
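
To make this concrete, here's a minimal sketch in Python (my own toy example; the variables, probabilities, and the Bernoulli model are made up for illustration). Each variable comes with a "get" (observe) and a "set" (intervene) operation, and once those are fixed, anyone running the experiment samples from the same interventional distribution:

```python
import random

# Toy structural causal model X -> Y (all numbers invented for illustration).
# "get" = observe a variable's value; "set" = intervene and clamp it.

def sample(do_x=None):
    """Sample (X, Y); if do_x is given, X is set by intervention instead of observed."""
    x = (random.random() < 0.5) if do_x is None else do_x  # X ~ Bernoulli(0.5) unless intervened on
    y = x if random.random() < 0.9 else (not x)            # Y copies X with probability 0.9
    return x, y

# Observational query P(Y=1 | X=1): "get" both variables, condition on X.
obs = [sample() for _ in range(10_000)]
p_y_given_x1 = sum(y for x, y in obs if x) / max(1, sum(x for x, _ in obs))

# Interventional query P(Y=1 | do(X=1)): "set" X, "get" Y.
intv = [sample(do_x=True) for _ in range(10_000)]
p_y_do_x1 = sum(y for _, y in intv) / len(intv)

# Both are ~0.9 here, and crucially, whoever runs the experiment with these
# set/get operations will see samples from the same distributions.
print(p_y_given_x1, p_y_do_x1)
```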

Comment by tom4everitt on Causality: A Brief Introduction · 2023-06-26T14:24:57.763Z · LW · GW

nice, yes, I think logical induction might be a way to formalise this, though others would know much more about it

Comment by tom4everitt on Causality: A Brief Introduction · 2023-06-22T16:34:54.157Z · LW · GW

I had intended to be using the program's output as a time series of bits, where we are considering the bits to be "sampling" from A and B. Let's say it's a program that outputs the binary digits of pi. I have no idea what the bits are (after the first few) but there is a sense in which P(A) = 0.5 for either A = 0 or A = 1, and at any timestep. The same is true for P(B). So P(A)P(B) = 0.25. But clearly P(A = 0, B = 0) = 0.5, and P(A = 0, B = 1) = 0, et cetera. So in that case, they're not probabilistically independent, and therefore there is a correlation not due to a causal influence.

 

Just to chip in on this: in the case you're describing, the numbers are not statistically correlated, because they are not random in the statistics sense. They are only random given logical uncertainty. 

When considering logical "random" variables, there might well be a common logical "cause" behind any correlation. But I don't think we know how to properly formalise or talk about that yet. Perhaps one day we can articulate a logical version of Reichenbach's principle :)
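
To illustrate the point above, here's a toy sketch of my own (a fixed-seed PRNG stands in for the binary digits of pi, just to keep it self-contained):

```python
import random

# A_t and B_t are both "the t-th bit of the same deterministic stream" (think: binary
# digits of pi; a fixed-seed PRNG stands in here so the example runs as-is).
stream = random.Random(0)                      # deterministic once the seed is fixed
bits = [stream.randint(0, 1) for _ in range(10_000)]
A, B = bits, bits                              # two names for one deterministic sequence

freq_A1 = sum(A) / len(A)                                 # ~0.5: A "looks like" a fair coin
freq_B1 = sum(B) / len(B)                                 # ~0.5: so does B
freq_both1 = sum(a and b for a, b in zip(A, B)) / len(A)  # ~0.5, not 0.25: they always agree

print(freq_A1, freq_B1, freq_both1)
# Nothing here is random in the statistical sense: given the definition of the stream,
# every bit is determined. The apparent dependence lives in our logical uncertainty,
# which is why Reichenbach's principle doesn't directly apply.
```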

Comment by tom4everitt on Causality: A Brief Introduction · 2023-06-22T16:17:08.269Z · LW · GW

Thanks for the suggestion. We made an effort to be brief, but perhaps we went too far. In our paper Reasoning about causality in games, we have a longer discussion about probabilistic, causal, and structural models (in Section 2), and Pearl's book A Primer also offers a more comprehensive introduction.

I agree with you that causality offers a way to make out-of-distribution predictions (in post number 6, we plan to go much deeper into this). In fact, a causal Bayesian network is equivalent to an exponentially large set of probability distributions, with one joint distribution $P_{\mathrm{do}(X=x)}$ for every possible combination of interventions $X=x$.
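
As a concrete toy example (mine, not from the post): with two binary variables X → Y, the single CBN below pins down one distribution for each of the 3^2 = 9 possible intervention settings (each variable is either left alone or clamped to 0 or 1), and exponentially many in general:

```python
from itertools import product

# Toy CBN over two binary variables, X -> Y (probabilities invented).
P_X = {0: 0.3, 1: 0.7}                        # P(X)
P_Y_given_X = {0: {0: 0.9, 1: 0.1},           # P(Y | X=0)
               1: {0: 0.2, 1: 0.8}}           # P(Y | X=1)

def interventional_distribution(do):
    """Truncated factorization: P_do(x, y) for an intervention dict like {'X': 1}."""
    dist = {}
    for x, y in product((0, 1), repeat=2):
        px = 1.0 if do.get('X') == x else (0.0 if 'X' in do else P_X[x])
        py = 1.0 if do.get('Y') == y else (0.0 if 'Y' in do else P_Y_given_X[x][y])
        dist[(x, y)] = px * py
    return dist

# One distribution per intervention setting: {} is the purely observational one.
settings = [dict(zip(vars_, vals))
            for vars_ in ([], ['X'], ['Y'], ['X', 'Y'])
            for vals in product((0, 1), repeat=len(vars_))]
for do in settings:
    print(do, interventional_distribution(do))   # 9 = 3^2 distributions from one CBN
```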

We'll probably at least add some pointers to further reading, per your suggestion. (ETA: also added a short paragraph near the end of the Intervention section.)

Comment by tom4everitt on Introduction to Towards Causal Foundations of Safe AGI · 2023-06-16T15:47:40.908Z · LW · GW

Preferences and goals are obviously very important. But I'm not sure they are inherently causal, which is why they don't have their own bullet point on that list. We'll go into more detail in subsequent posts.

Comment by tom4everitt on Introduction to Towards Causal Foundations of Safe AGI · 2023-06-15T17:05:35.043Z · LW · GW

I'm not sure I entirely understand the question, could you elaborate? Utility functions will play a significant role in follow-up posts, so in that sense we're heavily building on VNM.

Comment by tom4everitt on Discovering Agents · 2023-01-31T09:39:39.398Z · LW · GW

The idea ... works well on mechanised CIDs whose variables are neatly divided into object-level and mechanism nodes. ... But to apply this to a physical system, we would need a way to obtain such a partition of those variables

Agree, the formalism relies on a division of variables. One thing that I think we should perhaps have highlighted much more is Appendix B in the paper, which shows how you get a natural partition of the variables from just knowing the object-level variables of a repeated game.

Does a spinal reflex count as a policy?

A spinal reflex would be different if humans had evolved in a different world. So it reflects an agentic decision by evolution. In this sense, it is similar to the thermostat, which inherits its agency from the humans that designed it.

Does an ant's decision to fight come from a representation of a desire to save its queen?

Same as above.

How accurate does its belief about the forthcoming battle have to be before this representation counts?

One thing that I'm excited to think further about is what we might call "proper agents", that are agentic in themselves, rather than just inheriting their agency from the evolution / design / training process that made them. I think this is what you're pointing at with the ant's knowledge. Likely it wouldn't quite be a proper agent (but a human would, as we are able to adapt without re-evolving in a new environment). I have some half-developed thoughts on this.

Comment by tom4everitt on Clarifying AI X-risk · 2022-11-14T11:58:16.566Z · LW · GW

This makes sense, thanks for explaining. So a threat model with specification gaming as its only technical cause can cause x-risk under the right (i.e. wrong) societal conditions.

Comment by tom4everitt on Clarifying AI X-risk · 2022-11-04T12:12:37.198Z · LW · GW

For instance: why expect that we need a multi-step story about consequentialism and power-seeking in order to deceive humans, when RLHF already directly selects for deceptive actions?

Is deception alone enough for x-risk? If we have a large language model that really wants to deceive any human it interacts with, then a number of humans will be deceived. But it seems like the danger stops there. Since the agent lacks intent to take over the world or similar, it won't be systematically deceiving humans to pursue some particular agenda of the agent. 

As I understand it, this is why we need the extra assumption that the agent is also a misaligned power-seeker.

Comment by tom4everitt on Agency engineering: is AI-alignment "to human intent" enough? · 2022-09-29T13:16:20.058Z · LW · GW

I think the point that even an aligned agent can undermine human agency is interesting and important. It relates to some of our work on defining agency and preventing manipulation. (Which I know you're aware of, so I'm just highlighting the connection for others.)

Comment by tom4everitt on Discovering Agents · 2022-09-02T10:46:07.280Z · LW · GW

Sorry, I worded that slightly too strongly. It is important that causal experiments can in principle be used to detect agents. But to me, the primary value of this isn't that you can run a magical algorithm that lists all the agents in your environment. That's not possible, at least not yet. Instead, the primary value (as I see it) is that the experiment could be run in principle, thereby grounding our thinking. This often helps, even if we're not actually able to run the experiment in practice.

I interpreted your comment as "CIDs are not useful, because causal inference is hard". I agree that causal inference is hard, and unlikely to be automated anytime soon. But to me, automatic inference of graphs was never the intended purpose of CIDs.

Instead, the main value of CIDs is that they help make informal, philosophical arguments crisp, by making assumptions and inferences explicit in a simple-to-understand formal language.

So it's from this perspective that I'm not overly worried about the practicality of the experiments.

Comment by tom4everitt on Discovering Agents · 2022-08-24T16:02:22.755Z · LW · GW

The way I see it, the primary value of this work (as well as other CID work) is conceptual clarification. Causality is a really fundamental concept, which many other AI-safety relevant concepts build on (influence, response, incentives, agency, ...). The primary aim is to clarify the relationships between concepts and to derive relevant implications. Whether there are practical causal inference algorithms or not is almost irrelevant. 

TLDR: Causality > Causal inference :)

Comment by tom4everitt on Will Capabilities Generalise More? · 2022-07-13T20:53:56.881Z · LW · GW

Sure, humans are sometimes inconsistent, and we don't always know what we want (thanks for the references, that's useful!). But I suspect we're mainly inconsistent in borderline cases, which aren't catastrophic to get wrong. I'm pretty sure humans would reliably state that they don't want to be killed, or that lots of other people die, etc. And that when they have a specific task in mind, they state that they want the task done rather than not. All this is subject to them actually understanding the main considerations for whatever plan or outcome is in question, but that is exactly what debate and RRM are for.

Comment by tom4everitt on Will Capabilities Generalise More? · 2022-07-06T15:23:50.543Z · LW · GW

alignment of strong optimizers simply cannot be done without grounding out in something fundamentally different from a feedback signal.

I don't think this is obvious at all.  Essentially, we have to make sure that humans give feedback that matches their preferences, and that the agent isn't changing the human's preferences to be more easily optimized.

We have the following tools at our disposal:

  1. Recursive reward modelling / Debate. By training agents to help with feedback, improvements in optimization power boost both the feedback and the process potentially fooling the feedback. It's possible that it's easier to fool humans than it is to help them not be fooled, but it's not obvious this is the case.
  2. Path-specific objectives. By training an explicit model of how humans will be influenced by agent behavior, we can design an agent that optimizes the hypothetical feedback that would have been given, had the agent's behavior not changed the human's preferences (under some assumptions).

This makes me mildly optimistic about using feedback even for relatively powerful optimization.
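
To gesture at what I mean by point 2, here's a deliberately tiny sketch (the dynamics and numbers are invented by me, and it is not the construction from the path-specific objectives paper): the agent is scored against the feedback the human would have given had the agent's behaviour not shifted the human's preferences.

```python
# Toy setup: the agent picks an action that both produces an outcome and can drag the
# human's preferences towards that outcome. All functions and numbers are invented.

def outcome(action):                       # effect of the action on the world
    return action

def shifted_preference(pref, action):      # effect of the action on the human's preferences
    return pref + 0.8 * action             # larger actions pull preferences along with them

def feedback(pref, out):                   # feedback the human gives for an outcome
    return -(out - pref) ** 2

initial_pref = 1.0
for a in (0.0, 1.0, 3.0):
    out = outcome(a)
    naive = feedback(shifted_preference(initial_pref, a), out)            # feedback after influence
    path_specific = feedback(shifted_preference(initial_pref, 0.0), out)  # feedback had preferences not been shifted
    print(f"action={a}: naive={naive:.2f}, path-specific={path_specific:.2f}")

# The naive feedback signal is maximised by the most manipulative action (a=3.0),
# while the path-specific objective is maximised by a=1.0: the same outcome is
# evaluated against the preferences the human would have had absent the influence.
```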

Comment by tom4everitt on Investigating causal understanding in LLMs · 2022-06-14T15:45:43.385Z · LW · GW

Really interesting, even though the results aren't that surprising. I'd be curious to see how the results improve (or not) with more recent language models. I also wonder if there are other formats to test causal understanding. For example, what if it receives a more natural story plot (about Red Riding Hood, say), and is asked some causal questions ("what would have happened if grandma wasn't home when the wolf got there?", say).

It's less clean, but it could be interesting to probe it in a few different ways.

Comment by tom4everitt on Various Alignment Strategies (and how likely they are to work) · 2022-05-24T09:25:24.126Z · LW · GW

Nice post! The Game Theory / Bureaucracy is interesting. It reminds me of Drexler's CAIS proposal, where services are combined into an intelligent whole. But I (and Drexler, I believe) agree that much more work could be spent on figuring out how to actually design/combine these systems.

Comment by tom4everitt on Causality, Transformative AI and alignment - part I · 2022-01-31T16:04:29.851Z · LW · GW

Nice summary, thanks for sharing!

Comment by tom4everitt on Causality, Transformative AI and alignment - part I · 2022-01-28T14:20:59.652Z · LW · GW

Thanks Marius and David, really interesting post, and super glad to see interest in causality picking up!

I very much share your "hunch that causality might play a role in transformative AI and feel like it is currently underrepresented in the AI safety landscape."

Most relevant, I've been working with Mary Phuong on a project which seems quite related to what you are describing here. I don't want to share too many details publicly without checking with Mary first, but if you're interested perhaps we could set up a call sometime?

I also think causality is relevant to AGI safety in several additional ways to those you mention here. In particular, we've been exploring how to use causality to describe agent incentives for things like corrigibility and tampering (summarized in this post), formalizing ethical concepts like intent, and understanding agency.

So really curious to see where your work is going and potentially interested in collaborating!

Comment by tom4everitt on Causality, Transformative AI and alignment - part I · 2022-01-28T13:59:28.020Z · LW · GW

There are numerous techniques for this, based on e.g. symmetries, conserved properties, covariances, etc.. These techniques can generally be given causal justification.

 

I'd be curious to hear more about this, if you have some pointers

Comment by tom4everitt on Progress on Causal Influence Diagrams · 2021-07-01T13:19:02.392Z · LW · GW

Thanks Ilya for those links; in particular, the second one looks quite relevant to something we've been working on in a rather different context (that's the benefit of speaking the same language!).

We would also be curious to see a draft of the MDP-generalization once you have something ready to share!

Comment by tom4everitt on AMA: Paul Christiano, alignment researcher · 2021-05-17T08:55:12.490Z · LW · GW

 

  • I think the existing approach and easy improvements don't seem like they can capture many important incentives such that you don't want to use it as an actual assurance (e.g. suppose that agent A is predicting the world and agent B is optimizing A's predictions about B's actions---then we want to say that the system has an incentive to manipulate the world but it doesn't seem like that is easy to incorporate into this kind of formalism).

 

This is what multi-agent incentives are for (i.e. incentive analysis in multi-agent CIDs).  We're still working on these as there are a range of subtleties, but I'm pretty confident we'll have a good account of it.

Comment by tom4everitt on Counterfactual control incentives · 2021-03-25T16:44:58.140Z · LW · GW

Glad she likes the name :) True, I agree there may be some interesting subtleties lurking there. 

(Sorry btw for slow reply; I keep missing alignmentforum notifications.)

Comment by tom4everitt on Counterfactual control incentives · 2021-02-03T11:30:58.358Z · LW · GW

Thanks Stuart and Rebecca for a great critique of one of our favorite CID concepts! :)

We agree that a lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

What control incentives do capture are the instrumental goals of the agent. Controlling X can be a subgoal for achieving utility if and only if the CID admits a control incentive on X. For this reason, we have decided to slightly update the terminology: in the latest version of our paper (accepted to AAAI, just released on arXiv) we prefer the term instrumental control incentive (ICI), to emphasize the distinction from "control as a side effect".

Comment by tom4everitt on (A -> B) -> A in Causal DAGs · 2020-02-07T16:13:35.189Z · LW · GW

Glad you liked it.

Another thing you might find useful is Dennett's discussion of what an agent is (see first few chapters of Bacteria to Bach). Basically, he argues that an agent is something we ascribe beliefs and goals to. If he's right, then an agent should basically always have a utility function.

Your post focuses on the belief part, which is perhaps the more interesting aspect when thinking about strange loops and similar.

Comment by tom4everitt on (A -> B) -> A in Causal DAGs · 2020-01-24T17:26:04.286Z · LW · GW

There is a paper which I believe is trying to do something similar to what you are attempting here:

Networks of Influence Diagrams: A Formalism for Representing Agents’ Beliefs and Decision-Making Processes, Gal and Pfeffer, Journal of Artificial Intelligence Research 33 (2008) 109-147

Are you aware of it? How do you think their ideas relate to yours?

Comment by tom4everitt on Wireheading is in the eye of the beholder · 2020-01-21T17:10:32.828Z · LW · GW

Is this analogous to the stance-dependency of agents and intelligence?

Comment by tom4everitt on Defining AI wireheading · 2020-01-14T17:01:39.819Z · LW · GW

Thanks Stuart, nice post.

I've moved away from the wireheading terminology recently, and instead categorize the problem a little bit differently:

The top-level category is reward hacking / reward corruption, which means that the agent's observed reward differs from true reward/task performance.

Reward hacking has two subtypes, depending on whether the agent exploits a misspecification in the process that computes the rewards, or modifies the process. The first type is reward gaming and the second reward tampering.

Tampering can subsequently be divided into further subcategories. Does the agent tamper with its reward function, its observations, or the preferences of a user giving feedback? Which things the agent might want to tamper with depends on how its observed rewards are computed.

One advantage with this terminology is that it makes it clearer what we're talking about. For example, it's pretty clear what reward function tampering refers to, and how it differs from observation tampering, even without consulting a full definition.

That said, I think your post nicely puts the finger on what we usually mean when we say wireheading, and it is something we have been talking about a fair bit. Translated into my terminology, I think your definition would be something like "wireheading = tampering with goal measurement".

Comment by tom4everitt on Computational Model: Causal Diagrams with Symmetry · 2019-08-23T12:37:54.804Z · LW · GW

Thanks for a nice post about causal diagrams!

Because our universe is causal, any computation performed in our universe must eventually bottom out in a causal DAG.

Totally agree. This is a big part of the reason why I'm excited about these kinds of diagrams.

This raises the issue of abstraction - the core problem of embedded agency. ... how can one causal diagram (possibly with symmetry) represent another in a way which makes counterfactual queries on the map correspond to some kind of counterfactual on the territory?

Great question, I really think someone should look more carefully into this. A few potentially related papers:

https://arxiv.org/abs/1105.0158

https://arxiv.org/abs/1812.03789

In general, though, how to learn causal DAGs with symmetry is still an open question. We’d like something like Solomonoff Induction, but which can account for partial information about the internal structure of the causal DAG, rather than just overall input-output behavior.

Again, agreed. It would be great if we could find a way to make progress on this question.

Comment by tom4everitt on "Designing agent incentives to avoid reward tampering", DeepMind · 2019-08-20T09:52:08.151Z · LW · GW

Actually, I would argue that the model is naturalized in the relevant way.

When studying reward function tampering, for instance, the agent chooses actions from a set of available actions. These actions just affect the state of the environment, and somehow result in reward or not.

As a conceptual tool, we label part of the environment the "reward function", and part of the environment the "proper state". This is just to distinguish effects that we'd like the agent to use from effects that we don't want it to use.

The current-RF solution doesn't rely on this distinction, it only relies on query-access to the reward function (which you could easily give an embedded RL agent).

The neat thing is that when we look at the objective of the current-RF agent using the same conceptual labeling of parts of the state, we see exactly why it works: the causal paths from actions to reward that pass the reward function have been removed.
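
Here's a stripped-down sketch of that point (my own toy example, not the paper's formalism): a naive agent maximises whatever reward the environment emits next, while the current-RF agent evaluates the next state with query-access to its current reward function.

```python
# Two candidate actions in a toy environment (everything here is invented for illustration):
#   "work":   improves the proper state, leaves the reward function alone.
#   "tamper": rewires the reward function so the emitted reward is maximal regardless of state.

def step(state, action):
    state = dict(state)
    if action == "work":
        state["progress"] += 1
    elif action == "tamper":
        state["reward_fn"] = lambda s: 10.0            # hacked reward function
    return state

initial_state = {"progress": 0, "reward_fn": lambda s: float(s["progress"])}
current_rf = initial_state["reward_fn"]                # query-access to the *current* reward function

for action in ("work", "tamper"):
    nxt = step(initial_state, action)
    emitted = nxt["reward_fn"](nxt)                    # what a naive RL agent maximises
    current_rf_value = current_rf(nxt)                 # what the current-RF agent maximises
    print(f"{action:>6}: emitted reward = {emitted}, current-RF value = {current_rf_value}")

# The emitted reward prefers "tamper" (10.0 > 1.0); the current-RF objective prefers "work"
# (1.0 > 0.0). Evaluating plans with the current reward function removes the action-to-reward
# paths that pass through the reward function, so tampering no longer pays.
```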

Comment by tom4everitt on "Designing agent incentives to avoid reward tampering", DeepMind · 2019-08-19T16:29:43.740Z · LW · GW

Thanks for the Dewey reference, we'll add it.

Comment by tom4everitt on "Designing agent incentives to avoid reward tampering", DeepMind · 2019-08-19T16:29:25.378Z · LW · GW

We didn't expect this to be surprising to the LessWrong community. Many RL researchers tend to be surprised, however.

Comment by tom4everitt on "Designing agent incentives to avoid reward tampering", DeepMind · 2019-08-19T16:28:45.757Z · LW · GW

Yes, that is partly what we are trying to do here. By summarizing some of the "folklore" in the community, we'll hopefully be able to get new members up to speed quicker.

Comment by tom4everitt on "Designing agent incentives to avoid reward tampering", DeepMind · 2019-08-19T16:28:16.612Z · LW · GW

Hey Steve,

Thanks for linking to Abram's excellent blog post.

We should have pointed this out in the paper, but there is a simple correspondence between Abram's terminology and ours:

Easy wireheading problem = reward function tampering

Hard wireheading problem = feedback tampering.

Our current-RF optimization corresponds to Abram's observation-utility agent.

We also discuss the RF-input tampering problem and solutions (sometimes called the delusion box problem), which I don't think fits neatly into Abram's distinction.

Comment by tom4everitt on "Designing agent incentives to avoid reward tampering", DeepMind · 2019-08-19T16:27:56.288Z · LW · GW

Hey Charlie,

Thanks for bringing up these points. The intended audience is researchers more familiar with RL than the safety literature. Rather than try to modify the paper to everyone's liking, let me just give a little intro / context for it here.

The paper is the culmination of a few years of work (previously described in e.g. my thesis and alignment paper). One of the main goals has been to understand whether it is possible to redeem RL from a safety viewpoint, or whether some rather different framework would be necessary to build safe AGI.

As a first step along this path, I tried to categorize problems with RL, and see which solutions applied to which categories. For this purpose, I found causal graphs valuable (thesis), and I later realized that causal influence diagrams (CIDs) provided an even better foundation. Any problem corresponds to an 'undesired path' in a CID, and basically all the solutions correspond to ways of getting rid of that path. As highlighted in the introduction of the paper, I now view this insight as one of the most useful ones.

Another important contribution of the paper is pinpointing which solution idea solves which type of reward tampering problem, and a discussion of how the solutions might fit together. I see this as a kind of stepping stone towards more empirical RL work in this area.

Third, the paper puts a fair bit of emphasis on giving brief but precise summaries of previous ideas in the safety literature, and may therefore serve as a kind of literature review. You are absolutely right that solutions to reward function tampering (often more loosely referred to as wireheading) have been around for quite some time. However, the explanations of these methods have been scattered across a number of papers, using a number of different frameworks and formalisms.

Tom

Comment by tom4everitt on Modeling AGI Safety Frameworks with Causal Influence Diagrams · 2019-06-28T12:43:35.558Z · LW · GW

I really like this layout, this idea, and the diagrams. Great work.

Glad to hear it :)

I don't agree that counterfactual oracles fix the incentive. There are black boxes in that proposal, like "how is the automated system not vulnerable to manipulation" and "why do we think the system correctly formally measures the quantity in question?" (see more potential problems). I think relying only on this kind of engineering cleverness is generally dangerous, because it produces safety measures we don't see how to break (and probably not safety measures that don't break).

Yes, the argument is only valid under the assumptions that you mention. Thanks for pointing to the discussion post about the assumptions.

Also, on page 10 you write that during deployment, agents appear as if they are optimizing the training reward function. As evhub et al point out, this isn't usually true: the objective recoverable from perfect IRL on a trained RL agent is often different (behavioral objective != training objective).

Fair point, we should probably weaken this claim somewhat.

Comment by tom4everitt on Modeling AGI Safety Frameworks with Causal Influence Diagrams · 2019-06-28T10:36:23.230Z · LW · GW

Hey Charlie,

Thanks for your comment! Some replies:

sometimes one makes different choices in how to chop an AI's operation up into causally linked boxes, which can lead to an apples-and-oranges problem when comparing diagrams (for example, the diagrams you use for CIRL and IDI are very different choppings-up of the algorithms)

There is definitely a modeling choice involved in choosing how much "to pack" in each node. Indeed, most of the diagrams have been through a few iterations of splitting and combining nodes. The aim has been to focus on the key dynamics of each framework.

As for the CIRL and IDA difference, this is a direct effect of the different levels the frameworks are specified at. CIRL is a high-level framework, roughly saying "somehow you infer the human preferences from their actions". IDA, in contrast, provides a reasonably detailed supervised learning criterion. So I think the frameworks themselves are already like apples and oranges, it's not just the diagrams. (And when drawing the diagrams, this is something you notice.)

But I am skeptical that there's a one-size-fits-all solution, and instead think that diagram usage should be tailored to the particular point it's intended to make.

We don't want to claim the CIDs are the one-and-only diagram to always use, but as you mentioned above, they do allow for quite some flexibility in what aspects to highlight.

I actually have a draft sitting around of how one might represent value learning schemes with a hierarchical diagram of information flow.

Interesting. A while back I was looking at information flow diagrams myself, and was surprised to discover how hard it was to make them formally precise (there seems to be no formal semantics for them). In contrast, causal graphs and CIDs have formal semantics, which is quite useful.

For hierarchical representations, there are networks of influence diagrams https://arxiv.org/abs/1401.3426

Comment by tom4everitt on Risks from Learned Optimization: Introduction · 2019-06-14T10:11:41.250Z · LW · GW

Chapter 4 in Bacteria to Bach is probably most relevant to what we discussed here (with preceding chapters providing a bit of context).

Yes, it would be interesting to see if causal influence diagrams (and the inference of incentives) could be useful here. Maybe there's a way to infer the CID of the mesa-optimizer from the CID of the base-optimizer? I don't have any concrete ideas at the moment -- I can be in touch if I think of something suitable for collaboration!

Comment by tom4everitt on Risks from Learned Optimization: Introduction · 2019-06-09T09:03:37.693Z · LW · GW

What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?

Indeed, this is a super slippery question. And I think this is a good reason to stand on the shoulders of a giant like Dennett. Some of the questions he has been tackling are actually quite similar to yours, around the emergence of agency and the emergence of consciousness.

For example, does it make sense to say that a tree is *trying to* soak up sun, even though it doesn't have any mental representation itself? Many biologists would hesitate to use such language other than metaphorically.

In contrast, Dennett's answer is yes: Basically, it doesn't matter if the computation is done by the tree, or by the evolution that produced the tree. In either case, it is right to think of the tree as an agent. (Same goes for DQN, I'd say.)

There are other situations where the location of the computation matters, such as for consciousness, and for some "self-reflective" skills that may be hard to pre-compute.

Basically, I would recommend looking closer at Dennett to

  • avoid reinventing the wheel (more than necessary), and
  • connect to his terminology (since he's so influential).

He's a very lucid writer, so quite a joy to read him really. His most recent book Bacteria to Bach summarizes and references a lot of his earlier work.

I am just wary of throwing away seemingly relevant assumptions about internal structure before we can show they’re unhelpful.

Yes, starting with more assumptions is often a good strategy, because it makes the questions more concrete. As you say, the results may potentially generalize.

But I am actually unsure that DQN agents should be considered non-optimisers, in the sense that they do perform rudimentary optimisation: they take an argmax of the Q function.

I see, maybe PPO would have been a better example.

Comment by tom4everitt on Risks from Learned Optimization: Introduction · 2019-06-08T17:14:04.711Z · LW · GW

Thanks for the interesting post! I find the possibility of a gap between the base optimization objective and the mesa/behavioral objective convincing, and well worth exploring.

However, I'm less convinced that the distinction between the mesa-objective and the behavioral objective is real/important. You write:

Informally, the behavioral objective is the objective which appears to be optimized by the system’s behavior. More formally, we can operationalize the behavioral objective as the objective recovered from perfect inverse reinforcement learning (IRL).[4] This is in contrast to the mesa-objective, which is the objective actively being used by the mesa-optimizer in its optimization algorithm.

According to Dennett, many systems behave as if they are optimizing some objective. For example, a tree may behave as if it optimizes the amount of sun that it can soak up with its leaves. This is a useful description of the tree, offering real predictive power. Whether there is some actual search process going on in the tree is not that important; the intentional stance is useful in either case.

Similarly, a fully trained DQN algorithm will behave as if it optimizes the score of the game, even though there is no active search process going on at a given time step (especially not if the network parameters are frozen). In neither of these examples is it necessary to distinguish between mesa and behavioral objectives.

At this point, you may object that the mesa objective will be more predictive "off training distribution". Perhaps, but I'm not so sure.

First, the behavioral objective may be predictive "off training distribution": For example, the DQN agent will strive to optimize reward as long as the Q-function generalizes.

Second, the mesa-objective may easily fail to be predictive off distribution. Consider a model-based RL agent with a learned model of the environment, that uses MCTS to predict the return of different policies. The mesa-objective is then the expected return. However, this objective may not be particularly predictive outside the training distribution, because the learned model may only make sense on the distribution.

So the behavioral objective may easily be predictive outside the training distribution, and the mesa-objective may easily fail to be predictive.

While I haven't read the follow-up posts yet, I would guess that most of your further analysis would go through without the distinction between mesa and behavioral objectives. One possible difference is that you may need to be even more paranoid about the emergence of behavioral objectives, since they can emerge even in systems that are not mesa-optimizing.

I would also like to emphasize that I really welcome this type of analysis of the emergence of objectives, not the least because it nicely complements my own research on how incentives emerge from a given objective.

Comment by tom4everitt on Announcement: AI alignment prize round 2 winners and next round · 2018-05-20T07:40:42.167Z · LW · GW

Thank you! Really inspiring to win this prize. As John Maxwell stated in the previous round, the recognition is more important than the money. Very happy to receive further comments and criticism by email tom4everitt@gmail.com. Through debate we grow :)

Comment by tom4everitt on Announcement: AI alignment prize round 2 winners and next round · 2018-05-03T08:01:58.045Z · LW · GW

Thanks for your comments, much appreciated! I'm currently in the middle of moving continents, will update the draft within a few weeks.

Comment by tom4everitt on Smoking Lesion Steelman II · 2018-01-08T09:17:16.000Z · LW · GW

Nice writeup. Is one-boxing in Newcomb an equilibrium?

Comment by tom4everitt on Delegative Inverse Reinforcement Learning · 2017-08-20T09:34:46.000Z · LW · GW

My confusion is the following:

Premises (*) and inferences (=>):

  • The primary way for the agent to avoid traps is to delegate to a soft-maximiser.

  • A soft-maximiser will take any action with boundedly negative utility with positive probability.

  • Actions leading to traps do not have infinitely negative utility.

=> The agent will fall into traps with positive probability.

  • If the agent falls into a trap with positive probability, then it will have linear regret.

=> The agent will have linear regret.

So when you say in the beginning of the post "a Bayesian DIRL agent is guaranteed to attain most of the value", you must mean that in a different sense than a regret sense?