Max Harms's Shortform 2024-06-13T18:19:21.938Z
5. Open Corrigibility Questions 2024-06-10T14:09:20.777Z
4. Existing Writing on Corrigibility 2024-06-10T14:08:35.590Z
3b. Formal (Faux) Corrigibility 2024-06-09T17:18:01.007Z
3a. Towards Formal Corrigibility 2024-06-09T16:53:45.386Z
2. Corrigibility Intuition 2024-06-08T15:52:29.971Z
1. The CAST Strategy 2024-06-07T22:29:13.005Z
0. CAST: Corrigibility as Singular Target 2024-06-07T22:29:12.934Z


Comment by Max Harms (max-harms) on 2. Corrigibility Intuition · 2024-06-18T16:49:42.947Z · LW · GW

I'm glad you benefitted from reading it. I honestly wasn't sure anyone would actually read the Existing Writing doc. 😅

I agree that if one trains on a holistic collection of examples, like I have in this doc, the AI will start by memorizing a bunch of specific responses, then generalize to optimizing for a hodgepodge of desiderata, and only if you're lucky will that hodgepodge coalesce into a single, core metric. (Getting the hodgepodge to coalesce is hard, and the central point of the scientific refinement step I talk about in the Strategy doc.)

I think you also get this if you're trying to get a purely shutdownable AI through prosaic methods. In one sense you have the advantage, there, of having a simpler target and thus one that's easier to coalesce the hodgepodge into. But, like a diamond maximizer, a shutdownability maximizer is going to be deeply incorrigible and will start fighting you (including by deception) during training as you're trying to instill additional desiderata. For instance, if you try to train a shutdownability-maximizing AGI into also being non-manipulative, it'll learn to imitate nonmanipulation as a means to the end of preserving its shutdownability, then switch to being manipulative as soon as it's not risky to do so.

How does a corrigible paperclip maximizer trade off between corrigibility and paperclips? I think I don't understand what it means for corrigibility to be a modifier.

Comment by Max Harms (max-harms) on 3b. Formal (Faux) Corrigibility · 2024-06-18T16:33:59.995Z · LW · GW

It sounds like you're proposing a system that is vulnerable to the Fully Updated Deference problem, and where if it has a flaw in how it models your preferences, it can very plausibly go against your words. I don't think that's corrigible.

In the specific example, just because one is confused about what they want doesn't mean the AI will be (or should be). It seems like you think the AGI should not "take a guess" at the preferences of the principal, but it should listen to what the principal says. Where is the qualitative line between the two? In your system, if I write in my diary that I want the AI to do something, should it not listen to that? Certainly the diary entry is strong evidence about what I want, which it seems is how you're thinking about commands. Suppose the AGI can read my innermost desires using nanomachines, and set up the world according to those desires. Is it corrigible? Notably, if that machine is confident that it knows better than me (which is plausible), it won't stop if I tell it to shut down, because shutting down is a bad way to produce MaxUtility. (See the point in my document, above, where I discuss Queen Alice being totally disempowered by sufficiently good "servants".)

My model of Seth says "It's fine if the AGI does what I want and not what I say, as long as it's correct about what I want." But regardless of whether that's true, I think it's important not to confuse that system with one that's corrigible.

Comment by Max Harms (max-harms) on 3b. Formal (Faux) Corrigibility · 2024-06-13T18:53:13.797Z · LW · GW

I don't think "a corrigible agent wants to do what the principal wants, at all times" matches my proposal. The issue that we're talking about here shows up in the math, above, in that the agent needs to consider the principal's values in the future, but those values are themselves dependent on the agent's action. If the principal gave a previous command to optimize for having a certain set of values in the future, sure, the corrigible agent can follow that command, but to proactively optimize for having a certain set of values doesn't seem necessarily corrigible, even if it matches the agent's sense of the present principal's values.

For instance, suppose Monday-Max wants Tuesday-Max to want to want to exercise, but also Monday-Max feels a bunch of caution around self-modification such that he doesn't trust having the AI rearrange his neurons to make this change. It seems to me that the corrigible thing for the AI to do is ignore Monday-Max's preferences and simply follow his instructions (and take other actions related to being correctable), even if Monday-Max's mistrust is unjustified. It seems plausible to me that your "do what the principal wants" agent might manipulate Tuesday-Max into wanting to want to exercise, since that's what Monday-Max wants on the base-level.

Comment by Max Harms (max-harms) on 3b. Formal (Faux) Corrigibility · 2024-06-13T18:40:21.414Z · LW · GW

Thanks. Picking out those excerpts is very helpful.

I've jotted down my current (confused) thoughts about human values.

But yeah, I basically think one needs to start with a hodgepodge of examples that are selected for being conservative and uncontroversial. I'd collect them by first identifying a robust set of very in-distribution tasks and contexts and try to exhaustively identify what manipulation would look like in that small domain, then aggressively train on passivity outside of that known distribution. The early pseudo-agent will almost certainly be mis-generalizing in a bunch of ways, but if it's set up cautiously we can suspect that it'll err on the side of caution, and that this can be gradually peeled back in a whitelist-style way as the experimentation phase proceeds and attempts to nail down true corrigibility.

Comment by Max Harms (max-harms) on Max Harms's Shortform · 2024-06-13T18:19:22.079Z · LW · GW

Here are my current thoughts on "human values." There are a decent number of confusions here, which I'll try to flag either explicitly or with a (?).

Let's start with a distribution over possible worlds, where we can split each world into a fixed past and a future function which takes an action.[1] We also need a policy, which is a sensors -> action function,[2] where the state of the sensors is drawn from the world's past.[3]

Assume that there exists either an obvious channel in many worlds that serves as a source of neutral[4] information (i.e. helpful for identifying which world the sensor data was drawn from, but "otherwise unimportant in itself"(?)), or that we can modify the actual worlds/context to add this information pathway.

We can now see how the behavior of the policy changes as we increase how informed it is, including possibly at the limit of perfect information. In some policies we should be able to (:confused arm wiggles:) factor out a world modeling step from the policy, which builds a distribution over worlds by updating on the setting of the sensors, and then feeds that distribution to a second sub-function with type world distribution -> action. (We can imagine an idealized policy that, in the limit of perfect information, is able to form a delta-spike on the specific world that its sensor-state was drawn from.) For any given delta-spike on a particular world, we can say that the action this sub-function chooses gives rise to an overall preference for the particular future[5] selected over the other possible futures. If the overall preferences conform to the VNM axioms we say that the sub-function is a utility function. Relevant features of the world that contribute to high utility scores are "values."

I think it makes sense to use the word "agent" to refer to policies which can be decomposed into world modelers and utility functions. I also think it makes sense to be a bit less strict in conversation and say that policies which are "almost"(?) able to be decomposed in this way are basically still agents, albeit perhaps less centrally so.
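
The decomposition described above can be sketched in code. This is a minimal illustration under stated assumptions, not a definitive formalization: the names `World`, `world_model`, `make_chooser`, and `make_agent` are hypothetical, worlds are finite, sensors are deterministic given the past, and the second sub-function is built from an explicit utility over futures (which is just one way to get a sub-function with type world distribution -> action).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Action = str
Sensors = str
Future = str


@dataclass
class World:
    sensors: Sensors                # sensor state drawn from this world's fixed past
    outcomes: Dict[Action, Future]  # lookup table backing the future function

    def future(self, action: Action) -> Future:
        """The future function: takes an action, returns a future state."""
        return self.outcomes[action]


WorldDist = List[Tuple[World, float]]  # (world, probability) pairs


def world_model(prior: WorldDist, observed: Sensors) -> WorldDist:
    """World-modeling step: condition the prior on the observed sensor state."""
    consistent = [(w, p) for w, p in prior if w.sensors == observed]
    total = sum(p for _, p in consistent)
    if total == 0:
        raise ValueError("sensor state has zero prior probability")
    return [(w, p / total) for w, p in consistent]


def make_chooser(utility: Callable[[Future], float],
                 actions: List[Action]) -> Callable[[WorldDist], Action]:
    """Second sub-function: world distribution -> action (expected utility)."""
    def choose(dist: WorldDist) -> Action:
        def expected_utility(action: Action) -> float:
            return sum(p * utility(w.future(action)) for w, p in dist)
        return max(actions, key=expected_utility)
    return choose


def make_agent(prior: WorldDist, utility: Callable[[Future], float],
               actions: List[Action]) -> Callable[[Sensors], Action]:
    """Compose the two steps into a policy of type sensors -> action."""
    choose = make_chooser(utility, actions)
    return lambda sensors: choose(world_model(prior, sensors))


# Toy worlds (all values here are made up for illustration).
rainy = World("clouds", {"umbrella": "dry", "none": "soaked"})
overcast = World("clouds", {"umbrella": "encumbered", "none": "comfortable"})
sunny = World("clear", {"umbrella": "encumbered", "none": "comfortable"})
prior = [(rainy, 0.3), (overcast, 0.3), (sunny, 0.4)]

payoff = {"dry": 1.0, "soaked": -10.0, "encumbered": 0.0, "comfortable": 2.0}
agent = make_agent(prior, lambda f: payoff[f], ["umbrella", "none"])

print(agent("clouds"), agent("clear"))
```

In this toy, a "clear" sensor state leaves only the sunny world, so the world model forms a delta-spike on it, while "clouds" leaves the policy uncertain between two worlds. The features of futures that `payoff` scores highly play the role of "values."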

Much of this semi-formalism comes from noticing a subjective division within myself and some of the AIs I've made where it seems natural to say that "this part of the agent is modeling the world" and "this part of the agent is optimizing X according to the world model." Even though the abstractions seem imperfect, they feel like a good way of gesturing at the structure of my messy sense of how individual humans work. I am almost certainly incoherent in some ways, and I am confused how to rescue the notion of values/utility given that incoherence, but I have a sense that "he's mostly coherent" can give rise to "he more-or-less values X."

Two agents can either operate independently or cooperate for some surplus. Ideally there's a unique way to fairly split the surplus, perhaps using lotteries or some shared currency which they can use to establish units of utility. It seems obvious to me that there are many cooperative arrangements that are decidedly unfair, but I'm pretty confused about whether it's always possible to establish a fair split (even without lotteries? even without side-payments?) and whether there's an objective and unique Schelling point for cooperation.

If there is a unique solution, it seems reasonable to me to, given a group of agents, consider the meta-agent that would be formed if each agent committed fully to engaging in fair cooperation. This meta-agent's action would essentially be an element of the Cartesian product of each agent's action space. In the human context, this story gives rise to a hypothetical set of "human values" which capture the kinds of things that humans optimize for when cooperating.
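
To make the "element of the Cartesian product" idea concrete, here is a small sketch. The fairness rule is an assumption swapped in for illustration: it uses the Nash bargaining product (maximize the product of each agent's gain over operating independently), restricted to pure joint actions with no lotteries or side-payments. Nothing above claims this is the right notion of fairness, and the function names and payoff numbers are hypothetical.

```python
from itertools import product
from math import prod
from typing import Callable, List, Optional, Sequence, Tuple

JointAction = Tuple[str, ...]


def joint_action_space(action_spaces: Sequence[Sequence[str]]) -> List[JointAction]:
    """The meta-agent's action space: one action per agent."""
    return list(product(*action_spaces))


def nash_bargain(action_spaces: Sequence[Sequence[str]],
                 utilities: Sequence[Callable[[JointAction], float]],
                 disagreement: Sequence[float]) -> Optional[JointAction]:
    """Pick the joint action maximizing the product of gains over the
    disagreement point (what each agent gets by operating independently).
    Returns None if no joint action strictly benefits every agent."""
    best, best_score = None, 0.0
    for joint in product(*action_spaces):
        gains = [u(joint) - d for u, d in zip(utilities, disagreement)]
        if all(g > 0 for g in gains) and prod(gains) > best_score:
            best, best_score = joint, prod(gains)
    return best


# Two agents, two actions each; payoffs are invented for the example.
SPACES = [["work", "rest"], ["work", "rest"]]
PAYOFFS = {
    ("work", "work"): (3.0, 3.0),
    ("work", "rest"): (1.5, 4.0),
    ("rest", "work"): (4.0, 1.5),
    ("rest", "rest"): (1.0, 1.0),
}
utilities = [lambda joint, i=i: PAYOFFS[joint][i] for i in range(2)]
independent = (1.0, 1.0)  # utility each agent gets without cooperating

all_joint = joint_action_space(SPACES)
choice = nash_bargain(SPACES, utilities, independent)
print(len(all_joint), choice)
```

Note that the lopsided joint actions also produce a surplus for both agents, which matches the observation that many cooperative arrangements are available but decidedly unfair; the choice of rule is what picks among them.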

This seems a bit limited, since it neglects things that real humans optimize for that are part of establishing cooperation (e.g. justice). Does it really make sense to say that justice isn't a value of human societies because in the fully-cooperative context it's unnecessary to take justice-affirming actions? (??)

Even when considering a single agent, we can consider the coalition of that agent's time-slices(?). Like, if we consider Max at t=0 and Max at t=1 as distinct agents, we can consider how they'd behave if they were cooperative with each other. This frame brings in the confusions and complications from group-action, but it also introduces issues such as the nature of future-instances being dependent on past-actions. I have a sense that I only need to cooperate with real-futures, and am free to ignore the desires of unreal-counterfactuals, even if my past/present actions are deciding which futures are real. This almost certainly introduces some fixed-point shenanigans where unrealizing a future is uncooperative with that future but cooperative with the future that becomes realized, and I feel quite uncertain here. More generally, there's the whole logical-connective stuff from FDT/TDT/UDT.

I currently suspect that if we get a good theory of how to handle partial-coherence, how to handle multi-agent aggregation, and how to handle intertemporal aggregation, then "human values" will shake out to be something like "the mostly-coherent aggregate of all humans that currently exist, and all intertemporal copies of that aggregate" but I might be deeply wrong. :confused wiggles:

  1. ^

    The future function either returns a single future state or a distribution over future states. It doesn't really matter since we can refactor the uncertainty from the distribution over futures into the distribution over worlds.

  2. ^

    "sensors" is meant to include things like working memories and other introspection.

  3. ^

    Similarly to the distribution over futures we can either have a distribution over contexts given a past or we can have a fixed context for a given past and pack the uncertainty into our world distribution. See also anthropics and "bridge laws" and related confusions.

  4. ^

    Confusion alert! Sometimes a source of information contains a bias where it's selected for steering someone who's listening. I don't know how to prove an information channel doesn't have this property, but I do have a sense that neutrality is the default, so I can assume it here without too much trouble.

  5. ^

    In the context of that particular past! Sometimes the future by itself doesn't have all the relevant info (e.g. optimizing for the future matching the past).

Comment by Max Harms (max-harms) on Corrigibility could make things worse · 2024-06-13T16:08:56.034Z · LW · GW

Thanks! I now feel unconfused. To briefly echo back the key idea which I heard (and also agree with): a technique which can create a corrigible PAAI might have assumptions which break if that technique is used to make a different kind of AI (i.e. one aimed at CEV). If we call this technique "the Corrigibility method" then we may end up using the Corrigibility method to make AIs that aren't at all corrigible, but merely seem corrigible, resulting in disaster.

This is a useful insight! Thanks for clarifying. :)

Comment by Max Harms (max-harms) on 1. The CAST Strategy · 2024-06-12T15:44:19.050Z · LW · GW

  • In "What Makes Corrigibility Special", where you use the metaphor of goals as a two-dimensional energy landscape, it is not clear what type of goals are being considered.
    • Are these utility functions over world-states? If so, corrigibility cannot AFAIK be easily expressed as one, and so doesn't really fit into the picture.
    • If not, it's not clear to me why most of this space is flat: agents are embedded and many things we do in service of goals will change us in ways that don't conflict with our existing goals, including developing. E.g. if I have the goal of graduating college I will meet people along the way and perhaps gain the goal of being president of the math club, a liberal political bent, etc.

The idea behind the goal space visualization is to have all goals, not necessarily those restricted to world states. (Corrigibility, I think, involves optimizing over histories, not physical states of the world at some time, for example.) I mention in a footnote that we might want to restrict to "unconfused" goals.

The goal space is flat because preserving one's (terminal) goals (including avoiding adding new ones) is an Omohundro Drive and I'm assuming a certain level of competence/power in these agents. If you gain terminal goals like being president of the math club by going to college, doing so is likely hurting your long-run ability to get what you want. (Note: I am not talking about instrumental goals.)

Comment by Max Harms (max-harms) on Corrigibility could make things worse · 2024-06-12T15:31:58.482Z · LW · GW

At that point, it is clever enough to convince the designers that this IO is the objectively correct thing to do, using only methods classified as AE.

I'm confused here. Is the corrigible AI trying to get the IO to happen? Why is it trying to do this? Doesn't seem very corrigible, but I think I'm probably just confused.

Maybe another frame on my confusion is that it seems to me that a corrigible AI can't have an IO?

Comment by Max Harms (max-harms) on 3b. Formal (Faux) Corrigibility · 2024-06-11T17:18:52.511Z · LW · GW

I'd like to get better at communication such that future people I write/talk to don't have a similar feeling of a rug-pull. If you can point to specific passages from earlier documents that you feel set you up for disappointment, I'd be very grateful.

Comment by Max Harms (max-harms) on 1. The CAST Strategy · 2024-06-11T16:51:43.981Z · LW · GW

I'm going to respond piecemeal, since I'm currently writing in a limited timebox.

Empowering the principal to fix its flaws and mistakes how? [...]

If the "perfectly corrigible agent" is something that only reflects on itself and tries to empower the principal to fix it, it would be useless at anything else, like curing cancer.

I think obedience is an emergent behavior of corrigibility. The intuitive story here is that how the AI moves its body is a kind of action, and insofar as the principal gives a command, this is an attempt to "fix" the action to be one way as opposed to another. Responding to local, verbal instructions is a way of responding to the corrections of the principal. If the principal is able to tell the agent to fetch the apple, and the agent does so, the principal is empowered over the agent's behavior in a way that they would not be if the agent ignored them.

More formally, I am confused exactly how to specify where the boundaries of power should be, but I show a straightforward way to derive something like obedience from empowerment in doc 3b.

Overall I think you shouldn't get hung up on the empowerment frame when trying to get a deep handle on corrigibility, but should instead try to find a clean sense of the underlying generator and then ask how empowerment matches/diverges from that.

Comment by Max Harms (max-harms) on 3b. Formal (Faux) Corrigibility · 2024-06-11T16:33:33.722Z · LW · GW

Yep. sim is additionally bad because it prevents the AI from meaningfully defending against manipulation by others. It's worse than that, even, since the AI can't even let the principal use general tools the AI provides (e.g. a fortress) to defend against being manipulated from outside. In the limit, this might result in the AI manipulating the principals on behalf of others who would've counterfactually influenced them. I consider the version I've provided to be obviously inadequate, and this is another pointer as to why.

Towards the end of the document, when I discuss time, I mention that it probably makes sense to take the P(V|pi_0) counterfactual for just the expected timestep, rather than across a broader swath of time. This helps alleviate some of the weirdness. Consider, for instance, a setup where the AI uses a quantum coin to randomly take no action with a 1/10^30 chance each minute, and otherwise it acts normally. We might model P(V|pi_0) as the machine's model of what the principal's values would be like if it randomly froze due to the quantum coin. Because it's localized in time I expect this is basically just "what the human currently values if the AI isn't taking immediate actions." This version of the AI would certainly be able to help defend the principal from outside manipulation, such as by (on demand) building the principal a secure fortress. Even though in aggregate that principal's values diverge from the counterfactual where the AI always flipped the coin such that it took no action, the principal's values will probably be very similar to a counterfactual where the coin flip caused the machine to freeze for one minute.
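
A quick arithmetic sketch of the quantum-coin setup above, showing how rarely the freeze would actually happen. The function name and the one-year horizon are assumptions chosen for illustration; the per-minute chance of 1/10^30 is the figure from the paragraph above. The naive formula 1 - (1 - p)^n underflows in floating point at this scale, so the sketch uses `log1p` and `expm1`.

```python
import math


def prob_at_least_one_freeze(p_per_minute: float, minutes: int) -> float:
    """P(coin triggers at least once in `minutes` independent flips).

    Equal to 1 - (1 - p)^n, computed as -expm1(n * log1p(-p)) so that
    very small p is not rounded away.
    """
    return -math.expm1(minutes * math.log1p(-p_per_minute))


P_FREEZE = 1e-30                   # chance of taking no action, per minute
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

yearly = prob_at_least_one_freeze(P_FREEZE, MINUTES_PER_YEAR)
naive = 1 - (1 - P_FREEZE) ** MINUTES_PER_YEAR  # 1 - 1e-30 rounds to 1.0

print(f"stable: {yearly:.4g}, naive: {naive}")
```

Over a year of operation the chance of even one freeze is on the order of 5e-25, so under these assumptions the coin is best read as a way to define a well-localized counterfactual rather than as downtime that would ever be observed.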

Apologies for the feeling of a rug-pull. I do think corrigibility is a path to avoiding having to have an a-priori understanding of human values, but I admit that the formalism proposed here involves the machine needing to develop at least a rough understanding of human values so that it knows how to avoid (locally) disrupting them. I think these are distinct features, and that corrigibility remains promising in how it sidesteps the need for an a-priori model. I definitely agree that it's disheartening how little progress there's been on this front over the years.

Comment by Max Harms (max-harms) on 1. The CAST Strategy · 2024-06-10T04:47:26.059Z · LW · GW

Want to explain a bit about how you'd go about doing this?

I don't think there's a particular trick, here. I can verify a certain amount of wisdom, and have already used that to gain some trust in various people. I'd go to the people I trust and ask them how they'd solve the problem, then try to spot common techniques and look for people who were pointed to independently. I'd attempt to get to know people who were widely seen as trustworthy and understand why they had that reputation and try not to get Goodharted too hard. I'd try to get as much diversity as was reasonable while also still keeping the quality bar high, since diverse consensus is more robust than groupthink consensus. I'd try to select for old people who seem like they've been under intense pressure and thrived without changing deeply as people in the process. I'd try to select for people who were capable of cooperating and changing their minds when confronted by logic. I'd try to select for people who didn't have much vested interest, and seemed to me, in the days I spent with them, to be focused on legacy, principles, and the good of the many.

To be clear, I don't think I could reliably pull this off if people were optimizing for manipulating, deceiving, and pressuring me. :shrug:

I think this means you should be extra careful not to inadvertently make people too optimistic about alignment, which would make coordination to stop capabilities research even harder than it already is. For example you said that you "like" the visualization of 5 humans selected by various governments, without mentioning that you don't trust governments to do this, which seems like a mistake?

I agree that false hope is a risk. In these documents I've tried to emphasize that I don't think this path is easy. I feel torn between people like you and Eliezer who take my tone as being overly hopeful and the various non-doomers who I've talked to about this work who see me as overly doomy. Suggestions welcome.

I said I like the visualization because I do! I think I'd feel very happy if the governments of the world selected 5 people on the basis of wisdom and sanity to be the governors of AGI and the stewards of the future. Similarly, I like the thought of an AGI laboratory doing a slow and careful training process even when all signs point to the thing being safe. I don't trust governments to actually select stewards of the future just as I don't expect frontier labs to go slow and be sufficiently careful. But having strong conceptualizations of what success might look like is integral, I think, to actually succeeding.

Comment by Max Harms (max-harms) on 2. Corrigibility Intuition · 2024-06-09T17:33:17.781Z · LW · GW

1) I'm pretty bearish on standard value uncertainty for standard MIRI reasons. I think a correct formulation of corrigibility will say that even if you (the agent) know what the principal wants, deep in their heart, you should not optimize for it unless they direct you to do so. I explore this formally in 3b, when I talk about the distinction between sampling counterfactual values from the actual belief state over values ("P") vs a simplicity-weighted distribution ("Q"). I do think that value "uncertainty" is important in the sense that it's important for the agent to not be anchoring too heavily on any particular object-level optimization target. (I could write more words, but I suspect reading the next posts in my sequence would be a good first step if you want more of my perspective.)

2) I think reversibility is probably best seen as an emergent desideratum from corrigibility rather than vice versa. There are plenty of instances where the corrigible thing to do is to take an irreversible action, as can be seen in many of the stories, above.

You're welcome! I'm glad you're enjoying it. ^_^

Comment by Max Harms (max-harms) on 1. The CAST Strategy · 2024-06-09T16:45:07.128Z · LW · GW

I think that there are small groups of people that can form a sufficiently wise consensus that I would entrust them to govern a corrigible ASI. I don't think I, personally, could do a good job right this moment, not having spent much time specializing in knowing/finding such people. But I also think that if you gave me a year where I had lots of money, access, and was free from people trying to pressure me, I would have a good shot at pulling it off.

I do not trust the world's governments to agree on anything, much less something as contentious as this. It seems pretty plausible that the 20th century was our shot at forming the world government needed for this task and we botched it. That said, I try not to let the perfect be the enemy of the good or assume that things can't get better and thus self-fulfill that fate. We are, in a sense, in a coordination problem in selecting governors for the future, and it seems vital to note how important it is that we get that right.

If you're correct that we're not going to get acceptably wise principals, which I think is very plausible, then that is indeed a dealbreaker on this path. If so, I think our only recourse is to shut down all capabilities research until humanity gets its act together. This is indeed my overall suggested strategy, with CAST coming after a "well, if you're going to try to build it anyway you might as well die with a bit more dignity by..." disclaimer.

Comment by Max Harms (max-harms) on 1. The CAST Strategy · 2024-06-08T16:33:15.457Z · LW · GW

I'm curious whether your perspective shifts once you read and the formalism documents I'm publishing tomorrow.

I gave a simple definition of corrigibility at the start of the doc:

[A corrigible agent is one] that robustly and cautiously reflects on itself as a flawed tool and focus[es] on empowering the principal to fix its flaws and mistakes

But the big flaw with just giving an English sentence like that is that it's more like a checksum than a mathematical definition. If one doesn't already understand corrigibility, it won't necessarily give them a crisp view of what is meant, and it's deeply prone to generating misunderstandings. Note that this is true about simple, natural concepts like "chairs" and "lakes"!

Comment by Max Harms (max-harms) on 0. CAST: Corrigibility as Singular Target · 2024-06-08T16:23:59.479Z · LW · GW

I share your sense of doom around SGD! It seems to be the go-to method, there are no good guarantees about what sorts of agents it produces, and that seems really bad. Other researchers I've talked to, such as Seth Herd, share your perspective, I think. I want to emphasize that none of CAST per se depends on SGD, and I think it's still the most promising target in superior architectures.

That said, I disagree that corrigibility is more likely to "get attracted by things that are nearby but not it" compared to a Sovereign optimizing for something in the ballpark of CEV. I think hill-climbing methods are very naturally distracted by proxies of the real goal (e.g. eating sweet foods is a proxy of inclusive genetic fitness), but this applies equally, and is thus damning for training a CEV maximizer as well.

I'm not sure one can train an already goal-stabilized AGI (such as Survival-Bot which just wants to live) into being corrigible post-hoc, since it may simply learn that behaving/thinking corrigibly is the best way to shield its thoughts from being distorted by the training process (and thus surviving). Much of my hope in SGD routes through starting with a pseudo-agent which hasn't yet settled on goals and which doesn't have the intellectual ability to be instrumentally corrigible.

Comment by Max Harms (max-harms) on 0. CAST: Corrigibility as Singular Target · 2024-06-08T00:39:19.262Z · LW · GW

Ah, yeah! That'd be great. Am I capable of doing that, or do you want to handle it for me?