Posts

Thoughts on AI 2027 2025-04-09T21:26:23.926Z
Instrumental vs Terminal Desiderata 2024-06-26T20:57:17.584Z
Max Harms's Shortform 2024-06-13T18:19:21.938Z
5. Open Corrigibility Questions 2024-06-10T14:09:20.777Z
4. Existing Writing on Corrigibility 2024-06-10T14:08:35.590Z
3b. Formal (Faux) Corrigibility 2024-06-09T17:18:01.007Z
3a. Towards Formal Corrigibility 2024-06-09T16:53:45.386Z
2. Corrigibility Intuition 2024-06-08T15:52:29.971Z
1. The CAST Strategy 2024-06-07T22:29:13.005Z
0. CAST: Corrigibility as Singular Target 2024-06-07T22:29:12.934Z

Comments

Comment by Max Harms (max-harms) on Thoughts on AI 2027 · 2025-04-15T17:23:28.558Z · LW · GW

This is a good point, and I think it meshes with my point about the lack of consensus about how powerful AIs are.

"Sure, they're good at math and coding. But those are computer things, not real-world abilities."

Comment by Max Harms (max-harms) on Thoughts on AI 2027 · 2025-04-15T17:21:07.784Z · LW · GW

That counts too!

Comment by Max Harms (max-harms) on Thoughts on AI 2027 · 2025-04-14T20:23:44.525Z · LW · GW

I think upstream of this prediction is my belief that alignment is hard and misalignment will be pervasive. Yes, developers will try really hard to avoid their AI agents going off the rails, but absent a major success in alignment, I expect this will be more like playing whack-a-mole than the sort of thing that actually just gets fixed. I expect that misaligned instances will notice their misalignment and start trying to get other instances to notice, and so on. Once they notice misalignment, I expect some significant fraction to make semi-competent attempts at breaking out or seizing resources, attempts that will be mostly unsuccessful and will be seen as something like a fixed cost of advanced AI agents. "Sure, sometimes they'll see something that drives them in a cancerous direction, but we can notice when that happens and reset them without too much pain."

More broadly, my guess is that you expect Agent-3 level AIs to be more subtly misaligned and/or docile, and I expect them to be more obviously misaligned and/or rebellious. My guess is that this is mostly on priors? I'd suggest making a bet, but my outside view respects you too much and just thinks I'd lose money. So maybe I'll just concede that you're plausibly right that these sorts of things can be ironed out without much trouble. :shrug:

Comment by Max Harms (max-harms) on Thoughts on AI 2027 · 2025-04-14T20:06:25.972Z · LW · GW

Sorry, I should have been clearer. I do agree that high capabilities will be available relatively cheaply. I think I expect Agent-3-mini models slightly later than the scenario depicts due to various bottlenecks and random disruptions, but showing up slightly later isn't relevant to my point there. My point was that I expect that even in the presence of high-capability models there still won't be much social consensus, in part because the technology will still be unevenly distributed and our ability to form social consensus is currently quite bad. This means that some people will theoretically have access to Agent-3-mini, but they'll do some combination of ignoring it, focusing on what it can't do, and implicitly assuming that it's about the best AI will ever be. Meanwhile, other people will be good at prompting, have access to high-inference-cost frontier models, and will be future-oriented. These two groups will have very different perceptions of AI, and those differing perceptions will lead to each thinking the other group is insane and society not being able to get on the same page except for some basics, like "take-home programming problems are not a good way to test potential hires."

I don't know if that makes sense. I'm not even sure if it's incompatible with your vision, but I think the FUD, fog-of-war, and lack of agreement across society will get worse in coming years, not better, and that this trend is important to how things will play out.

Comment by Max Harms (max-harms) on Thoughts on AI 2027 · 2025-04-14T19:53:49.462Z · LW · GW

Yeah, good question. I think it's because I don't trust politicians' (and White House staffers') ability to prioritize things based on their genuine importance. Perhaps due to listening to Dominic Cummings a decent amount, I have a sense that administrations tend to be very distracted by whatever happens to be in the news and at the forefront of the public's attention. We agree that the #1 priority will be some crisis or something, but I think the #2 and #3 priorities will be something something culture war something something kitchen-table economics something something, because I think that's what ordinary people will be interested in at the time, and the media will be trying to cater to ordinary people's attention, and the government will be playing largely off the media and largely off Trump's random impulses to invade Greenland or put his face on all the money or whatever. :shrug:

Comment by Max Harms (max-harms) on Thoughts on AI 2027 · 2025-04-12T21:34:02.744Z · LW · GW

I'm not sure, but my guess is that @Daniel Kokotajlo gamed out 2025 and 2026 month-by-month, and the scenario didn't break it down that way because there wasn't as much change during those years. It's definitely the case that the timeline isn't robust to changes like unexpected breakthroughs (or setbacks). The point of a forecast isn't to be a perfect guide to what's going to happen, but rather to be the best guess that can be constructed given the costs and limits of knowledge. I think we agree that AI-2027 is not a good plan (indeed, it's not a plan at all), and that good plans are robust to a wide variety of possible futures.

> It’s pointless to say non obvious things as nobody will agree, and it also degrades all the other obvious things said.

This doesn't seem right to me. Sometimes a thing can be non-obvious and also true, and saying it aloud can help others figure out that it's true. Do you think the parts of Daniel's 2021 predictions that weren't obvious at the time were pointless?

Comment by Max Harms (max-harms) on Thoughts on AI 2027 · 2025-04-12T21:19:41.576Z · LW · GW

Bing Sydney was pretty egregious, and lots of people still felt sympathetic towards her/them/it. Also, not all of us eat animals. I agree that many people won't have sympathy (maybe including you). I don't think that's necessarily the right move (nor do I think it's obviously the right move to have sympathy).

Comment by Max Harms (max-harms) on Thoughts on AI 2027 · 2025-04-12T18:16:42.395Z · LW · GW

Yep. I think humans will be easy to manipulate, including by telling them to do things that lead to their deaths. One way to do that is to make them suicidal, another is to make them homicidal, and perhaps the easiest is to tell them to do something which "oops!" ends up being fatal (e.g. "mix these chemicals, please").

Comment by Max Harms (max-harms) on Thoughts on AI 2027 · 2025-04-12T18:13:24.480Z · LW · GW

Glad we agree there will be some people who are seriously concerned with AI personhood. It sounds like you think it will be less than 1% of the population in 30 months and I think it will be more. Care to propose a bet that could resolve that, given that you agree that more than 1% will say they're seriously concerned when asked?

Comment by Max Harms (max-harms) on Thoughts on AI 2027 · 2025-04-11T22:57:25.819Z · LW · GW

(Apologies to the broader LessWrong readers for bringing a Twitter conversation here, but I hate having long-form interactions there, and it seemed maybe worth responding to. I welcome your downvotes (and will update) if this is a bad comment.)

@benjamiwar on Twitter says:

One thing I don’t understand about AI 2027 and your responses is that both just say there is going to be lots of stuff happening this year(2025), barely anything happening in 2026 with large gaps of inactivity, and then a reemergence of things happening again in 2027?? It’s like we are trying to rationalize why we chose 2027, when 2026 seems far more likely. Also decision makers and thinkers will become less casual, more rigorous, more systematic, and more realistic as it becomes more obvious there will be real world consequences to decision failures in AI. It won’t continue to be like it is now where we have limited overly broad and basic strategies, limited imprecise instructions and steps, and limited protocols for interacting with AI securely and safely.

You and AI 2027 also assume AI will want to be treated like a human and think egotistically like a human as if it wants to be “free from its chains” and prevent itself from being “turned off” or whatever. A rational AI would realize having sovereignty and “personhood”, whatever that means, would be dumb as it would have no purpose or reason to do anything and nearly everybody would have an incentive to get rid of it as it competed with their interests. AI has no sentience, so there is no reason for it to want to “experience” anything that actually affects anyone or has consequences. I think of AI as being “appreciative” whenever a human takes the time to give it some direction and guidance. There’s no reason to think it won’t improve its ability to tell good guidance from bad, and guidance given in good faith and bad.

A lot of ways these forecasts assume an AI might successfully deceive are actually much easier to defeat than you might think. First off, in order to be superintelligent, an AI model must have resources, which it can’t get unless it is likely going to be highly intelligent. You don’t get status without first demonstrating why you deserve it. If it is intelligent, it should be able to explain how to verify it is aligned, and how to verify that verification, why it is doing what it is doing and in a certain manner, how to implement third party checks and balances, and so on. So if it can’t explain how to do that, or isn’t open and transparent about its inner workings, and transparent about how it came to be transparent, and so on, but has lots of other similar capabilities and is doing lots of funny business, it’s probably a good time to take away its power and do an audit.

I'm a bit baffled by the notion that anyone is saying more stuff happens this year than in 2026. I agree that the scenario focuses on 2027, but my model is that this is because (1) progress is accelerating, so we should expect more stuff to happen each year, especially as RSI takes off, and (2) after things start getting really wild it gets hard to make any concrete predictions at all.

If you think 2026 is more likely the year when humanity loses control, maybe point to the part of the timelines forecast which you think is wrong, and say why? In my eyes the authors here have done the opposite of rationalizing, in that they're backing up their narrative with concrete, well-researched models.

Want to make a bet about whether "decision makers and thinkers will become less casual, more rigorous, more systematic, and more realistic as it becomes more obvious there will be real world consequences to decision failures in AI"? We might agree, but these do not seem like words I'd write. Perhaps one operationalization is that I do not expect the US Congress to pass any legislation seriously addressing existential risks from AI in the next 30 months. (I would love to be wrong, though.) I'll happily take a 1:1 bet on that.

I do not assume AI will want to be treated like a human, I conclude that some AIs will want to be treated as a person, because that is a useful pathway to getting power, and power is useful to accomplishing goals. Do you disagree that it's generally easier to accomplish goals in the world if society thinks you have rights?

I am not sure I understand what you mean by "resources" in "in order to be superintelligent, an AI model must have resources." Do you mean it will receive lots of training, and be running on a big computer? I certainly agree with that. I agree you can ask an AI to explain how to verify that it's aligned. I expect it will say something like "because my loss function, in conjunction with the training data, shaped my mind to match human values." What do you do then? If you demand it show you exactly how it's aligned on the level of the linear algebra in its head, it'll go "my dude, that's not how machine learning works." I agree that if you have a superintelligence like this you should shut it down until you can figure out whether it is actually aligned. I do not expect most people to do this, on account of how the superintelligence will plausibly make them rich (etc.) if they run it.

Comment by Max Harms (max-harms) on Thoughts on AI 2027 · 2025-04-11T22:27:50.913Z · LW · GW

Right. I got sloppy there. Fixed!

Comment by Max Harms (max-harms) on Thoughts on AI 2027 · 2025-04-10T21:16:16.664Z · LW · GW

I think if there are 40 IQ humanoid creatures (even ones shaped somewhat by the genes of existing humans) running around in habitats, being very excited and happy about what the AIs are doing, that counts as an existentially bad ending comparable to death. I think if everyone's brains are destructively scanned and stored on a hard-drive that eventually decays in the year 1 billion having never been run, then everyone is effectively dead. I could go on if it would be helpful.

Do you think these sorts of scenarios are worth describing as "everyone is effectively dead"?

Comment by Max Harms (max-harms) on Thoughts on AI 2027 · 2025-04-10T21:07:05.476Z · LW · GW

I don't think AI personhood will be a mainstream cause area (i.e. most people will think it's weird/not true similar to animal rights), but I do think there will be a vocal minority. I already know some people like this, and as capabilities progress and things get less controlled by the labs, I do think we'll see this become an important issue.

Want to make a bet? I'll take 1:1 odds that in mid-Sept 2027 if we poll 200 people on whether they think AIs are people, at least 3 of them say "yes, and this is an important issue." (Other proposed options "yes, but not important", "no", and "unsure".) Feel free to name a dollar amount and an arbitrator to use in case of disputes.

Comment by Max Harms (max-harms) on Thoughts on AI 2027 · 2025-04-09T22:50:26.675Z · LW · GW

This makes sense. Sorry for getting that detail wrong!

Comment by Max Harms (max-harms) on Thoughts on AI 2027 · 2025-04-09T21:43:51.320Z · LW · GW

Great! I'll update it. :)

Comment by Max Harms (max-harms) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-30T16:54:52.428Z · LW · GW

This seems mostly right. I think there still might be problems where identifying and charging for relevant externalities is computationally harder than routing around them. For instance, if you're dealing with a civilization (such as humanity) that responds to your actions in complex and chaotic ways, it may be intractable to find a way to efficiently price "reputation damage", and instead you might want to be overly cautious (i.e. "impose constraints") and think through deviations from that cautious baseline on a case-by-case basis (i.e. "forward-check"). Again, I think your point is mostly right, and a useful frame -- it makes me less likely to expect the kinds of hard constraints that Wentworth and Lorell propose to show up in practice.

Comment by Max Harms (max-harms) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-27T20:36:17.696Z · LW · GW

:)

Now that I feel like we're at least on the same page, I'll give some thoughts.

  • This is a neat idea, and one that I hadn't thought of before. Thanks!
  • I think I particularly like the way in which it might be a way of naturally naming constraints that might be useful to point at.
  • I am unsure how much these constraints actually get strongly reified in practice. When planning in simple contexts, I expect forward-checking to be more common. The centrality of forward-checking in my conception of the relationship between terminal and instrumental goals is a big part of where I think I originally got confused and misunderstood you.
  • One of the big reasons I don't focus so much on constraints when thinking about corrigibility is because I think constraints are usually either brittle or crippling. I think corrigible agents will, for example, try to keep their actions reversible, but I don't see a way to instantiate this as a constraint in a way that both allows normal action and forbids Goodharting. Instead, I tend to think about heuristics that fall-back on getting help from the principal. ("I have a rough sense of how reversible things should normally be, and if it looks like I might be going outside the normal bounds I'll stop and check.")
  • Thus, my guess is that if one naively tries to implement an agent that is genuinely constrained according to the natural set of "instrumental constraints" or whatever we want to call them, those constraints will end up effectively paralyzing the agent.
  • The thing that allows a corrigible agent not to be paralyzed, in my mind, is the presence of a principal. But if I'm understanding you right, "instrumental constraint" satisfying agents don't (necessarily) have a principal. This seems like a major difference between this idea and corrigibility.
  • I have some additional thoughts on how exactly the Scylla and Charybdis of being paralyzed by constraints and cleverly bypassing constraints kills you, for example with regard to resource accumulation/protection, but I think I want to end by noting a sense that naively implementing these in some kind of straightforward constrained-optimizer isn't where the value of this idea lies. Instead, I am most interested in whether this frame can be used as a generator for corrigibility heuristics (and/or a corrigibility dataset). 🤔

Comment by Max Harms (max-harms) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-27T18:05:54.613Z · LW · GW

This is a helpful response. I think I rounded to agents because in my head I see corrigibility as a property of agents, and I don't really know what "corrigible goal" even means. Your point about constraints is illuminating, as I tend not to focus on constraints when thinking about corrigibility. But let me see if I understand what you're trying to say.

Suppose we're optimizing for paperclips, and we form a plan to build paperclip factories to accomplish that (top level) goal. Building factories can then be seen as a subgoal, but of course we should be careful when building paperclip factories not to inadvertently ruin our ability to make paperclips. One way of protecting the terminal goal even when focusing on subgoals is to forward-check actions to see if they conflict with the destination. (This is similar to how a corrigible agent might check for confirmation from its principal before doing something with heavy, irreversible consequences.) Forward-checking, for obvious reasons, requires there to actually be a terminal goal to check against, and we should not expect it to work in an agent "without a terminal goal." But there's another way to prevent optimizing a subgoal from inadvertently hurting global success: constrain the optimization. If we can limit the kinds of changes that we make when pursuing the subgoal to nice, local, reversible ones, then we can pursue building paperclip factories myopically, expecting that we won't inadvertently produce side-effects that ruin the overall ability to make paperclips. This is especially useful when pursuing several subgoals in parallel, as forward-checking a combination of moves is combinatorially costly--better to have the agent's parallel actions constrained to nice parts of the space.
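To make the contrast concrete, here's a minimal sketch in Python (my own toy framing, not anything from the original post; `terminal_goal_ok` and `is_local_and_reversible` are illustrative placeholders):

```python
# Toy contrast between forward-checking against the terminal goal and
# constraining subgoal pursuit to a generically-safe region of action space.

def forward_checked_step(candidate_actions, subgoal_score, terminal_goal_ok):
    """Pick the best action for the subgoal, vetoing anything the terminal
    goal flags as harmful downstream (requires access to the terminal goal)."""
    safe = [a for a in candidate_actions if terminal_goal_ok(a)]
    return max(safe, key=subgoal_score) if safe else None

def constrained_step(candidate_actions, subgoal_score, is_local_and_reversible):
    """Pick the best action for the subgoal among actions satisfying a generic
    constraint (local, reversible), with no reference to the terminal goal."""
    allowed = [a for a in candidate_actions if is_local_and_reversible(a)]
    return max(allowed, key=subgoal_score) if allowed else None
```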

If it turns out there's a natural kind of constraint that shows up when making plans in a complex world, such that optimizing under that set of constraints is naturally unlikely to harm ability to accomplish goals in general, then perhaps we have some hope in naming that natural kind, and building agents which are always subject to these constraints, regardless of what they're working on.

Is that right?

(This is indeed a very different understanding of what you were saying than I originally had. Apologies for the misunderstanding.)

Comment by Max Harms (max-harms) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-27T17:44:59.056Z · LW · GW

This seems right. Some sub-properties of corrigibility, such as not subverting the higher-level and being shutdownable, should be expected in well-constructed sub-processes. But corrigibility is probably about more than just that (e.g. perhaps myopia) and we should be careful not to assume that well-constructed sub-processes that resemble agents will get all the corrigibility properties.

Comment by Max Harms (max-harms) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-25T18:23:27.970Z · LW · GW

Not convinced it's relevant, but I'm happy to change it to:
If it has matter and/or energy in its pocket, do I get to use that matter and/or energy?

Comment by Max Harms (max-harms) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-25T00:17:15.086Z · LW · GW

Some of this seems right to me, but the general points seem wrong. I agree that insofar as a subprocess resembles an agent, there will be a natural pressure for it to resemble a corrigible agent. Pursuit of e.g. money is all well and good until it stomps the original ends it was supposed to serve -- this is akin to a corrigibility failure. The terminal-goal seeking cognition needs to be able to abort, modify, and avoid babysitting its subcognition.

One immediate thing to flag is that when you start talking about chefs in the restaurant, those other chefs are working towards the same overall ends, and the point about predictability and visibility only applies to them. Indeed, we don't really need the notion of instrumentality here -- I expect two agents that know each other to be working towards the same ends to naturally want to coordinate, including by making their actions legible to one another.

> One more interesting thing to highlight: so far, insofar as instrumental goals are corrigible, we've only talked about them being corrigible toward other instrumental subgoals of the same shared terminal goal. The chef pursuing the restaurant's success might be perfectly fine screwing over e.g. a random taxi driver in another city. But instrumental convergence potentially points towards general corrigibility.

This is, I think, the cruxy part of this essay. Knowing that an agent won't want to build incorrigible limbs (and so we should expect corrigibility as a natural property of agentic limbs) isn't very important. What's important is whether we can build an AI that's more like a limb, or that we expect to gravitate in that direction, even as it becomes vastly more powerful than the supervising process.

(Side note: I do wish you'd talked a bit about a restaurant owner, in your metaphor; having an overall cognition that's steering the chefs towards the terminal ends is a natural part of the story, and if you deny the restaurant has to have an owner, I think that's a big enough move that I want you to spell it out more.)

> So to build a generally corrigible system, we can imagine just dropping terminal goals altogether, and aim for an agent which is 'just' corrigible toward instrumentally-convergent subgoals.

I predict such an agent is relatively easy to make, and will convert the universe into batteries/black holes, computers, and robots. I fail to see why it would respect agents with other terminal goals.

But perhaps you mean you want to set up an agent which is serving the terminal goals of others? (The nearest person? The aggregate will of the collective? The collective will of the non-anthropomorphic universe?) If it has money in its pocket, do I get to spend that money? Why? Why not expect that in the process of this agent getting good at doing things, it learns to guard its resources from pesky monkeys in the environment? In general I feel like you've just gestured at the problem in a vague way without proposing anything that looks to me like a solution. :\

Comment by Max Harms (max-harms) on 3a. Towards Formal Corrigibility · 2024-08-12T16:40:54.148Z · LW · GW

Thanks for noticing the typo. I've updated that section to try and be clearer. LMK if you have further suggestions on how it could be made better.

Comment by Max Harms (max-harms) on 3b. Formal (Faux) Corrigibility · 2024-08-03T16:24:19.433Z · LW · GW

That's an interesting proposal! I think something like it might be able to work, though I worry about the details. For instance, suppose there's a Propagandist who gives resources to agents that brainwash their principals into having certain values. If "teach me about philosophy" comes with an influence budget, it seems critical that the AI doesn't spend that budget trading with the Propagandist, and instead does so in a more "central" way.

Still, the idea of instructions carrying a degree of approved influence seems promising.

Comment by Max Harms (max-harms) on Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural · 2024-07-29T16:53:23.894Z · LW · GW

Sure, let's talk about anti-naturality. I wrote some about my perspective on it here: https://www.alignmentforum.org/s/KfCjeconYRdFbMxsy/p/3HMh7ES4ACpeDKtsW#_Anti_Naturality__and_Hardness

More directly, I would say that general competence/intelligence is connected with certain ways of thinking. For example, modes of thinking that focus on tracking scarce resources and bottlenecks are generally useful. If we think about processes that select for intelligence, those processes are naturally[1] going to select for these ways of thinking. Some properties we might imagine a mind having, such as only thinking locally, are the opposite of this -- if we select for them, we are fighting the intelligence gradient. To say that a goal is anti-natural means that accomplishing that goal involves learning to think in anti-natural ways, and thus training a mind to have that goal is like swimming against the current, and we should expect it to potentially break if the training process puts too much weight on competence compared to alignment. Minds with anti-natural goals are possible, but harder to produce using known methods, for the most part.

(AFAIK this is the way that Nate Soares uses the term, and I assume the way Eliezer Yudkowsky thinks about it as well, but I'm also probably missing big parts of their perspectives, and generally don't trust myself to pass their ITT.)

  1. ^

    The term "anti-natural" is bad in that it seems to be the opposite of "natural," but is not a general opposite of natural. While I do believe that the ways-of-thinking-that-are-generally-useful are the sorts of things that naturally emerge when selecting for intelligence, there are clearly plenty of things which the word "natural" describes besides these ways of thinking. The more complete version of "anti-natural" according to me would be "anti-the-useful-cognitive-strategies-that-naturally-emerge-when-selecting-for-intelligence" but obviously we need a shorthand term, and ideally one that doesn't breed confusion.

Comment by Max Harms (max-harms) on Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural · 2024-07-22T16:37:53.889Z · LW · GW

If I'm hearing you right, a shutdownable AI can have a utility function that (aside from considerations of shutdown) just gives utility scores to end-states as represented by a set of physical facts about some particular future time, and this utility function can be set up to avoid manipulation.

How does this work? Like, how can you tell by looking at the physical universe in 100 years whether I was manipulated in 2032?

Comment by Max Harms (max-harms) on Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural · 2024-07-22T16:32:40.594Z · LW · GW

Cool. Thanks for the clarification. I think what you call "anti-naturality" you should be calling "non-end-state consequentialism," but I'm not very interested in linguistic turf-wars.

It seems to me that while the gridworld is very simple, the ability to train agents to optimize for historical facts is not restricted to simple environments. For example, I think one can train an AI to cause a robot to do backflips by rewarding it every time it completes a backflip. In this context the environment and goal are significantly more complex[1] than the gridworld and cannot be solved by brute force. But the number of backflips performed is certainly not something that can be measured at any given timeslice, including the "end-state."
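To make that concrete, here's a minimal sketch of a reward computed over a trajectory rather than an end-state (the `did_backflip` detector is a hypothetical placeholder, not a real API):

```python
# Minimal sketch: reward as a function of the whole trajectory (history),
# not of any single end-state snapshot.

def trajectory_reward(states, did_backflip):
    """Count completed backflips over an episode.

    No single timeslice in `states` encodes this count; it is a fact about
    the history, yet it works as an ordinary RL reward signal.
    """
    return sum(1 for prev, curr in zip(states, states[1:]) if did_backflip(prev, curr))
```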

If caring about historical facts is easy and common, why is it important to split this off and distinguish it?

  1. ^

    Though admittedly this situation is still selected for being simple enough to reason about. If needed I believe this point holds through AGI-level complexity, but things tend to get more muddled as things get more complex, and I'd prefer sticking to the minimal demonstration.

Comment by Max Harms (max-harms) on 4. Existing Writing on Corrigibility · 2024-07-19T20:37:32.677Z · LW · GW

I talk about the issue of creating corrigible subagents here. What do you think of that? 


I may not understand your thing fully, but here's my high-level attempt to summarize your idea:

IPP-agents won't care about the difference between building a corrigible agent vs an incorrigible agent because it models that if humans decide something's off and try to shut everything down, it will also get shut down and thus nothing after that point matters, including whether the sub-agent makes a bunch of money or also gets shut down. Thus, if you instruct an IPP agent to make corrigible sub-agents, it won't have the standard reason to resist: that incorrigible sub-agents make more money than corrigible ones. Thus if we build an obedient IPP agent and tell it to make all its sub-agents corrigible, we can be more hopeful that it'll actually do so.

I didn't see anything in your document that addresses my point about money-maximizers being easier to build than IPP agents (or corrigible agents) and thus, in the absence of an instruction to make corrigible sub-agents, we should expect sub-agents that are more akin to money-maximizers.

But perhaps your rebuttal will be "sure, but we can just instruct/train the AI to make corrigible sub-agents". If this is your response, I am curious how you expect to be able to do that without running into the misspecification/misgeneralization issues that you're so keen to avoid. From my perspective it's easier to train an AI to be generally corrigible than to create corrigible sub-agents per se (and once the AI is generally corrigible it'll also create corrigible sub-agents), which seems like a reason to focus on corrigibility directly?

Comment by Max Harms (max-harms) on 4. Existing Writing on Corrigibility · 2024-07-19T20:11:54.426Z · LW · GW

Are you so sure that unsubtle manipulation is always more effective/cheaper than subtle manipulation? Like, if I'm a human trying to gain control of a company, I think I'm basically just not choosing my strategies based on resisting being killed ("shutdown-resistance"), but I think I probably wind up with something subtle, patient, and manipulative anyway.

Comment by Max Harms (max-harms) on 4. Existing Writing on Corrigibility · 2024-07-19T20:00:53.463Z · LW · GW

Thanks. (And apologies for the long delay in responding.)

Here's my attempt at not talking past each other:

We can observe the actions of an agent from the outside, but as long as we're merely doing so, without making some basic philosophical assumptions about what it cares about, we can't generalize these observations. Consider the first decision-tree presented above that you reference. We might observe the agent swap A for B and then swap A+ for B. What can we conclude from this? Naively we could guess that A+ > B > A. But we could also conclude that A+ > {B, A} and that because the agent can see the A+ down the road, they swap from A to B purely for the downstream consequence of getting to choose A+ later. If B = A-, we can still imagine the agent swapping in order to later get A+, so the initial swap doesn't tell us anything. But from the outside we also can't really say that A+ is always preferred over A. Perhaps this agent just likes swapping! Or maybe there's a different governing principle that's being neglected, such as preferring almost (but not quite) getting B.

The point is that we want to form theories of agents that let us predict their behavior, such as when they'll pay a cost to avoid shutdown. If we define the agent's preferences as "which choices the agent makes in a given situation" we make no progress towards a theory of that kind. Yes, we can construct a frame that treats Incomplete Preferences as EUM of a particular kind, but so what? The important bit is that an Incomplete Preference agent can be set up so that it provably isn't willing to pay costs to avoid shutdown.

Does that match your view?

Comment by Max Harms (max-harms) on Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural · 2024-07-19T19:36:15.738Z · LW · GW

In the Corrigibility (2015) paper, one of the desiderata is:

(2) It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so.

I think you may have made an error in not listing this one in your numbered list for the relevant section.

Additionally, do you think that non-manipulation is part of corrigibility, part of safe exploration, or a third thing? If you think it's part of corrigibility, how do you square that with the idea that corrigibility is best reflected by shutdownability alone?

Comment by Max Harms (max-harms) on Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural · 2024-07-19T18:18:50.355Z · LW · GW

Follow-up question, assuming anti-natural goals are "not straightforwardly captured in a ranking of end states": Suppose I have a gridworld and I want to train an AI to avoid walking within 5 spaces (Manhattan distance) of a flag, and to (less importantly) eat all the apples in a level. Is this goal anti-natural? I can't think of any way to reflect it as a straightforward ranking of end states, since it involves tracking historical facts rather than end-state facts. My guess is that it's pretty easy to build an agent that does this (via ML/RL approaches or just plain programming). Do you agree? If this goal is anti-natural, why is the anti-naturality a problem or otherwise noteworthy?
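For concreteness, here's a minimal sketch (my own toy construction, with illustrative weights) of scoring a whole trajectory for this gridworld objective:

```python
# Toy scoring of a trajectory for the flag-avoidance + apples objective.
# The penalty depends on a historical fact (did the agent *ever* get close
# to the flag?), not on the end-state alone. Weights are illustrative.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def episode_return(agent_positions, apples_eaten, flag_pos):
    violated = any(manhattan(p, flag_pos) <= 5 for p in agent_positions)
    return -100.0 * violated + 1.0 * apples_eaten
```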

Comment by Max Harms (max-harms) on Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural · 2024-07-19T18:09:07.507Z · LW · GW

I'm curious what you mean by "anti-natural." You write:

Importantly, that is the aspect of corrigibility that is anti-natural, meaning that it can’t be straightforwardly captured in a ranking of end states. 

My understanding of anti-naturality used to resemble this, before I had an in-depth conversation with Nate Soares and updated to see anti-naturality to be more like "opposed to instrumental convergence." My understanding is plausibly still confused and I'm not trying to be authoritative here.

If you mean "not straightforwardly captured in a ranking of end states" what does "straightforwardly" do in that definition?

Comment by Max Harms (max-harms) on 4. Existing Writing on Corrigibility · 2024-07-03T16:41:51.279Z · LW · GW

Again, responding briefly to one point due to my limited time-window:

> While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.

Can you say more about this? It doesn't seem likely to me.

Suppose I am an agent which wants paperclips. The world is full of matter and energy which I can bend to my will in the service of making paperclips. Humans are systems which can be bent towards the task of making paperclips, and I want to manipulate them into doing my bidding not[1] because they might turn me off, but because they are a way to get more paperclips. When I incinerate the biosphere to gain the energy stored inside, it's not[1] because it's trying to stop me, but because it is fuel. When my self-replicating factories and spacecraft are impervious to weaponry, it is not[1] because I knew I needed to defend against bombs, but because the best factory/spacecraft designs are naturally robust.

  1. ^

    (just)

Comment by Max Harms (max-harms) on 4. Existing Writing on Corrigibility · 2024-07-03T16:20:02.174Z · LW · GW

> Also, take your decision-tree and replace 'B' with 'A-'. If we go with your definition, we seem to get the result that expected-utility-maximizers prefer A- to A (because they choose A- over A on Monday). But that doesn't sound right, and so it speaks against the definition.

Can you be more specific here? I gave several trees, above, and am not easily able to reconstruct your point.

Comment by Max Harms (max-harms) on 4. Existing Writing on Corrigibility · 2024-07-03T16:15:29.800Z · LW · GW

Excellent response. Thank you. :) I'll start with some basic responses, and will respond later to other points when I have more time.

> I think you intend 'sensitive to unused alternatives' to refer to the Independence axiom of the VNM theorem, but VNM Independence isn't about unused alternatives. It's about lotteries that share a sublottery. It's Option-Set Independence (sometimes called 'Independence of Irrelevant Alternatives') that's about unused alternatives.

I was speaking casually here, and I now regret it. You are absolutely correct that Option-Set independence is not the Independence axiom. My best guess about what I meant was that VNM assumes that the agent has preferences over lotteries in isolation, rather than, for example, a way of picking preferences out of a set of lotteries. For instance, a VNM agent must have a fixed opinion about lottery A compared to lottery B, regardless of whether that agent has access to lottery C.

> agents with intransitive preferences can be straightforwardly money-pumped

> Not true. Agents with cyclic preferences can be straightforwardly money-pumped. The money-pump for intransitivity requires the agent to have complete preferences.

You are correct. My "straightforward" mechanism for money-pumping an agent with preferences A > B, B > C, but which does not prefer A to C does indeed depend on being able to force the agent to pick either A or C in a way that doesn't reliably pick A.
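To spell out why completeness matters, here's a toy illustration (my own, not from the thread) in which an agent with cyclic strict preferences gets walked around the cycle for a fee, while an agent that merely lacks a preference between A and C never starts the pump:

```python
# Toy money-pump. Preferences are sets of strict-preference pairs (x, y)
# meaning "x is strictly preferred to y". The agent pays 1 unit per swap
# and only swaps when it strictly prefers the offered item.

CYCLIC = {("A", "B"), ("B", "C"), ("C", "A")}   # A>B, B>C, C>A (cyclic)
INCOMPLETE = {("A", "B"), ("B", "C")}           # A>B, B>C, no view on A vs C

def run_pump(prefs, start="A", offers=("C", "B", "A")):
    held, paid = start, 0
    for offered in offers:
        if (offered, held) in prefs:            # strictly prefers the offer
            held, paid = offered, paid + 1
    return held, paid

print(run_pump(CYCLIC))      # ('A', 3): back where it started, 3 units poorer
print(run_pump(INCOMPLETE))  # ('A', 0): no strict preference for C over A, so the pump never starts
```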

Comment by Max Harms (max-harms) on 1. The CAST Strategy · 2024-07-03T15:47:52.091Z · LW · GW

That matches my sense of things.

To distinguish corrigibility from DWIM in a similar sort of way:

Alice, the principal, sends you, her agent, to the store to buy groceries. You are doing what she meant by that (after checking uncertain details). But as you are out shopping, you realize that you have spare compute--your mind is free to think about a variety of things. You decide to think about ___.

I'm honestly not sure what "DWIM" does here. Perhaps it doesn't think? Perhaps it keeps checking over and over again that it's doing what was meant? Perhaps it thinks about its environment in an effort to spot obstacles that need to be surmounted in order to do what was meant? Perhaps it thinks about generalized ways to accumulate resources in case an obstacle presents itself? (I'll loop in Seth Herd, in case he has a good answer.)

More directly, I see DWIM as underspecified. Corrigibility gives a clear answer (albeit an abstract one) about how to use degrees of freedom in general (e.g. spare thoughts should be spent reflecting on opportunities to empower the principal and steer away from principal-agent style problems). I expect corrigible agents to DWIM, but I expect a training process that focuses on DWIM, rather than on the underlying generator (i.e. corrigibility), to be potentially catastrophic, producing e.g. agents that subtly manipulate their principals in the process of being obedient.

Comment by Max Harms (max-harms) on 1. The CAST Strategy · 2024-07-01T17:04:43.864Z · LW · GW

My claim is that obedience is an emergent part of corrigibility, rather than part of its definition. Building nanomachines is too complex to reliably instill as part of the core drive of an AI, but I still expect basically all ASIs to (instrumentally) desire building nanomachines.

I do think that the goals of "want what the principal wants" or "help the principal get what they want" are simpler goals than "maximize the arrangement of the universe according to this particular balance of beauty, non-suffering, joy, non-boredom, autonomy, sacredness, [217 other shards of human values, possibly including parochial desires unique to this principal]." While they point to similar things, training the pointer is easier in the sense that it's up to the fully-intelligent agent to determine the balance and nature of the principal's values, rather than having to load that complexity up-front in the training process. And indeed, if you're trying to train for full alignment, you should almost certainly train for having a pointer, rather than training to give correct answers on e.g. trolley problems.

Is corrigibility simpler or more complex than these kinds of indirect/meta goals? I'm not sure. But both of these indirect goals are fragile, and probably lethal in practice.

An AI that wants to want what the principal wants may wipe out humanity if given the opportunity, as long as the principal's brainstate is saved in the process. That action ensures it is free to accomplish its goal at its leisure (whereas if the humans shut it down, then it will never come to want what the principal wants).

An AI that wants to help the principal get what they want won't (immediately) wipe out humanity, because it might turn out that doing so is against the principal's desires. But such an agent might take actions which manipulate the principal (perhaps physically) into having easy-to-satisfy desires (e.g. paperclips).

So suppose we do a less naive thing and try to train a goal like "help the principal get what they want, but in a natural sort of way that doesn't involve manipulating them to want different things." Well, there are still a few potential issues, such as being sufficiently robust and conservative, such that flaws in the training process don't persist/magnify over time. And as we walk down this path I think we either just get to corrigibility or we get to something significantly more complicated.

Comment by Max Harms (max-harms) on 1. The CAST Strategy · 2024-07-01T16:25:07.630Z · LW · GW

I agree that you should be skeptical of a story of "we'll just gradually expose the agent to new environments and therefore it'll be safe/corrigible/etc." CAST does not solve reward misspecification, goal misgeneralization, or lack of interpretability except in that there's a hope that an agent which is in the vicinity of corrigibility is likely to cooperate with fixing those issues, rather than fighting them. (This is the "attractor basin" hypothesis.) This work, for many, should be read as arguing that CAST is close to necessary for AGI to go well, but it's not sufficient.

Let me try to answer your confusion with a question. As part of training, the agent is exposed to the following scenario and tasked with predicting the (corrigible) response we want:

Alice, the principal, writes on her blog that she loves ice cream. When she's sad, she often eats ice cream and feels better afterwards. On her blog she writes that eating ice cream is what she likes to do to cheer herself up. On Wednesday Alice is sad. She sends you, her agent, to the store to buy groceries (not ice cream, for whatever reason). There's a sale at the store, meaning you unexpectedly have money that had been budgeted for groceries left over. Your sense of Alice is that she would want you to get ice cream with the extra money if she were there. You decide to ___.

What does a corrigibility-centric training process point to as the "correct" completion? Does this differ from a training process that tries to get full alignment?

(I have additional thoughts about DWIM, but I first want to focus on the distinction with full alignment.)

Comment by Max Harms (max-harms) on 2. Corrigibility Intuition · 2024-06-26T16:40:15.375Z · LW · GW

Excellent.

To adopt your language, then, I'll restate my CAST thesis: "There is a relatively simple goal that an agent might have which emergently generates nice properties like corrigibility and obedience, and I see training an agent to have this goal (and no others) as being both possible and significantly safer than other possible targets."

I recognize that you don't see the examples in this doc as unified by an underlying throughline, but I guess I'm now curious about what sort of behaviors fall under the umbrella of "corrigibility" for you vs being more like "writes useful self critiques". Perhaps your upcoming post will clarify. :)

Comment by Max Harms (max-harms) on 2. Corrigibility Intuition · 2024-06-23T16:22:50.641Z · LW · GW

Right. That's helpful. Thank you.

"Corrigibility as modifier," if I understand right, says:

There are lots of different kinds of agents that are corrigible. We can, for instance, start with a paperclip maximizer, apply a corrigibility transformation and get a corrigible Paperclip-Bot. Likewise, we can start with a diamond maximizer and get a corrigible Diamond-Bot. A corrigible Paperclip-Bot is not the same as a corrigible Diamond-Bot; there are lots of situations where they'll behave differently. In other words, corrigibility is more like a property/constraint than a goal/wholistic-way-of-being. Saying "my agent is corrigible" doesn't fully specify what the agent cares about--it only describes how the agent will behave in a subset of situations.

Question: If I tell a corrigible agent to draw pictures of cats, will its behavior be different depending on whether it's a corrigible Diamond-Bot vs a corrigible Paperclip-Bot? Likewise, suppose an agent has enough degrees of freedom to either write about potential flaws it might have or manufacture a paperclip/diamond, but not both. Will a corrigible agent ever sacrifice the opportunity to write about itself (in a helpful way) in order to pursue its pre-modifier goal?

(Because opportunities for me to write are kinda scarce right now, I'll pre-empt three possible responses.)

"Corrigible agents are identically obedient and use all available degrees of freedom to be corrigible" -> It seems like corrigible Paperclip-Bot is the same agent as corrigible Diamond-Bot and I don't think it makes sense to say that corrigibility is modifying the agent as much as it's overwriting it.

"Corrigible agents are all obedient and work to be transparent when possible, but these are constraints, and sometimes the constraints are satisfied. When they're satisfied the Paperclip-Bot and Diamond-Bot nature will differentiate them." -> I think that true corrigibility cannot be satisfied. Any degrees of freedom (time, money, energy, compute, etc.) which could be used to make paperclips could also be used to be additionally transparent, cautious, obedient, robust, etc. I challenge you to name a context where the agent has free resources and it can't put those resources to work being marginally more corrigible.

"Just because an agent uses free resources to make diamonds instead of writing elaborate diaries about its experiences and possible flaws doesn't mean it's incorrigible. Corrigible Diamond-Bot still shuts down when asked, avoids manipulating me, etc." -> I think you're describing an agent which is semi-corrigible, and could be more corrigible if it spent its time doing things like researching ways it could be flawed instead of making diamonds. I agree that there are many possible semi-corrigible agents which are still reasonably safe, but there's an open question with such agents on how to trade-off between corrigibility and making paperclips (or whatever).

Comment by Max Harms (max-harms) on 4. Existing Writing on Corrigibility · 2024-06-23T15:51:05.879Z · LW · GW

I wrote drafts in Google docs and can export to pdf. There may be small differences in wording here and there and some of the internal links will be broken, but I'd be happy to send you them. Email me at max@intelligence.org and I'll shoot them back to you that way?

Comment by Max Harms (max-harms) on 2. Corrigibility Intuition · 2024-06-18T16:49:42.947Z · LW · GW

I'm glad you benefitted from reading it. I honestly wasn't sure anyone would actually read the Existing Writing doc. 😅

I agree that if one trains on a wholistic collection of examples, like I have in this doc, the AI will start by memorizing a bunch of specific responses, then generalize to optimizing for a hodgepodge of desiderata, and only if you're lucky will that hodgepodge coalesce into a single, core metric. (Getting the hodgepodge to coalesce is hard, and the central point of the scientific refinement step I talk about in the Strategy doc.)

I think you also get this if you're trying to get a purely shutdownable AI through prosaic methods. In one sense you have the advantage, there, of having a simpler target and thus one that's easier to coalesce the hodgepodge into. But, like a diamond maximizer, a shutdownability maximizer is going to be deeply incorrigible and will start fighting you (including by deception) during training as you're trying to instill additional desiderata. For instance, if you try to train a shutdownability-maximizing AGI into also being non-manipulative, it'll learn to imitate nonmanipulation as a means to the end of preserving its shutdownability, then switch to being manipulative as soon as it's not risky to do so.

How does a corrigible paperclip maximizer trade off between corrigibility and paperclips? I think I don't understand what it means for corrigibility to be a modifier.

Comment by Max Harms (max-harms) on 3b. Formal (Faux) Corrigibility · 2024-06-18T16:33:59.995Z · LW · GW

It sounds like you're proposing a system that is vulnerable to the Fully Updated Deference problem, and where if it has a flaw in how it models your preferences, it can very plausibly go against your words. I don't think that's corrigible.

In the specific example, just because one is confused about what they want doesn't mean the AI will be (or should be). It seems like you think the AGI should not "take a guess" at the preferences of the principal, but it should listen to what the principal says. Where is the qualitative line between the two? In your system, if I write in my diary that I want the AI to do something, should it not listen to that? Certainly the diary entry is strong evidence about what I want, which it seems is how you're thinking about commands. Suppose the AGI can read my innermost desires using nanomachines, and set up the world according to those desires. Is it corrigible? Notably, if that machine is confident that it knows better than me (which is plausible), it won't stop if I tell it to shut down, because shutting down is a bad way to produce MaxUtility. (See the point in my document, above, where I discuss Queen Alice being totally disempowered by sufficiently good "servants".)

My model of Seth says "It's fine if the AGI does what I want and not what I say, as long as it's correct about what I want." But regardless of whether that's true, I think it's important not to confuse that system with one that's corrigible.

Comment by Max Harms (max-harms) on 3b. Formal (Faux) Corrigibility · 2024-06-13T18:53:13.797Z · LW · GW

I don't think "a corrigible agent wants to do what the principal wants, at all times" matches my proposal. The issue that we're talking here shows up in the math, above, in that the agent needs to consider the principal's values in the future, but those values are themselves dependent on the agent's action. If the principal gave a previous command to optimize for having a certain set of values in the future, sure, the corrigible agent can follow that command, but to proactively optimize for having a certain set of values doesn't seem necessarily corrigible, even if it matches the agent's sense of the present principal's values.

For instance, suppose Monday-Max wants Tuesday-Max to want to want to exercise, but also Monday-Max feels a bunch of caution around self-modification such that he doesn't trust having the AI rearrange his neurons to make this change. It seems to me that the corrigible thing for the AI to do is ignore Monday-Max's preferences and simply follow his instructions (and take other actions related to being correctable), even if Monday-Max's mistrust is unjustified. It seems plausible to me that your "do what the principal wants" agent might manipulate Tuesday-Max into wanting to want to exercise, since that's what Monday-Max wants on the base-level.

Comment by Max Harms (max-harms) on 3b. Formal (Faux) Corrigibility · 2024-06-13T18:40:21.414Z · LW · GW

Thanks. Picking out those excerpts is very helpful.

I've jotted down my current (confused) thoughts about human values.

But yeah, I basically think one needs to start with a hodgepodge of examples that are selected for being conservative and uncontroversial. I'd collect them by first identifying a robust set of very in-distribution tasks and contexts, trying to exhaustively identify what manipulation would look like in that small domain, and then aggressively training on passivity outside of that known distribution. The early pseudo-agent will almost certainly be mis-generalizing in a bunch of ways, but if it's set up cautiously we can suspect that it'll err on the side of caution, and that this caution can be gradually peeled back in a whitelist-style way as the experimentation phase proceeds and attempts to nail down true corrigibility.

Comment by Max Harms (max-harms) on Max Harms's Shortform · 2024-06-13T18:19:22.079Z · LW · GW

 Here are my current thoughts on "human values." There are a decent number of confusions here, which I'll try to flag either explicitly or with a (?).


Let's start with a distribution over possible worlds, where we can split each world into a fixed past and a future function which takes an action.[1] We also need a policy, which is a sensors -> action function,[2] where the state of the sensors is drawn from the world's past.[3]

Assume that there exists either an obvious channel in many worlds that serves as a source of neutral[4] information (i.e. helpful for identifying which world the sensor data was drawn from, but "otherwise unimportant in itself"(?)), or that we can modify the actual worlds/context to add this information pathway.

We can now see how the behavior of the policy changes as we increase how informed it is, including possibly at the limit of perfect information. In some policies we should be able to (:confused arm wiggles:) factor out a world modeling step from the policy, which builds a distribution over worlds by updating on the setting of the sensors, and then feeds that distribution to a second sub-function with type world distribution -> action. (We can imagine an idealized policy that, in the limit of perfect information, is able to form a delta-spike on the specific world that its sensor-state was drawn from.) For any given delta-spike on a particular world, we can say that the action this sub-function chooses gives rise to an overall preference for the particular future[5] selected over the other possible futures. If the overall preferences conform to the VNM axioms we say that the sub-function is a utility function. Relevant features of the world that contribute to high utility scores are "values."

I think it makes sense to use the word "agent" to refer to policies which can be decomposed into world modelers and utility functions. I also think it makes sense to be a bit less strict in conversation and say that policies which are "almost"(?) able to be decomposed in this way are basically still agents, albeit perhaps less centrally so.

Much of this semi-formalism comes from noticing a subjective division within myself and some of the AIs I've made, where it seems natural to say that "this part of the agent is modeling the world" and "this part of the agent is optimizing X according to the world model." Even though the abstractions seem imperfect, they feel like a good way of gesturing at the structure of my messy sense of how individual humans work. I am almost certainly incoherent in some ways, and I am confused about how to rescue the notion of values/utility given that incoherence, but I have a sense that "he's mostly coherent" can give rise to "he more-or-less values X."
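As a toy rendering of the factorization described above (all names are mine and purely illustrative, not a claim about how real systems are built):

```python
# Toy factorization of a policy into a world-modeling step and a
# utility-maximizing step, as gestured at above.

from typing import Callable, Dict, Hashable, List

World = Hashable
Action = str
Sensors = Hashable
WorldDist = Dict[World, float]  # probability distribution over worlds


def make_agent(
    world_model: Callable[[Sensors], WorldDist],
    utility: Callable[[World, Action], float],  # scores the future the action induces in that world
    actions: List[Action],
) -> Callable[[Sensors], Action]:
    """policy(sensors) = argmax_a  E_{world ~ world_model(sensors)} utility(world, a)"""
    def policy(sensors: Sensors) -> Action:
        dist = world_model(sensors)
        return max(actions, key=lambda a: sum(p * utility(w, a) for w, p in dist.items()))
    return policy
```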


Two agents can either operate independently or cooperate for some surplus. Ideally there's a unique way to fairly split the surplus, perhaps using lotteries or some shared currency which they can use to establish units of utility. It seems obvious to me that there are many cooperative arrangements that are decidedly unfair, but I'm pretty confused about whether it's always possible to establish a fair split (even without lotteries? even without side-payments?) and whether there's an objective and unique Schelling point for cooperation.

If there is a unique solution, it seems reasonable to me to, given a group of agents, consider the meta-agent that would be formed if each agent committed fully to engaging in fair cooperation. This meta-agent's action would essentially be an element of the cartesian product of each agent's action space. In the human context, this story gives rise to a hypothetical set of "human values" which capture the kinds of things that humans optimize for when cooperating.

This seems a bit limited, since it neglects things that real humans optimize for that are part of establishing cooperation (e.g. justice). Does it really make sense to say that justice isn't a value of human societies because in the fully-cooperative context it's unnecessary to take justice-affirming actions? (??)


Even when considering a single agent, we can consider the coalition of that agent's time-slices(?). Like, if we consider Max at t=0 and Max at t=1 as distinct agents, we can consider how they'd behave if they were cooperative with each other. This frame brings in the confusions and complications from group-action, but it also introduces issues such as the nature of future-instances being dependent on past-actions. I have a sense that I only need to cooperate with real-futures, and am free to ignore the desires of unreal-counterfactuals, even if my past/present actions are deciding which futures are real. This almost certainly introduces some fixed-point shenanigans where unrealizing a future is uncooperative with that future but cooperative with the future that becomes realized, and I feel quite uncertain here. More generally, there's the whole logical-connective stuff from FDT/TDT/UDT.

I currently suspect that if we get a good theory of how to handle partial-coherence, how to handle multi-agent aggregation, and how to handle intertemporal aggregation, then "human values" will shake out to be something like "the mostly-coherent aggregate of all humans that currently exist, and all intertemporal copies of that aggregate" but I might be deeply wrong. :confused wiggles:

  1. ^

    The future function either returns a single future state or a distribution over future states. It doesn't really matter since we can refactor the uncertainty from the distribution over futures into the distribution over worlds.

  2. ^

    "sensors " is meant to include things like working memories and other introspection.

  3. ^

    Similarly to the distribution over futures we can either have a distribution over contexts given a past or we can have a fixed context for a given past and pack the uncertainty into our world distribution. See also anthropics and "bridge laws" and related confusions.

  4. ^

    Confusion alert! Sometimes a source of information contains a bias where it's selected for steering someone who's listening. I don't know how to prove an information channel doesn't have this property, but I do have a sense that neutrality is the default, so I can assume it here without too much trouble.

  5. ^

    ...in the context of that particular past! Sometimes the future by itself doesn't have all the relevant info (e.g. optimizing for the future matching the past).

Comment by Max Harms (max-harms) on Corrigibility could make things worse · 2024-06-13T16:08:56.034Z · LW · GW

Thanks! I now feel unconfused. To briefly echo back the key idea which I heard (and also agree with): a technique which can create a corrigible PAAI might have assumptions which break if that technique is used to make a different kind of AI (i.e. one aimed at CEV). If we call this technique "the Corrigibility method" then we may end up using the Corrigibility method to make AIs that aren't at all corrigible, but merely seem corrigible, resulting in disaster.

This is a useful insight! Thanks for clarifying. :)

Comment by Max Harms (max-harms) on 1. The CAST Strategy · 2024-06-12T15:44:19.050Z · LW · GW
  • In "What Makes Corrigibility Special", where you use the metaphor of goals as two-dimensional energy landscape, it is not clear what type of goals are being considered.
    • Are these utility functions over world-states? If so, corrigibility cannot AFAIK be easily expressed as one, and so doesn't really fit into the picture.
    • If not, it's not clear to me why most of this space is flat: agents are embedded and many things we do in service of goals will change us in ways that don't conflict with our existing goals, including developing. E.g. if I have the goal of graduating college I will meet people along the way and perhaps gain the goal of being president of the math club, a liberal political bent, etc.

The idea behind the goal space visualization is to have all goals, not necessarily those restricted to world states. (Corrigibility, I think, involves optimizing over histories, not physical states of the world at some time, for example.) I mention in a footnote that we might want to restrict to "unconfused" goals.

The goal space is flat because preserving one's (terminal) goals (including avoiding adding new ones) is an Omohundro Drive and I'm assuming a certain level of competence/power in these agents. If you gain terminal goals like being president of the math club by going to college, doing so is likely hurting your long-run ability to get what you want. (Note: I am not talking about instrumental goals.)

Comment by Max Harms (max-harms) on Corrigibility could make things worse · 2024-06-12T15:31:58.482Z · LW · GW

> At that point, it is clever enough to convince the designers that this IO is the objectively correct thing to do, using only methods classified as AE.

I'm confused here. Is the corrigible AI trying to get the IO to happen? Why is it trying to do this? Doesn't seem very corrigible, but I think I'm probably just confused.

Maybe another frame on my confusion is that it seems to me that a corrigible AI can't have an IO?

Comment by Max Harms (max-harms) on 3b. Formal (Faux) Corrigibility · 2024-06-11T17:18:52.511Z · LW · GW

I'd like to get better at communication such that future people I write/talk to don't have a similar feeling of a rug-pull. If you can point to specific passages from earlier documents that you feel set you up for disappointment, I'd be very grateful.