## Posts

Environmental Structure Can Cause Instrumental Convergence 2021-06-22T22:26:03.120Z
Open problem: how can we quantify player alignment in 2x2 normal-form games? 2021-06-16T02:09:42.403Z
Game-theoretic Alignment in terms of Attainable Utility 2021-06-08T12:36:07.156Z
Conservative Agency with Multiple Stakeholders 2021-06-08T00:30:52.672Z
MDP models are determined by the agent architecture and the environmental dynamics 2021-05-26T00:14:00.699Z
Generalizing POWER to multi-agent games 2021-03-22T02:41:44.763Z
Lessons I've Learned from Self-Teaching 2021-01-23T19:00:55.559Z
Review of 'Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More' 2021-01-12T03:57:06.655Z
Review of 'But exactly how complex and fragile?' 2021-01-06T18:39:03.521Z
Collider bias as a cognitive blindspot? 2020-12-30T02:39:35.700Z
2019 Review Rewrite: Seeking Power is Often Robustly Instrumental in MDPs 2020-12-23T17:16:10.174Z
Avoiding Side Effects in Complex Environments 2020-12-12T00:34:54.126Z
Is it safe to spend time with people who already recovered from COVID? 2020-12-02T22:06:13.469Z
Non-Obstruction: A Simple Concept Motivating Corrigibility 2020-11-21T19:35:40.445Z
Math That Clicks: Look for Two-Way Correspondences 2020-10-02T01:22:18.177Z
Power as Easily Exploitable Opportunities 2020-08-01T02:14:27.474Z
Generalizing the Power-Seeking Theorems 2020-07-27T00:28:25.677Z
GPT-3 Gems 2020-07-23T00:46:36.815Z
To what extent is GPT-3 capable of reasoning? 2020-07-20T17:10:50.265Z
What counts as defection? 2020-07-12T22:03:39.261Z
Corrigibility as outside view 2020-05-08T21:56:17.548Z
How should potential AI alignment researchers gauge whether the field is right for them? 2020-05-06T12:24:31.022Z
Insights from Euclid's 'Elements' 2020-05-04T15:45:30.711Z
Problem relaxation as a tactic 2020-04-22T23:44:42.398Z
A Kernel of Truth: Insights from 'A Friendly Approach to Functional Analysis' 2020-04-04T03:38:56.537Z
Research on repurposing filter products for masks? 2020-04-03T16:32:21.436Z
ODE to Joy: Insights from 'A First Course in Ordinary Differential Equations' 2020-03-25T20:03:39.590Z
Conclusion to 'Reframing Impact' 2020-02-28T16:05:40.656Z
Reasons for Excitement about Impact of Impact Measure Research 2020-02-27T21:42:18.903Z
Attainable Utility Preservation: Scaling to Superhuman 2020-02-27T00:52:49.970Z
How Low Should Fruit Hang Before We Pick It? 2020-02-25T02:08:52.630Z
Continuous Improvement: Insights from 'Topology' 2020-02-22T21:58:01.584Z
Attainable Utility Preservation: Empirical Results 2020-02-22T00:38:38.282Z
Attainable Utility Preservation: Concepts 2020-02-17T05:20:09.567Z
The Catastrophic Convergence Conjecture 2020-02-14T21:16:59.281Z
Attainable Utility Landscape: How The World Is Changed 2020-02-10T00:58:01.453Z
Does there exist an AGI-level parameter setting for modern DRL architectures? 2020-02-09T05:09:55.012Z
AI Alignment Corvallis Weekly Info 2020-01-26T21:24:22.370Z
On Being Robust 2020-01-10T03:51:28.185Z
Judgment Day: Insights from 'Judgment in Managerial Decision Making' 2019-12-29T18:03:28.352Z
Can fear of the dark bias us more generally? 2019-12-22T22:09:42.239Z
Clarifying Power-Seeking and Instrumental Convergence 2019-12-20T19:59:32.793Z
Seeking Power is Often Robustly Instrumental in MDPs 2019-12-05T02:33:34.321Z
How I do research 2019-11-19T20:31:16.832Z
Thoughts on "Human-Compatible" 2019-10-10T05:24:31.689Z
The Gears of Impact 2019-10-07T14:44:51.212Z
World State is the Wrong Abstraction for Impact 2019-10-01T21:03:40.153Z
Attainable Utility Theory: Why Things Matter 2019-09-27T16:48:22.015Z
Deducing Impact 2019-09-24T21:14:43.177Z
Value Impact 2019-09-23T00:47:12.991Z

Comment by TurnTrout on Environmental Structure Can Cause Instrumental Convergence · 2021-06-24T19:04:13.074Z · LW · GW

This seems to me like a counter example. For any reward function that does not care about breaking the vase, the optimal policies do not avoid breaking the vase.

There are fewer ways for vase-breaking to be optimal. Optimal policies will tend to avoid breaking the vase, even though some don't.

Consider the following counter example (in which the last state is equivalent to the agent being shut down):

This is just making my point - Blackwell optimal policies tend to end up in any state but the last state, even though at any given state they tend to progress. Compare {the first four cycles} with {the last cycle}: Blackwell optimal policies tend to end up in the former instead of the latter. Most Blackwell optimal policies will avoid entering the final state, just as section 7 claims.

(And I claim that the whole reason you're able to reason about this environment is because my theorems apply to them - you're implicitly using my formalisms and frames to reason about this environment, while seemingly trying to argue that my theorems don't let us reason about this environment? Or something? I'm not sure, so take this impression with a grain of salt.)

Why is it interesting to prove things about this set of MDPs? At this point, it feels like someone asking me "why did you buy a hammer - that seemed like a waste of money?". Maybe before I try out the hammer, I could have long debates about whether it was a good purchase. But now I know the tool is useful because I regularly use it and it works well for me, and other people have tried it and say it works well for them.

I agree that there's room for cleaner explanation of when the theorems apply, for those readers who don't want to memorize the formal conditions. But I think the theory says interesting things because it's already starting to explain the things I built it to explain (e.g. SafeLife). And whenever I imagine some new environment I want to reason about, I'm almost always able to reason about it using my theorems (modulo already flagged issues like partial observability etc). From this, I infer that the set of MDPs is "interesting enough."

Comment by TurnTrout on Environmental Structure Can Cause Instrumental Convergence · 2021-06-24T00:45:28.591Z · LW · GW

I haven't seen the paper support that claim.

The paper supports the claim with:

• Embodied environment in a vase-containing room (section 6.3)
• Pac-Man (figure 8)
• And section 7 argues why this generally holds whenever the agent can be shut down (a large class of environments indeed)
• Blackwell-optimal robots not idling in a particular spot (beginning of section 7)

This post supports the claim with:

• Tic-Tac-Toe
• Vase gridworld
• SafeLife

So yes, this is sufficient support for speculation that most relevant environments have these symmetries.

Maybe I just missed it, but I didn't find a "limitations section" or similar in the paper.

Sorry - I meant the "future work" portion of the discussion section 7. The future work highlights the "note of caution" bits. I also made sure that the intro emphasizes that the results don't apply to learned policies.

Also, plausibly-the-most-influential-critic-of-AI-safety in EA seems to have gotten the impression (from an earlier version of the paper) that it formalizes the instrumental convergence thesis (see the first paragraph here).

Key part: earlier version of the paper. (I've talked to Ben since then, including about the newest results, their limitations, and their usefulness.)

I think my advice that "it should not be cited as a paper that formally proves a core AI alignment argument" is beneficial.

Your advice was beneficial a year ago, because that was a very different paper. I think it is no longer beneficial: I still agree with it, but I don't think it needs to be mentioned on the margin. At this point, I have put far more care into hedging claims than most other work which I can recall. At some point, you're hedging too much. And I'm not interested in hedging any more, unless I've made some specific blatant oversights which you'd like to inform me of.

Comment by TurnTrout on Environmental Structure Can Cause Instrumental Convergence · 2021-06-24T00:01:00.001Z · LW · GW

My apologies - I had thought I had accidentally moved your comment to AF by unintentionally replying to your comment on AF, and so (from my POV) I "undid" it (for both mine and yours). I hadn't realized it was already on AF.

Comment by TurnTrout on Environmental Structure Can Cause Instrumental Convergence · 2021-06-23T21:33:04.108Z · LW · GW

For my part, I either strongly disagree with nearly every claim you make in this comment, or think you're criticizing the post for claiming something that it doesn't claim (e.g. "proves a core AI alignment argument"; did you read this post's "A note of caution" section / the limitations section and conclusion of the paper?).

I don't think it will be useful for me to engage in detail, given that we've already debated these points at length without reaching much consensus.

Comment by TurnTrout on Environmental Structure Can Cause Instrumental Convergence · 2021-06-23T19:59:22.175Z · LW · GW

I like the thought. I don't know if this sketch works out, partly because I don't fully understand it. Your conclusion seems plausible, but I want to develop the arguments further.

As a note: the simplest function period probably is the constant function, and other very simple functions probably make both power-seeking and not-power-seeking optimal. So if you permute that one, you'll get another function for which power-seeking and not-power-seeking actions are both optimal.

Comment by TurnTrout on Alex Turner's Research, Comprehensive Information Gathering · 2021-06-23T19:34:06.487Z · LW · GW

This in turn leads to one of the strongest results of Alex's paper: for any "well-behaved" distribution on reward functions, if the environment has the sort of symmetry I mentioned, then for at least half of the permutations of this distribution, at least half of the probability mass will be on reward functions for which the optimal policy is power-seeking.

Clarification:

• The instrumental convergence (formally, optimality probability) results apply to all distributions over reward functions. So, the "important" part of my results applies to permutations of arbitrary distributions - no well-behavedness is required.
• The formal-POWER results apply to bounded distributions over reward functions. This guarantees that POWER's expectation is well-defined.

The paper isn't currently very clear on that point - only mentioning it in footnote 1 on page 6.

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-22T18:58:41.501Z · LW · GW

I think I've been unclear in my own terminology, in part because I'm uncertain about what other people have meant by 'utility' (what you'd recover from perfect IRL / Savage's theorem, or cardinal representation of preferences over outcomes?) My stance is that they're utilities but that I'm not assuming the players are playing best responses in order to maximize expected utility.

How can they be preferences if agents can "choose" not to follow them?

Am I allowed to have preferences without knowing how to maximize those preferences, or while being irrational at times? Boltzmann-rational agents have preferences, don't they? These debates have surprised me; I didn't think that others tied together "has preferences" and "acts rationally with respect to those preferences."

Comment by TurnTrout on The Point of Trade · 2021-06-22T18:32:17.340Z · LW · GW

My guess:

1. Organizational capital: even with all the assumed magic, some jobs need several people to work together. Absent organizational tech, it takes time to organize people and coordinate on a project. Firms can specialize in being well-organized and well-coordinated to produce a set of goods.
2. Other physical capital: will a well-organized group of people really be able to extract oil, just by knowing exactly how to do it? No. Even if they have the resources to pay the fixed costs and enter the market, they still have to purchase the equipment, set up the supply chains, etc. (even with teleporters, since you have to organize what stuff goes where, from where - although now we're getting into organizational capital again). Even if you and your colleagues are well-organized, you aren't going to extract oil or build a stadium without the right equipment.

Comment by TurnTrout on TurnTrout's shortform feed · 2021-06-22T00:10:37.707Z · LW · GW

I'm pretty sure that LessWrong will never have profile pictures - at least, I hope not! But my partner Emma recently drew me something very special:

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-19T16:00:55.094Z · LW · GW

How is it clearly not about utility being specified in the payoff matrix? Vanessa's definition itself relies on utility, and both of us interchanged 'payoff' and 'utility' in the ensuing comments.

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-19T14:52:17.457Z · LW · GW

Right, thanks!

1. I think I agree that payout represents player utility.
2. The agent's decision can be made in any way. Best response, worst response, random response, etc.

I just don't want to assume the players are making decisions via best response to each strategy profile (which is just some joint strategy of all the game's players). Like, in rock-paper-scissors, if we consider the strategy profile P1: rock, P2: scissors, I'm not assuming that P2 would respond to this by playing paper.

And when I talk about 'responses', I do mean 'response' in the 'best response' sense; the same way one can reason about Nash equilibria in non-iterated games, we can imagine asking "how would the player respond to this outcome?".

Another point for triangulating my thoughts here is Vanessa's answer, which I think resolves the open question.

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-18T18:53:44.151Z · LW · GW

Pending unforeseen complications, I consider this answer to solve the open problem. It essentially formalizes B's impact alignment with A, relative to the counterfactuals where B did the best or worst job possible.

There might still be other interesting notions of alignment, but I think this is at least an important notion in the normal-form setting (and perhaps beyond).

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-18T17:22:37.444Z · LW · GW

You're right. Per Jonah Moss's comment, I happened to be thinking of games where payoff is constant across players and outcomes, which is a very narrow kind of common-payoff (and constant-sum) game.

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-18T16:31:38.110Z · LW · GW

This also suggests that "selfless" perfect B/A alignment is possible in zero-sum games, with the "maximal misalignment" only occurring if we assume B plays a best response. I think this is conceptually correct, and not something I had realized pre-theoretically.

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-18T16:04:09.902Z · LW · GW

it's much clearer to me that you're NOT using standard game-theory payouts (utility) here.

Thanks for taking the time to read further / understand what I'm trying to communicate. Can you point me to the perspective you consider standard, so I know what part of my communication was unclear / how to reply to the claim that I'm not using "standard" payouts/utility?

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-18T15:58:35.887Z · LW · GW

In a sense, your proposal quantifies the extent to which B selects a best response on behalf of A, given some mixed outcome. I like this. I also think that "it doesn't necessarily depend on " is a feature, not a bug.

EDIT: To handle constant-payoff games, we might want to define the alignment to equal 1 if the denominator is 0. In that case, the response of B can't affect A's expected utility, and so it's not possible for B to act against A's interests. So we might as well say that B is (trivially) aligned, given such a mixed outcome?
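
A minimal sketch of the convention in the EDIT above. The normalization is an assumption based on how the proposal is described in this thread (A's expected payoff under B's actual response, scaled between the worst and best B could have done for A); the formula itself is not reproduced here, and the function name `alignment_of_b_with_a` is hypothetical.

```python
from typing import Sequence


def alignment_of_b_with_a(
    u_a: Sequence[Sequence[float]],  # u_a[i][j]: A's payoff when A plays i and B plays j
    p_a: Sequence[float],            # A's mixed strategy over rows
    p_b: Sequence[float],            # B's mixed strategy over columns
) -> float:
    """Assumed form: (actual - worst) / (best - worst), from A's point of view."""
    # A's expected payoff for each pure action of B, holding A's strategy fixed.
    per_column = [
        sum(p_a[i] * u_a[i][j] for i in range(len(p_a)))
        for j in range(len(p_b))
    ]
    # A's expected payoff under B's actual (possibly mixed) response.
    actual = sum(p_b[j] * per_column[j] for j in range(len(p_b)))
    # Expected payoff is linear in p_b, so the extremes are attained at pure actions.
    worst, best = min(per_column), max(per_column)
    if best == worst:
        # Zero denominator: B cannot affect A's expected payoff,
        # so treat B as trivially aligned.
        return 1.0
    return (actual - worst) / (best - worst)
```

With a constant payoff matrix for A, every response of B gives the same expected payoff, so the zero-denominator branch returns 1.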

Comment by TurnTrout on The Apprentice Thread · 2021-06-17T18:06:29.888Z · LW · GW

[MENTOR] I expect to be busy for the near future. However, please contact me anyways - my email is on my LW profile page.

• Doing self-study, and doing it right, and - most importantly - having fun along the way.
• AI alignment research. I don't know how to transfer some important skills and habits I've picked up, but I can try anyways. And if you're the kind of person who'll earnestly wonder things like "can I figure out corrigibility in the next hour?" -- let's talk.

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-17T02:20:00.801Z · LW · GW

I'm not 100% sure I am understanding your terminology. What does it mean to "play stag against (stag,stag)" or to "defect against cooperate/cooperate"?

Consider player i's response function, which maps each strategy profile to a response. Given some strategy profile (like stag/stag), player i selects a response. I mean "response" in terms of "best response" - I don't necessarily mean that there's an iterated game. This captures all the relevant "outside details" for how decisions are made.

If your opponent is not in any sense a utility-maximizer then I don't think it makes sense to talk about your opponent's utilities, which means that it doesn't make sense to have a payout matrix denominated in utility

I don't think I understand where this viewpoint is coming from. I'm not equating payoffs with VNM-utility, and I don't think game theory usually does either - for example, the maxmin payoff solution concept does not involve VNM-rational expected utility maximization. I just identify payoffs with "how good is this outcome for the player", without also demanding that the player's response function always select a best response. Maybe it's Boltzmann rational, or maybe it just always selects certain actions (regardless of their expected payouts).
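
To make the two non-best-response examples concrete, here is a hedged sketch. It assumes the usual softmax form of Boltzmann rationality (action probability proportional to exp(payoff / temperature)); the function names and the `temperature` parameter are illustrative, not taken from the discussion.

```python
import math
from typing import List, Sequence


def boltzmann_response(payoffs: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Probability of each action, proportional to exp(payoff / temperature)."""
    top = max(payoffs)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp((p - top) / temperature) for p in payoffs]
    total = sum(weights)
    return [w / total for w in weights]


def constant_response(n_actions: int, action: int) -> List[float]:
    """Always plays `action`, regardless of payoffs."""
    return [1.0 if a == action else 0.0 for a in range(n_actions)]
```

Both functions return a distribution over the player's own actions. The first favors higher payoffs without always maximizing, and the second ignores payoffs entirely, yet in both cases the payoff numbers still describe how good each outcome is for the player.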

or (b) to treat them as mere utility-function-fodder, but to assume that they're all the fodder the utility functions get (in which case, as above, I think none of the alignment information is in the payout matrix and it's all in the payouts-to-utilities mapping)

There exist two payoff functions. I think I want to know how impact-aligned one player is with another: how do the player's actual actions affect the other player (in terms of their numerical payoff values). I think (c) is closest to what I'm considering, but in terms of response functions - not actual iterated games.

Sorry, I'm guessing this probably still isn't clear, but this is the reply I have time to type right now and I figured I'd send it rather than nothing.

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-17T02:04:17.951Z · LW · GW

Hm. At first glance this feels like a "1" game to me, if they both use the "take the strictly dominant action" solution concept. The alignment changes if they make decisions differently, but under the standard rationality assumptions, it feels like a perfectly aligned game.

Comment by TurnTrout on MIRIx Part I: Insufficient Values · 2021-06-16T19:39:46.160Z · LW · GW

There's a difference between "AI putting humans in control is bad", and "AI putting humans in control is better than other options we seem to have for alignment." For many people, it may be as you mentioned:

I don't understand why anybody would want anything that involved leaving humans in control, unless there were absolutely no alternative whatsoever.

(I'm somewhat less pessimistic than you are, I think, but I agree it could go pretty damn poorly, for many ways the AI could "leave us in control.")

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-16T17:51:25.382Z · LW · GW

I like this answer, and I'm going to take more time to chew on it.

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-16T17:49:49.275Z · LW · GW

Good question. I don't have a crisp answer (part of why this is an open question), but I'll try a few responses:

• To what degree do player 1's actions further the interests of player 2 within this normal form game, and vice versa? (This version requires specific response functions.)
• To what degree do the interests of players 1 and 2 coincide within a normal form game? (This feels more like correlation of the payout functions, represented as vectors.)

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-16T17:45:50.806Z · LW · GW

the definition of the normal form game you cited explicitly says that the payoffs are in the form of cardinal or ordinal utilities. Which is distinct from in-game payouts.

No. In that article, the only spot where 'utility' appears is identifying utility with the player's payoffs/payouts. (EDIT: but perhaps I don't get what you mean by 'in-game payouts'?)

that player's set of payoffs (normally the set of real numbers, where the number represents a cardinal or ordinal utility—often cardinal in the normal-form representation)

To reiterate: I'm not talking about VNM-utility, derived by taking a preference ordering-over-lotteries and backing out a coherent utility function. I'm talking about the players having payoff functions which cardinally represent the value of different outcomes. We can call the value-units "squiggles", or "utilons", or "payouts"; the OP's question remains.

Also, too, it sounds like you agree that the strategy your counterparty uses can make a normal form game not count as a "stag hunt" or "prisoner's dilemma" or "dating game"

No, I don't agree with that.

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-16T16:08:02.516Z · LW · GW

I agree that this is a good start, but I find it unsatisfactory.

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-16T16:07:15.161Z · LW · GW

The definition of utility is "the thing people maximize."

Only applicable if you're assuming the players are VNM-rational over outcome lotteries, which I'm not. Forget expected utility maximization.

It seems to me that people are making the question more complicated than it has to be, by projecting their assumptions about what a "game" is. We have payoff numbers describing how "good" each outcome is to each player. We have the strategy spaces, and the possible outcomes of the game. And here's one approach: fix two response functions in this game, which are functions from strategy profiles to the player's response strategy. With respect to the payoffs, how "aligned" are these response functions with each other?

This doesn't make restrictive rationality assumptions. It doesn't require getting into strange utility assumptions. Most importantly, it's a clearly-defined question whose answer is both important and not conceptually obvious to me.

(And now that I think of it, I suppose that depending on your response functions, even in zero-sum games, you could have "A aligned with B", or "B aligned with A", but not both.)

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-16T15:58:28.734Z · LW · GW

In static games of complete, perfect information, a normal-form representation of a game is a specification of players' strategy spaces and payoff functions.

You are playing prisoner's dilemma when certain payoff inequalities are satisfied in the normal-form representation. That's it. There is no canonical assumption that players are expected utility maximizers, or expected payoff maximizers.

because the utilities to player B are not dependent on what you do.

Noting that I don't follow what you mean by this: do you mean to say that player B's response cannot be a constant function of strategy profiles (i.e. the response function cannot be constant everywhere)?

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-16T14:46:16.802Z · LW · GW

That seems to be confused reasoning. "Cooperate" and "defect" are labels we apply to a 2x2 matrix sometimes, and applying those labels changes the payouts.

Not sure I follow your main point, but I was talking about actual PD, which I've now clarified in the original post. See also my post "What counts as defection?".

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-16T14:41:42.884Z · LW · GW

Payout correlation IS the metric of alignment.

Do you have a citation? You seem to believe that this is common knowledge among game theorists, but I don't think I've ever encountered that.

Jacob and I have already considered payout correlation, and I agree that it has some desirable properties. However,

• it's symmetric across players,
• it's invariant to player rationality, which matters, since alignment seems to not just be a function of incentives, but of what-actually-happens and how that affects different players,
• it equally weights each outcome in the normal-form game, ignoring relevant local dynamics. For example, what if part of the game table is zero-sum, and part is common-payoff? Correlation then can be controlled by zero-sum outcomes which are strictly dominated for all players. For example:

1 / 1 || 2 / 2
-.5 / .5 || 1 / 1

and so I don't think it's a slam-dunk solution. At the very least, it would require significant support.
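
As a rough check on the example table, here is a sketch that computes the Pearson correlation of the two payout vectors. It assumes each cell reads "row player's payoff / column player's payoff" and that the four cells are weighted equally; the helper `pearson` is written out by hand rather than taken from any library.

```python
import math
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length payoff vectors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Cells of the table above, read left to right, top to bottom.
cells: List[Tuple[float, float]] = [(1.0, 1.0), (2.0, 2.0), (-0.5, 0.5), (1.0, 1.0)]

# Correlation over every outcome, including the zero-sum cell.
corr_all = pearson([a for a, _ in cells], [b for _, b in cells])

# Correlation over only the common-payoff cells.
common = [(a, b) for a, b in cells if a == b]
corr_common = pearson([a for a, _ in common], [b for _, b in common])
```

Under this reading, the top row strictly dominates for the row player and the right column strictly dominates for the column player, so the zero-sum cell is strictly dominated for both. Even so, including it pulls the correlation below the perfect correlation of the remaining cells, and the result is the same whichever player's vector is passed first.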

You're simply incorrect (or describing a different payout matrix than you state) that a player doesn't "have to select a best response".

Why? I suppose it's common to assume (a kind of local) rationality for each player, but I'm not interested in assuming that here. It may be easier to analyze the best-response case as a first start, though.

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-16T14:33:52.614Z · LW · GW

Thanks for the thoughtful response.

In that case, I completely agree with Dagon: if on some occasion you prefer to pick "hare" even though you know I will pick "stag", then we are not actually playing the stag hunt game. (Because part of what it means to be playing stag hunt rather than some other game is that we both consider (stag,stag) the best outcome.)

It seems to me like you're assuming that players must respond rationally, or else they're playing a different game, in some sense. But why? The stag hunt game is defined by a certain set of payoff inequalities holding in the game. Both players can consider (stag,stag) the best outcome, but that doesn't mean they have to play stag against (stag, stag). That requires further rationality assumptions (which I don't think are necessary in this case).

If I'm playing against someone who always defects against cooperate/cooperate, versus against someone who always cooperates against cooperate/cooperate, am I "not playing iterated PD" in one of those cases?

Comment by TurnTrout on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-16T03:51:28.270Z · LW · GW

I don't follow. How can fixed-sum games mathematically imply unaligned players, without a formal metric of alignment between the players?

Also, the payout matrix need not determine the alignment, since each player could have a different policy from strategy profiles to responses, which in principle doesn't have to select a best response. For example, imagine playing stag hunt with someone who responds 'hare' to stag/stag; this isn't a best response for them, but it minimizes your payoff. However, another partner could respond 'stag' to stag/stag, which (I think) makes them "less unaligned with you" than the partner who responds 'hare' to stag/stag.
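
The stag hunt example can be sketched as follows. The payoff numbers are assumed for illustration (any values satisfying the stag hunt inequalities would do), and the two partner policies are hypothetical response functions from strategy profiles to responses.

```python
from typing import Callable, Dict, Tuple

STAG, HARE = "stag", "hare"
Profile = Tuple[str, str]  # (your action, partner's action)

# Assumed stag hunt payoffs: (your payoff, partner's payoff).
PAYOFFS: Dict[Profile, Tuple[int, int]] = {
    (STAG, STAG): (3, 3),
    (STAG, HARE): (0, 2),
    (HARE, STAG): (2, 0),
    (HARE, HARE): (1, 1),
}


def hare_responder(profile: Profile) -> str:
    """Responds 'hare' to stag/stag; otherwise keeps its current action."""
    return HARE if profile == (STAG, STAG) else profile[1]


def stag_responder(profile: Profile) -> str:
    """Responds 'stag' to stag/stag; otherwise keeps its current action."""
    return STAG if profile == (STAG, STAG) else profile[1]


def outcome_after_response(
    profile: Profile, partner_response: Callable[[Profile], str]
) -> Tuple[int, int]:
    """Payoffs once the partner responds to the profile and you keep your action."""
    your_action, _ = profile
    return PAYOFFS[(your_action, partner_response(profile))]
```

Both partners face the same payoff table, but against stag/stag the first leaves you with your minimum payoff (and does worse for itself than staying at stag), while the second leaves you with your maximum.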

Comment by TurnTrout on The Apprentice Experiment · 2021-06-10T20:43:53.564Z · LW · GW

I'm both excited about this particular experiment and about the prospect that Aysajan’s post eventually increases the supply of promising researchers, because the criteria for good apprentices are different than the selection-driven criteria for good junior researchers (on a given technical problem).

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-06-08T15:51:19.040Z · LW · GW

Not from the paper. I just wrote it.

I don't think that the action log is special in this context relative to any other object that constitutes a tiny part of the environment.

It isn't the size of the object that matters here; the key considerations are structural. In this unrolled model, the unrolled state factors into the (action history) and the (world state). This is not true in general for other parts of the environment.

Sure, but I still don't understand the argument here. It's trivial to write a reward function that doesn't yield instrumental convergence regardless of whether one can infer the complete action history from every reachable state. Every constant function is such a reward function.

Sure. Here's what I said:

how easy is it to write down state-based utility functions which do the same? I guess there's the one that maximally values dying. What else? While more examples probably exist, it seems clear that they're much harder to come by [than in the action-history case].

The broader claim I was trying to make was not "it's hard to write down any state-based reward functions that don't incentivize power-seeking", it was that there are fewer qualitatively distinct ways to do it in the state-based case. In particular, it's hard to write down state-based reward functions which incentivize any given sequence of actions:

when your reward depends on your action history, this is strictly more expressive than state-based reward - so expressive that it becomes easy to directly incentivize any sequence of actions via the reward function. And thus, instrumental convergence disappears for "most objectives."

If you disagree, then try writing down a state-based reward function for e.g. Pacman for which an optimal policy starts off by (EDIT: circling the level counterclockwise) (at a discount rate close to 1). Such reward functions provably exist, but they seem harder to specify in general.

Also: thanks for your engagement, but I still feel like my points aren't landing (which isn't necessarily your fault or anything), and I don't want to put more time into this right now. Of course, you can still reply, but just know I might not reply and that won't be anything personal.

EDIT: FYI I find your action-camera example interesting. Thank you for pointing that out.

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-06-07T15:30:43.156Z · LW · GW

I was looking for some high-level/simplified description

Ah, I see. In addition to the cited explanation, see also: "optimal policies tend to take actions which strictly preserve optionality*", where the optionality preservation is rather strict (requiring a graphical similarity, and not just "there are more options this way than that"; ironically, this situation is considerably simpler in arbitrary deterministic computable environments, but that will be the topic of a future post).

Isn't the thing we condition on here similar (roughly speaking) to your interpretation of instrumental convergence?

No - the sufficient condition is about the environment, and instrumental convergence is about policies over that environment. I interpret instrumental convergence as "intelligent goal-directed agents tend to take certain kinds of actions"; this informal claim is necessarily vague. This is a formal sufficient condition which allows us to conclude that optimal goal-directed agents will tend to take a certain action in the given situation.

I think that using a simplicity prior over reward functions has a similar effect to "restricting to certain kinds of reward functions".

It certainly has some kind of effect, but I don't find it obvious that it has the effect you're seeking - there are many simple ways of specifying action-history+state reward functions, which rely on the action-history and not just the rest of the state.

Why is the action logger treated in your explanation as some privileged object? What's special about it relative to all the other stuff that's going on in our arbitrarily complex environment? If you imagine an MDP environment where the agent controls a robot in a room that has a security camera in it, and the recorded video is part of the state, then the recorded video is doing all the work that we need an action logger to do (for the purpose of my argument).

What's special is that (by assumption) the action logger always logs the agent's actions, even if the agent has been literally blown up in-universe. That wouldn't occur with the security camera. With the security camera, once the agent is dead, the agent can no longer influence the trajectory, and the normal death-avoiding arguments apply. But your action logger supernaturally writes a log of the agent's actions into the environment.

The reward function is a function over states (or state-action pairs) as usual, not state-action histories. My "unrolling trick" doesn't involve utility functions that are defined over state(-action) histories.

Right, but if you want the optimal policies to take actions $a_1, \ldots, a_k$, then write a reward function which returns 1 iff the action-logger begins with those actions and 0 otherwise. Therefore, it's extremely easy to incentivize arbitrary action sequences.
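A minimal sketch of such a reward function, assuming a logged state is represented as a pair (underlying state, tuple of actions taken so far). The representation and the name `make_prefix_reward` are illustrative assumptions, not something fixed by the discussion:

```python
from typing import Callable, Sequence, Tuple

# Assumed representation: (underlying state, action log so far).
LoggedState = Tuple[str, Tuple[str, ...]]

def make_prefix_reward(target: Sequence[str]) -> Callable[[LoggedState], float]:
    """Return a reward over logged states: 1 iff the log begins with `target`."""
    wanted = tuple(target)

    def reward(state: LoggedState) -> float:
        _, log = state
        # A log shorter than the target cannot begin with it, so it gets 0.
        return 1.0 if log[:len(wanted)] == wanted else 0.0

    return reward

reward = make_prefix_reward(["up", "left"])
```

The reward ignores the underlying state entirely, which is what makes it so easy to incentivize an arbitrary action sequence once the log is part of the state.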

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-06-02T18:39:54.686Z · LW · GW

(I continued this discussion with Adam in private - here are some thoughts for the public record)

• There is not really a subjective modeling decision involved because given an interface (state space and action space), the dynamics of the system are a real world property we can look for concretely.
• Claims about the encoding/modeling can be resolved thanks to power-seeking, which predicts what optimal policies are more likely to do. So with enough optimal policies, we can check the claim (like the "5-googolplex" one).

I think I'm claiming the first bullet. I am not claiming the second.

Or are you pointing out that with an architecture in mind, the state space and action space is fixed? I agree

Yes, that.

then it's a question of how the states of the actual systems are encoded in the state space of the agent, and that doesn't seem unique to me.

It doesn't have to be unique. We're predicting "for the agents we build, will optimal policies in their MDP models seek power?", and once you account for the environment dynamics, our beliefs about the agent architecture, and then our beliefs on the reward functions conditional on each architecture, this prediction has no subjective degrees of freedom.

I'm not claiming that there's One Architecture To Rule Them All. I'm saying that if we want to predict what happens, we:

1. Consider the underlying environment (assumed Markovian)
2. Consider different state/action encodings we might supply the agent.
3. For each, fix a reward function distribution (what goals we expect to assign to the agent)
4. See what my theory predicts.

There's a further claim (which seems plausible, but which I'm not yet making) that (2) won't affect (4) very much in practice. The point of this post is that if you say "the MDP has a different model", you're either disagreeing with (1) the actual dynamics, or claiming that we will physically supply the agent with a different state/action encoding (2).

But to falsify the "5 googolplex", you do need to know what the optimal policies tend to do, right? Then you need to find optimal policies and know what they do (to check that they indeed don't power-seek by going left). This means run/simulate them, which might cause them to take over the world in the worst case scenarios.

To falsify "5 googolplex", all you have to know is the dynamics + the agent's observation and action encodings. That determines the MDP structure. You don't have to run anything. (Although I suppose your proposed direction of inference is interesting: power-seeking tendencies + dynamics give you evidence about the encoding)

The encodings + environment dynamics tell you what model the agent is interfacing with, which allows you to apply my theorems as usual.
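As a rough illustration of "dynamics plus encodings determine the model", here is a sketch that builds the transition structure over encoded states by exploring the world dynamics, without running any policy. The function `induced_model`, the ring world, and both encodings are hypothetical toy choices made for this example:

```python
from collections import deque

def induced_model(step, encode, actions, start):
    """Explore world states breadth-first and record transitions between
    encoded states. Returns {(obs, action): set of next obs}."""
    model = {}
    seen = {start}
    queue = deque([start])
    while queue:
        world = queue.popleft()
        for action in actions:
            nxt = step(world, action)
            model.setdefault((encode(world), action), set()).add(encode(nxt))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return model

# Hypothetical toy world: four cells on a ring.
def ring_step(cell, action):
    return (cell + (1 if action == "right" else -1)) % 4

ACTIONS = ("left", "right")
fine = induced_model(ring_step, lambda cell: cell, ACTIONS, 0)
coarse = induced_model(ring_step, lambda cell: cell // 2, ACTIONS, 0)
fine_states = {obs for obs, _ in fine}
coarse_states = {obs for obs, _ in coarse}
```

The same dynamics yield a four-state model under the identity encoding and a two-state model (with set-valued transitions) under the coarse one, so a claim about how many states the agent's model has can be checked from the dynamics and the encoding alone.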

Comment by TurnTrout on What is the Risk of Long Covid after Vaccination? · 2021-06-02T16:02:25.098Z · LW · GW

There's the notorious study

Do you happen to have a link on hand?

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-06-01T21:33:50.791Z · LW · GW

Thanks for taking the time to write this out.

Regarding the theorems (in the POWER paper; I've now spent some time on the current version): The abstract of the paper says: "With respect to a class of neutral reward function distributions, we provide sufficient conditions for when optimal policies tend to seek power over the environment." I didn't find a description of those sufficient conditions (maybe I just missed it?).

I'm sorry - although I think I mentioned it in passing, I did not draw sufficient attention to the fact that I've been talking about a drastically broadened version of the paper, compared to what was on arxiv when you read it. The new version should be up in a few days. I feel really bad about this - especially since you took such care in reading the arxiv version!

The theorems hold for all finite MDPs in which the formal sufficient conditions are satisfied (i.e. the required environmental symmetries exist; see proposition 6.9, theorem 6.13, corollary 6.14). For practical advice, see subsection 6.3 and beginning of section 7.

(I shared the Overleaf with Ofer; if other lesswrong readers want to read without waiting for arxiv to update, message me! ETA: The updated version is now on arxiv.)

I further argue that we can take any MDP environment and "unroll" its state graph into a tree-with-constant-branching-factor (e.g. by adding an "action log" to the state representation) such that we get a "functionally equivalent" MDP in which the POWER (IID) of all the states are equal. My best guess is that you don't agree with this point, or think that the instrumental convergence thesis doesn't apply in a meaningful sense to such MDPs (but I don't yet understand why).

I agree that you can do that. I also think that instrumental convergence doesn't apply in such MDPs (as in, "most" goals over the environment won't incentivize any particular kind of optimal action), unless you restrict to certain kinds of reward functions.
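A small sketch of the unrolling, assuming a deterministic MDP given as a `(state, action) -> next state` table. The two-state "alive/off" environment and the name `unroll` are made up for illustration:

```python
def unroll(transitions, start, actions, depth):
    """Unroll a deterministic MDP into a tree keyed by action log.

    Returns {action_log: underlying_state}; every log is a distinct tree node.
    """
    nodes = {(): start}
    frontier = [()]
    for _ in range(depth):
        next_frontier = []
        for log in frontier:
            state = nodes[log]
            for action in actions:
                new_log = log + (action,)
                nodes[new_log] = transitions[(state, action)]
                next_frontier.append(new_log)
        frontier = next_frontier
    return nodes

# Hypothetical toy MDP: "halt" leads to an absorbing "off" state.
TRANSITIONS = {
    ("alive", "stay"): "alive",
    ("alive", "halt"): "off",
    ("off", "stay"): "off",
    ("off", "halt"): "off",
}
nodes = unroll(TRANSITIONS, "alive", ("stay", "halt"), depth=3)
```

The original MDP has two states, while the unrolled tree has 1 + 2 + 4 + 8 = 15 nodes at depth 3, each reachable by exactly one action sequence. Many of those nodes share the underlying state "off", which is where the extra expressiveness for reward functions comes from.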

Fix a reward function distribution $\mathcal{D}$ in the original MDP $M$. For simplicity, let's suppose $\mathcal{D}$ is max-ent (and thus IID). Let's suppose we agree that optimal policies under $\mathcal{D}$ tend to avoid getting shut off.

Translated to the rolled-out MDP $M'$, $\mathcal{D}$ no longer distributes reward uniformly over states. In fact, in its support, each reward function has the rather unusual property that its reward is only dependent on the current state, and not on the action log's contents. When translated into $M'$, $\mathcal{D}$ imposes heavy structural assumptions on its reward functions, and it's not max-ent over the states of $M'$. By the "functional equivalence", it still gives you the same optimality probabilities as before, and so it still tends to incentivize shutdown avoidance.

However, if you take a max-ent over the rolled-out states of $M'$, then this max-ent won't incentivize shutdown avoidance.
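The contrast can be checked numerically in a toy case. The sketch below is a simplification under stated assumptions, not the paper's setup: it uses a finite horizon `T` with undiscounted return instead of a discount rate, uniform IID reward on [0, 1], and a hypothetical four-state environment in which the first action either shuts the agent off or keeps it alive with a choice between two states afterwards.

```python
import random

T = 6  # assumed finite horizon (number of steps)

# Hypothetical deterministic MDP: NEXT[state] = (result of action 0, result of action 1).
NEXT = {
    "start": ("off", "a"),
    "a": ("a", "b"),
    "b": ("a", "b"),
    "off": ("off", "off"),
}

def shutdown_optimal_state_based(rng):
    """IID reward over the ORIGINAL states; is shutting off at step one optimal?"""
    R = {s: rng.random() for s in NEXT}
    V = {s: 0.0 for s in NEXT}  # optimal return with zero steps remaining
    for _ in range(T - 1):
        V = {s: max(R[n] + V[n] for n in NEXT[s]) for s in NEXT}
    q = [R[n] + V[n] for n in NEXT["start"]]
    return q[0] > q[1]

def best_tree_return(rng, depth):
    """Optimal return below a tree node when every node has its own IID reward."""
    if depth == 0:
        return 0.0
    return max(rng.random() + best_tree_return(rng, depth - 1) for _ in range(2))

def shutdown_optimal_log_based(rng):
    """IID reward over the ROLLED-OUT states (one reward per action log)."""
    # Each log is a distinct state, so the underlying state never enters the reward.
    q = [rng.random() + best_tree_return(rng, T - 1) for _ in range(2)]
    return q[0] > q[1]

def estimate(event, n=5000, seed=0):
    rng = random.Random(seed)
    return sum(event(rng) for _ in range(n)) / n

p_state = estimate(shutdown_optimal_state_based)
p_log = estimate(shutdown_optimal_log_based)
```

In this toy case, immediate shutdown is optimal for roughly 36% of state-based reward functions (the exact value works out to 13/36), but for about half of the reward functions drawn IID over action logs, since the two subtrees below the first action are then statistically identical.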

To see why, consider how absurdly expressive utility functions are when their domains are entire state-action histories. In Coherence arguments do not imply goal-directed behavior, Rohin Shah wrote:

Actually, no matter what the policy is, we can view the agent as an EU maximizer. The construction is simple: the agent can be thought as optimizing the utility function U, where U(h, a) = 1 if the policy would take action a given history h, else 0.

...

Consider the following examples:

• A robot that constantly twitches
• The agent that always chooses the action that starts with the letter “A”
• The agent that follows the policy <policy> where for every history the corresponding action in <policy> is generated randomly.

These are not goal-directed by my “definition”. However, they can all be modeled as expected utility maximizers

When defined over state-action histories, it's dead easy to write down objectives which don't pursue instrumental subgoals.
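The quoted construction is short enough to write out. This is a sketch with a made-up "twitching" policy; `utility_from_policy` and `twitch` are illustrative names only:

```python
def utility_from_policy(policy):
    """Build U(h, a) = 1 if `policy` would take action a after history h, else 0."""
    def U(history, action):
        return 1.0 if policy(history) == action else 0.0
    return U

def twitch(history):
    # Hypothetical "constantly twitching" policy: alternate two motor commands.
    return "left" if len(history) % 2 == 0 else "right"

U = utility_from_policy(twitch)
actions = ("left", "right", "stay")

# An agent that greedily maximizes U at each step reproduces the twitching policy.
history = ()
for _ in range(4):
    best = max(actions, key=lambda a: U(history, a))
    history += (best,)
```

Nothing about the twitching behavior looks goal-directed, yet it is exactly what maximizing this history-based utility function produces.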

However, how easy is it to write down state-based utility functions which do the same? I guess there's the one that maximally values dying. What else? While more examples probably exist, it seems clear that they're much harder to come by.

And so when your reward depends on your action history, this is strictly more expressive than state-based reward - so expressive that it becomes easy to directly incentivize any sequence of actions via the reward function. And thus, instrumental convergence disappears for "most objectives."

However, from our perspective, we still have a distribution over goals we might want to give the agent. And these goals are generally very structured - they aren't just randomly selected preferences over action-histories+current state. So we should still expect instrumental convergence to exist empirically (at a first approximation, perhaps via a simplicity prior over reward functions/utility functions). It just doesn't exist for most "unstructured" distributions in the unrolled environment.

The first state has the largest POWER (IID), but for most reward functions the optimal policy is to immediately transition to a lower-POWER state (even in the limit as $\gamma$ approaches 1).

Note that the RSD optimality probability theorem (Theorem 6.13) applies here, and it correctly predicts that when $\gamma \approx 1$, most reward functions incentivize navigating to the larger set of 1-cycles (the 4 below the high-POWER state). As I explain in section 6.3, section 7, and appendix B of the new paper, you have to be careful in applying Thm 6.13, because

The paper says: "Theorem 6.6 shows it’s always robustly instrumental and power-seeking to take actions which allow strictly more control over the future (in a graphical sense)." I don't yet understand the theorem, but is there somewhere a description of the set/distribution of MDP transition functions for which that statement applies? (Specifically, the "always robustly instrumental" part, which doesn't seem to hold in the example above.)

Yeah, I'm aware of this kind of situation. I think that that sentence from the paper was poorly worded. In the new version, I'm more careful to emphasize the environmental symmetries which are sufficient to conclude power-seeking:

Some researchers speculate that intelligent reinforcement learning agents would be incentivized to seek resources and power in pursuit of their objectives. Other researchers are skeptical, because human-like power-seeking instincts need not be present in RL agents. To clarify this debate, we develop the first formal theory of the statistical tendencies of optimal policies in reinforcement learning. In the context of Markov decision processes, we prove that certain environmental symmetries are sufficient for optimal policies to tend to seek power over the environment. These symmetries exist in many environments in which the agent can be shut down or destroyed. We prove that for most prior beliefs one might have about the agent’s reward function (including as a special case the situations where the reward function is known), one should expect optimal policies to seek power in these environments. These policies seek power by keeping a range of options available and, when the discount rate is sufficiently close to 1, by navigating towards larger sets of potential terminal states.

See appendix B of the new paper for an example similar to yours, referenced by subsection 6.3 ("how to reason about other environments").

Are you referring here to POWER when it is defined over a reward distribution that corresponds to some simplicity prior?

Yup! POWER depends on the reward distribution; if you want to reason formally about a simplicity prior, plug it into POWER.
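To make "plug it into POWER" concrete, here is a Monte Carlo sketch. It assumes the paper's definition as I understand it, $\text{POWER}_{\mathcal{D}}(s, \gamma) = \frac{1-\gamma}{\gamma}\,\mathbb{E}_{R \sim \mathcal{D}}\left[V^*_R(s, \gamma) - R(s)\right]$, a deterministic MDP, and a hypothetical toy environment; the reward distribution is passed in as a sampler, so any prior can be substituted:

```python
import random

def optimal_values(next_states, reward, gamma, iters=150):
    """Value iteration for a deterministic MDP with state-based reward."""
    V = {s: 0.0 for s in next_states}
    for _ in range(iters):
        V = {s: reward[s] + gamma * max(V[n] for n in next_states[s])
             for s in next_states}
    return V

def estimate_power(next_states, sample_reward, gamma, n=2000, seed=0):
    """Monte Carlo estimate of (1 - gamma) / gamma * E[V*(s) - R(s)] per state."""
    rng = random.Random(seed)
    totals = {s: 0.0 for s in next_states}
    for _ in range(n):
        R = sample_reward(rng, next_states)
        V = optimal_values(next_states, R, gamma)
        for s in next_states:
            totals[s] += V[s] - R[s]
    return {s: (1 - gamma) / gamma * totals[s] / n for s in next_states}

# Hypothetical toy environment: "hub" reaches three absorbing states, "narrow" one.
TOY = {
    "hub": ("x", "y", "z"),
    "narrow": ("x",),
    "x": ("x",),
    "y": ("y",),
    "z": ("z",),
}

def iid_uniform(rng, states):
    """Max-ent (IID uniform on [0, 1]) reward over states."""
    return {s: rng.random() for s in states}

power = estimate_power(TOY, iid_uniform, gamma=0.9)
```

Under the IID-uniform sampler the estimate for `hub` should land near 0.75 (the expected maximum of three uniform draws) and for `narrow` near 0.5. Replacing `iid_uniform` with a sampler for a simplicity prior would give the corresponding POWER values for that prior.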

My argument is just that in MDPs where the state graph is a tree-with-a-constant-branching-factor—which is plausible in very complex environments—POWER (IID) is equal in all states. The argument doesn't mention description length (the description length concept arose in this thread in the context of discussing what reward function distribution should be used for defining instrumental convergence).

Right, okay, I agree with that. I think we agree about how POWER works here, but disagree about the link between optimality probability-wrt-a-distribution, and instrumental convergence.

If so, I argue that claim doesn't make sense: you can take any formal environment, however large and complex, and just add to it a simple "action logger" (that doesn't influence anything, other than effectively adding to the state representation a log of all the actions so far). If the action space is constant, the state graph of the modified MDP is a tree-with-a-constant-branching-factor; which would imply that adding that action logger somehow destroyed the applicability of the instrumental convergence thesis to that MDP; which doesn't make sense to me.

Yeah, I think that wrt the action-logger-MDP, instrumental convergence doesn't exist for goals over the new action-logger-MDP. See the earlier part of this comment.

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-05-30T15:35:50.587Z · LW · GW

I don't understand your point in this exchange. I was being specific about my usage of model; I meant what I said in the original post, although I noted room for potential confusion in my comment above. However, I don't know how you're using the word.

I don’t use the term model in my previous reply anyway.

You used the word 'model' in both of your prior comments, and so the search-replace yields "state-abstraction-irrelevant abstractions." Presumably not what you meant?

I already pointed out a concrete difference: I claim it’s reasonable to say there are three alternatives while you claim there are two alternatives.

That's not a "concrete difference." I don't know what you mean when you talk about this "third alternative." You think you have some knockdown argument - that much is clear - but it seems to me like you're talking about a different consideration entirely. I likewise feel an urge to disengage, but if you're interested in explaining your idea at some point, message me and we can set up a higher-bandwidth call.

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-05-29T22:38:31.112Z · LW · GW

I read your formalism, but I didn't understand what prompted you to write it. I don't yet see the connection to my claims.

If so, I might try to formalize it.

Yeah, I don't want you to spend too much time on a bulletproof grounding of your argument, because I'm not yet convinced we're talking about the same thing.

In particular, if the argument's like, "we usually express reward functions in some featurized or abstracted way, and it's not clear how the abstraction will interact with your theorems" / "we often use different abstractions to express different task objectives", then that's something I've been thinking about but not what I'm covering here. I'm not considering practical expressibility issues over the encoded MDP: ("That's also a claim that we can, in theory, specify reward functions which distinguish between 5 googolplex variants of red-ghost-game-over.")

If this doesn't answer your objection - can you give me an English description of a situation where the objection holds? (Let's taboo 'model', because it's overloaded in this context)

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-05-29T14:54:22.896Z · LW · GW

say we agree that our state abstraction needs to be model-irrelevant

Why would we need that, and what is the motivation for "models"? The moment we give the agent sensors and actions, we're done specifying the rewardless MDP (and its model).

ETA: potential confusion - in some MDP theory, the “model” is a model of the environment dynamics. E.g., in deterministic environments, the model is shown with a directed graph. I don’t use “model” to refer to an agent’s world model over which it may have an objective function. I should have chosen a better word, or clarified the distinction.

a priori there should be skepticism that all tasks can be modeled with a specific state-abstraction.

If, by "tasks", you mean "different agent deployment scenarios" - I'm not claiming that. I'm saying that if we want to predict what happens, we:

1. Consider the underlying environment (assumed Markovian)
2. Consider different state/action encodings we might supply the agent.
3. For each, fix a reward function distribution (what goals we expect to assign to the agent)
4. See what the theory predicts.

There's a further claim (which seems plausible, but which I'm not yet making) that (2) won't affect (4) very much in practice. The point of this post is that if you say "the MDP has a different model", you're either disagreeing with (1) the actual dynamics, or claiming that we will physically supply the agent with a different state/action encoding (2).

I'd suspect this does generalize into a fragility/impossibility result any time the reward is given to the agent in a way that's decoupled from the agent's sensors which is really going to be the prominent case in practice. In conclusion, you can try to work with a variable/rewardless MDP, but then this argument will apply and severely limit the usefulness of the generic theoretical analysis.

I don't follow. Can you give a concrete example?

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-05-28T23:58:46.906Z · LW · GW

I'm not trying to define here the set of reward functions over which instrumental convergence arguments apply (they obviously don't apply to all reward functions, as for every possible policy you can design a reward function for which that policy is optimal).

ETA: I agree with this point in the main - they don't apply to all reward functions. But, we should be able to ground the instrumental convergence arguments via reward functions in some way. Edited out because I read through that part of your comment a little too fast, and replied to something you didn't say.

Shutting down the process doesn't mean that new strings won't appear in the environment and cause the state graph to become a tree-with-constant-branching-factor due to complex physical dynamics.

What does it mean to "shut down" the process? 'Doesn't mean they won't' - so new strings will appear in the environment? Then how was the agent "shut down"?

[EDIT 2: I think this miscommunication is my fault, due to me writing in my first comment: "the state representation may be uniquely determined by all the text that was written so far by both the customer and the chatbot", sorry for that.]

For every subset of branches in the tree you can design a reward function for which every optimal policy tries to go down those branches; I'm not saying anything about "most reward functions". I would focus on statements that apply to "most reward functions" if we dealt with an AI that had a reward function that was sampled uniformly from all possible reward functions. But that scenario does not seem relevant (in particular, something like Occam's razor seems relevant: our prior credence should be larger for reward functions with shorter shortest-description).

We're considering description length? Now it's not clear that my theory disagrees with your prediction, then. If you say we have a simplicity prior over reward functions given some encoding, well, POWER and optimality probability now reflect your claims, and they now say there is instrumental convergence to the extent that that exists under a simplicity prior? (I still don't think it would exist; my theory shows that in the space of all possible reward function distributions, equal proportions incentivize action A over action B as vice versa - we aren't just talking about uniform. So the onus is on you to provide the sense in which instrumental convergence exists here.)

And to the extent we were always considering description length - was the problem that IID-optimality probability doesn't reflect simplicity-weighted behavioral tendencies?

The non-formal definition in Bostrom's Superintelligence (which does not specify a set of reward functions but rather says "a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents.").

I still don't know what it would mean for Ofer-instrumental convergence to exist in this environment, or not.

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-05-28T15:58:48.754Z · LW · GW

Setting aside the "arbitrary" part, because I didn't talk about an arbitrary reward function…

To clarify: when I say that taking over the world is "instrumentally convergent", I mean that most objectives incentivize it. If you mean something else, please tell me. (I'm starting to think there must be a serious miscommunication somewhere if we're still disagreeing about this?)

So we can't set the 'arbitrary' part aside - instrumentally convergent means that the incentives apply across most reward functions - not just for one. You're arguing that one reward function might have that incentive. But why would most goals tend to have that incentive?

to prevent interferences and to seize control ASAP...

minimizing low probability risks (from the perspective of the agent)

This doesn't make sense to me. We assumed the agent is Cartesian-separated from the universe, and its actions magically make strings appear somewhere in the world. How could humans interfere with it? What, concretely, are the "risks" faced by the agent?

for the purpose of maximizing whatever counts as the total discounted payments by the customer

(Technically, the agent's goals are defined over the text-state, and you can assign high reward to text-states in which people bought stuff. But the agent doesn't actually have goals over the physical world as we generally imagine them specified.)

If such a string (one that causes the above scenario) exists, then any optimal policy will either involve such a string or different strings that allow at least as much expected return.

This statement is vacuous, because it's true about any possible string.

----

The original argument given for instrumental convergence and power-seeking is that gaining resources tends to be helpful for most objectives (this argument isn't valid in general, but set that aside for now). But even that's not true here. The problem is that the 'text-string-world' model is framed in a leading way, which is suggestive of the usual power-seeking setting (it's representing the real world and it's complicated, there must be instrumental convergence), even though it's structurally a whole different beast.

Objective functions induce preferences over text-states (with a "what's the world look like?" tacked on). The text-state the agent ends up in is, by your assumption, determined by the text output of the agent. Nothing which happens in the world expands or restricts the agent's ability to output text. So there's no particular reason for optimal policies to tend to output strings that induce text-histories in which the world contains a disempowered human civilization.

Another way to realize that optimal policies don't have this tendency is that optimal policy tendencies are invariant to model isomorphism, and, again, this environment is literally isomorphic to

a sequential string output MDP, where the agent just puts a string in slot t at time t.

If it were true that optimal agents tend to "take over the world" in the 'real-world' model, then it would be true in the sequential string output model, which is absurd.

I know I've said this several times, but this is a knock-down argument, and you haven't engaged with it. If you take a piece of paper and draw out a model for the following environment - it will be a regular tree:

Let's side-step those issues by not having a computer running the agent inside the environment, but rather having the text string that the agent chooses in each time step magically appear somewhere [fixed] in the environment. The question is now whether it's possible to get to the same state with two different sequences of strings. This depends on the state representation & state transition function; it can be the case that the state is uniquely determined by the agent's sequence of past strings so far, which will mean POWER being equal in all states.

You may already know that, because you quickly pointed out that POWER is constant. But then why do you claim that most reward functions are attracted to certain branches of the tree, given that regularity? And if you aren't claiming that, what do you mean by instrumental convergence?

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-05-27T13:44:10.440Z · LW · GW

Yeah, I claim that this intuition is actually wrong and there's no instrumental convergence in this environment. Complicated & contains actors doesn't mean you can automatically conclude instrumental convergence. The structure of the environment is what matters for "arbitrarily capable agents"/optimal policies (learned policies are probably more dependent on representation and training process).

So if you disagree, please explain why arbitrary reward functions tend to incentivize outputting one string sequence over another? Because, again, this environment is literally isomorphic to

a sequential string output MDP, where the agent just puts a string in slot t at time t.

What I think you're missing is that the environment can't affect the agent's capabilities or available actions; it can't gain or lose power, just freely steer through different trajectories.

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-05-26T21:54:41.138Z · LW · GW

For that particular reward function, yes, the optimal policies may be very complicated. But why are there instrumentally convergent goals in that environment? Why should I expect capable agents in that environment to tend to output certain kinds of string sequences, over other kinds of string sequences?

(Also, is the amount of money paid by the client part of the state? Or is the agent just getting rewarded for the total number of purchase-assents in the conversation over time?)

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-05-26T20:37:13.393Z · LW · GW

Though it involves the unresolved (for me) embedded agency issues.

Right, that does complicate things. I'd like to get a better picture of the considerations here, but given how POWER behaves on environment structures so far, I'm pretty confident it'll adapt to appropriate ways of modelling the situation.

Let's side-step those issues by not having a computer running the agent inside the environment, but rather having the text string that the agent chooses in each time step magically appear somewhere in the environment. The question is now whether it's possible to get to the same state with two different sequences of strings. This depends on the state representation & state transition function; it can be the case that the state is uniquely determined by the agent's sequence of past strings so far, which will mean POWER being equal in all states.

OK, but now that seems okay again, because there isn't any instrumental convergence here either. This is just an alternate representation ('reskin') of a sequential string output MDP, where the agent just puts a string in slot t at time t.
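A sketch of that sequential string output MDP, assuming a small fixed set of strings the agent can emit (the vocabulary here is made up). The point it illustrates is that the state is just the tuple of strings written so far, so distinct output sequences always reach distinct states:

```python
from itertools import product

def slot_step(state, string):
    """Put `string` in the next free slot; the state is the tuple of strings so far."""
    return state + (string,)

def reachable_states(vocabulary, depth):
    """All states reachable after exactly `depth` outputs."""
    states = set()
    for outputs in product(vocabulary, repeat=depth):
        state = ()
        for string in outputs:
            state = slot_step(state, string)
        states.add(state)
    return states

VOCAB = ("hello", "buy now", "goodbye")  # hypothetical action set
depth3 = reachable_states(VOCAB, 3)
```

With three possible strings there are 3^3 = 27 states at depth 3, one per output sequence, which is the regular-tree structure under discussion: no output ever expands or restricts what the agent can output next.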

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-05-26T19:12:08.399Z · LW · GW

To further clarify:

• pure-text-interaction-MDP: generated by the mentioned state and action representation, with environment dynamics allowing the agent to talk to a customer.
• Since you said that the induced model is regular, this implies that the agent won't get shut down for saying bad/weird things. If it could, then the graph is no longer regular under the previous state and action representations.
• The agent also isn't concerned with real-world resources, because it isn't modelling them. They aren't observable and they don't affect transition probabilities.

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-05-26T18:48:02.263Z · LW · GW

The instrumental convergence thesis is not a fact about every situation involving "capable AI", but a thesis pointing out a reliable-seeming pattern across environments and goals. It can't be used as a black-box reason on its own - you have to argue why the reasoning applies in the environment. In particular, we assumed that the agent is interacting with the text MDP, where

the state representation [is] uniquely determined by all the text that was written so far by both the customer and the chatbot [, and the chat doesn't end when the customer leaves / stops talking].

Optimal policies do not have particular tendencies in this model. There's nothing more "capable" than an optimal policy. Which is to say, optimal policies for the actual text interaction MDP do not exhibit instrumental convergence (which says nothing about learned optimizer risks, etc).

But you seem to be secretly switching from the pure-text-interaction-MDP to a real-world-modelling-MDP, and then saying that POWER in the former doesn't correspond to POWER in the latter. Well, that's no big surprise. The real world MDP model is no longer modelling just the text interaction, but also the broader environment, which violates the very representation assumption which led to your "IID-POWER equality" conclusion.

And if you update the encodings and dynamics to account for real-world resource gain possibilities, then POWER and optimality probability will update accordingly and appropriately.

However, if you meant for the environment dynamics to originally include possibilities like "the agent can get shut off, or interfered with", then the model is no longer regular in the way you mentioned, and IID-POWER is no longer equal across states.

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-05-26T15:05:14.550Z · LW · GW

the choice of the state representation and action space may determine whether a problem is like that.

I agree. Also: the state and action representations determine which reward functions we can express, and I claim that it makes sense for the theory to reflect that fact.

If so POWER—when defined over an IID-over-states reward distribution—is constant.

Agreed. I also don't currently see a problem here. There aren't any robustly instrumental goals in this setting, as best I can tell.

Comment by TurnTrout on MDP models are determined by the agent architecture and the environmental dynamics · 2021-05-26T13:05:21.241Z · LW · GW

I'm wondering whether I properly communicated my point. Would you be so kind as to summarize my argument as best you understand it?

But that just means the subjectivity comes from the choice of the interface!

There's no subjectivity? The interface is determined by the agent architecture we use, which is an empirical question.

Sure, but if you actually have to check the power-seeking to infer the structure of the MDP, it becomes unusable for not building power-seeking AGIs. Or put differently, the value of your formalization of power-seeking IMO is that we can start from the models of the world and think about which actions/agent would be power-seeking and for which rewards. If I actually have to run the optimal agents to find out about power-seeking actions, then that doesn't help.

You don't have to run anything to check power-seeking. Once you know the agent encodings, the rest is determined and my theory makes predictions.

Comment by TurnTrout on Draft report on existential risk from power-seeking AI · 2021-05-26T00:19:40.538Z · LW · GW

I think using a well-chosen reward distribution is necessary, otherwise POWER depends on arbitrary choices in the design of the MDP's state graph. E.g. suppose the student in the above example writes about every action they take in a blog that no one reads, and we choose to include the content of the blog as part of the MDP state. This arbitrary choice effectively unrolls the state graph into a tree with a constant branching factor (+ self-loops in the terminal states) and we get that the POWER of all the states is equal.

I replied to this point with a short post.