## Posts

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives 2021-08-09T17:22:24.056Z
Seeking Power is Convergently Instrumental in a Broad Class of Environments 2021-08-08T02:02:18.975Z
The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies 2021-07-11T17:36:24.208Z
A world in which the alignment problem seems lower-stakes 2021-07-08T02:31:03.674Z
Environmental Structure Can Cause Instrumental Convergence 2021-06-22T22:26:03.120Z
Open problem: how can we quantify player alignment in 2x2 normal-form games? 2021-06-16T02:09:42.403Z
Game-theoretic Alignment in terms of Attainable Utility 2021-06-08T12:36:07.156Z
Conservative Agency with Multiple Stakeholders 2021-06-08T00:30:52.672Z
MDP models are determined by the agent architecture and the environmental dynamics 2021-05-26T00:14:00.699Z
Generalizing POWER to multi-agent games 2021-03-22T02:41:44.763Z
Lessons I've Learned from Self-Teaching 2021-01-23T19:00:55.559Z
Review of 'Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More' 2021-01-12T03:57:06.655Z
Review of 'But exactly how complex and fragile?' 2021-01-06T18:39:03.521Z
Collider bias as a cognitive blindspot? 2020-12-30T02:39:35.700Z
2019 Review Rewrite: Seeking Power is Often Robustly Instrumental in MDPs 2020-12-23T17:16:10.174Z
Avoiding Side Effects in Complex Environments 2020-12-12T00:34:54.126Z
Is it safe to spend time with people who already recovered from COVID? 2020-12-02T22:06:13.469Z
Non-Obstruction: A Simple Concept Motivating Corrigibility 2020-11-21T19:35:40.445Z
Math That Clicks: Look for Two-Way Correspondences 2020-10-02T01:22:18.177Z
Power as Easily Exploitable Opportunities 2020-08-01T02:14:27.474Z
Generalizing the Power-Seeking Theorems 2020-07-27T00:28:25.677Z
GPT-3 Gems 2020-07-23T00:46:36.815Z
To what extent is GPT-3 capable of reasoning? 2020-07-20T17:10:50.265Z
What counts as defection? 2020-07-12T22:03:39.261Z
Corrigibility as outside view 2020-05-08T21:56:17.548Z
How should potential AI alignment researchers gauge whether the field is right for them? 2020-05-06T12:24:31.022Z
Insights from Euclid's 'Elements' 2020-05-04T15:45:30.711Z
Problem relaxation as a tactic 2020-04-22T23:44:42.398Z
A Kernel of Truth: Insights from 'A Friendly Approach to Functional Analysis' 2020-04-04T03:38:56.537Z
Research on repurposing filter products for masks? 2020-04-03T16:32:21.436Z
ODE to Joy: Insights from 'A First Course in Ordinary Differential Equations' 2020-03-25T20:03:39.590Z
Conclusion to 'Reframing Impact' 2020-02-28T16:05:40.656Z
Reasons for Excitement about Impact of Impact Measure Research 2020-02-27T21:42:18.903Z
Attainable Utility Preservation: Scaling to Superhuman 2020-02-27T00:52:49.970Z
How Low Should Fruit Hang Before We Pick It? 2020-02-25T02:08:52.630Z
Continuous Improvement: Insights from 'Topology' 2020-02-22T21:58:01.584Z
Attainable Utility Preservation: Empirical Results 2020-02-22T00:38:38.282Z
Attainable Utility Preservation: Concepts 2020-02-17T05:20:09.567Z
The Catastrophic Convergence Conjecture 2020-02-14T21:16:59.281Z
Attainable Utility Landscape: How The World Is Changed 2020-02-10T00:58:01.453Z
Does there exist an AGI-level parameter setting for modern DRL architectures? 2020-02-09T05:09:55.012Z
AI Alignment Corvallis Weekly Info 2020-01-26T21:24:22.370Z
On Being Robust 2020-01-10T03:51:28.185Z
Judgment Day: Insights from 'Judgment in Managerial Decision Making' 2019-12-29T18:03:28.352Z
Can fear of the dark bias us more generally? 2019-12-22T22:09:42.239Z
Clarifying Power-Seeking and Instrumental Convergence 2019-12-20T19:59:32.793Z
Seeking Power is Often Convergently Instrumental in MDPs 2019-12-05T02:33:34.321Z
How I do research 2019-11-19T20:31:16.832Z
Thoughts on "Human-Compatible" 2019-10-10T05:24:31.689Z
The Gears of Impact 2019-10-07T14:44:51.212Z

Comment by TurnTrout on [Review] "The Alignment Problem" by Brian Christian · 2021-09-20T16:05:58.900Z · LW · GW

Black people are arrested for marijuana usage much more frequently than black people.

Presumably the second usage of “Black” is supposed to be “White.”

Comment by TurnTrout on I wanted to interview Eliezer Yudkowsky but he's busy so I simulated him instead · 2021-09-18T07:19:46.598Z · LW · GW

Replicated.

You are browsing LessWrong when you come across an interesting comment.

COMMENT by Wei_Dai:

"PSA: If you leave too much writings publicly visible on the Internet, random people in the future will be able to instantiate simulations of you, for benign or nefarious purposes. It's already too late for some of us (nobody warned us about this even though it should have been foreseeable many years ago) but the rest of you can now make a more informed choice.

(Perhaps I never commented on this post IRL, and am now experiencing what I'm experiencing because someone asked their AI, "I wonder how Wei Dai would have replied to this post.")"

"I'm not sure how to respond to this. I mean, I definitely think it's more likely than not that I'm an AI simulation, but if I'm not, the first thing I would do is delete my post on this thread, so I don't see how it's helpful to tell me this."

Comment by TurnTrout on I wanted to interview Eliezer Yudkowsky but he's busy so I simulated him instead · 2021-09-16T18:03:36.766Z · LW · GW

This is disturbingly good. I had to remind myself that this was fake.

Comment by TurnTrout on Agents Over Cartesian World Models · 2021-09-15T18:46:10.656Z · LW · GW

In contrast, humans map multiple observations onto the same internal state.

Is this supposed to say something like: "Humans can map a single observation onto different internal states, depending on their previous internal state"?

$U_{{E}}(e, t) = \text{the number of paperclips in }$.

Unrendered latex.

For HCH-bot, what's the motivation? If we can compute the KL, we can compute HCH(i), so why not just use HCH(i) instead? Or is this just exploring a potential approximation?

A consequential approval-maximizing agent takes the action that gets the highest approval from a human overseer. Such agents have an incentive to tamper with their reward channels, e.g., by persuading the human they are conscious and deserve reward.

Why does this incentive exist? Approval-maximizers take the local action which the human would rate most highly. Are we including "long speech about why human should give high approval to me because I'm suffering" as an action? I guess there's a trade-off here, where limiting to word-level output demands too much lookahead coherence of the human, while long sentences run the risk of incentivizing reward tampering. Is that the reason you had in mind?

If the agent can act to leave itself unchanged, loops of the same sequences of internal states rule out utility functions of type . Similarly, loops of the same (internal state, action) pairs rule out utility functions of type  and . Finally, if the agent ever takes different actions, we can rule out a utility function of type  (assuming the action space is not changing).

This argument doesn't seem to work, because the zero utility function makes everything optimal. VNM theorem can't rule that out given just an observed trajectory. However, if you know the agent's set of optimal policies, then you can start ruling out possibilities (because I can't have purely environment-based utility if  | internal state 1, but  | internal state 2).

Comment by TurnTrout on TurnTrout's shortform feed · 2021-09-15T02:01:28.436Z · LW · GW

Idea: Expert prediction markets on predictions made by theories in the field, with \$ for being a good predictor and lots of \$ for designing and running a later-replicated experiment whose result the expert community strongly anti-predicted. Lots of problems with the plan, but surprisal-based compensation seems interesting and I haven't heard about it before.
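As a minimal sketch of what surprisal-based compensation could look like: pay the experimenter in proportion to the surprisal, $-\log_2 p$, of the replicated result, where $p$ is the probability the expert market assigned to that result beforehand. The function names and the `dollars_per_bit` rate are hypothetical and only illustrate the idea.

```python
import math


def surprisal_bits(predicted_probability: float) -> float:
    """Surprisal (in bits) of an outcome the experts assigned this probability."""
    if not 0.0 < predicted_probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(predicted_probability)


def experiment_payout(predicted_probability: float,
                      dollars_per_bit: float = 1000.0) -> float:
    """Hypothetical payout rule: reward grows with how strongly the
    community anti-predicted the (later-replicated) result."""
    return dollars_per_bit * surprisal_bits(predicted_probability)
```

Under this rule, a result the community gave a 1% chance pays far more than one it considered a coin flip, which is the incentive the idea is after.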

Comment by TurnTrout on How do you decide when to change N95/FFP-2 masks? · 2021-09-10T18:56:28.968Z · LW · GW

Is this a big deal, given that COVID doesn't really spread via fomites? Or is the concern: The virus particles will be so close to your mouth that you'll just inhale them off of the mask?

Comment by TurnTrout on The alignment problem in different capability regimes · 2021-09-10T00:33:12.783Z · LW · GW

Other examples of problems that people sometimes call alignment problems that aren’t a problem in the limit of competence: avoiding negative side effects, safe exploration...

I don't understand why you think that negative side effect avoidance belongs on that list.

A sufficiently intelligent system will probably be able to figure out when it's having negative side effects. This does not mean that it will—as a matter of fact—avoid having these side effects, and it does not mean that its NegativeSideEffect predicate is accessible. A paperclip maximizer may realize that humans consider extinction to be a "negative side effect." This consideration does not move it. Increasing agent intelligence does not naturally solve the problem of getting the agent to not do catastrophically impactful things while optimizing its objective.

In contrast, once an agent realizes that an exploration strategy is unsafe, the agent will be instrumentally motivated to find a better one. Increasing agent intelligence naturally solves the problem of safe exploration.

it will massively outperform humans on writing ethics papers or highly upvoted r/AmItheAsshole comments.

Presumably you meant to say "it will be able to massively outperform..."? (I think you did, since you mention a similar consideration under "Ability to understand itself.") A competent agent will understand, but will only act accordingly if so aligned (for either instrumental or terminal reasons).

Comment by TurnTrout on GPT-3 Gems · 2021-09-07T18:18:47.877Z · LW · GW

## GPT-3 Solves Alignment

Nature: Eliezer Yudkowsky and Stuart Russell solve AI alignment with breakthrough insight. This October, Eliezer and Stuart sat down to consider one of the most pressing technical challenges confronting humanity: How to ensure that superhuman AI is aligned with human interests. That's when they had their big insight: The alignment problem is a math problem.

In the past, Eliezer and Stuart had been thinking about the alignment problem in terms of probability theory. But probability theory isn't powerful enough to fully capture the nuances of human values. Probability theory is too coarse-grained to distinguish between a universe where humans are eaten by paperclips and a universe where humans are eaten by paperclips and everyone has a good time.

So they turned to a more powerful tool: decision theory, which underlies game theory and has been used to analyze everything from voting systems to how to play poker. Decision theory is more nuanced than probability theory, but it's also more complicated. It's not just harder for humans to grok; it's harder for computers too. So the first step was just getting decision theory into AI algorithms.

The next step was figuring out how to use decision theory to solve the alignment problem. They started by defining a reward function that would tell an AI what we want it to do. Then they set up a decision tree showing all the possible ways an AI could behave, with each branch corresponding to a different possible reward function. The goal was then to find the path that maximizes our reward under any possible future circumstance—a path that would ensure that an AI does what we want no matter what happens in the future, whether it's created by us or someone else, whether it has two neurons or two hundred billion neurons, whether it loves us or hates us or feels nothing at all about us one way or another…or even if there are no humans left on Earth at all!

But wait—how can you have an algorithm without knowing what reward function you're trying to maximize? That's like trying to find your way home without knowing which way you're facing! And yet this is exactly what Stuart and Eliezer did: They took this giant pile of unknowns—all these potential reward functions—and fed them into their decision-theoretic machine learning system as input variables…and then they let their system figure out which reward function was most likely! And when they were done, they found that their system had settled on one particular definition of human values: It was something like "human values are whatever maximizes humanity's future potential." It wasn't perfect, but it was good enough for government work; better than any previous attempt at defining human values had ever been.

And this is where they stopped. This is where they stopped and thought, "Wow, we've done it! We've solved the alignment problem!" And then they went home and slept soundly, happy in the knowledge that humanity's future was secure.

But…that's not how it happened at all. That's not how it happened at all. Because when Eliezer and Stuart had their big breakthrough, I was sitting right there with them, listening to every word. And I know what really happened.

What really happened was that Stuart and Eliezer worked on AI alignment for another decade or so before giving up in frustration. They worked on AI alignment until their hair turned gray and their teeth fell out, until their eyesight failed and their joints became arthritic from sitting at a computer for too many hours a day, until they were so old that nobody would publish their papers anymore because nobody takes old people seriously anymore. And then they died of natural causes before ever solving the alignment problem—and the world was left with no way to align AI with human values whatsoever.

Comment by TurnTrout on How To Write Quickly While Maintaining Epistemic Rigor · 2021-09-03T17:02:52.453Z · LW · GW

I think a norm of “somewhat comfortable by default; solicit maximally frank feedback with an end-of-post request” might be good? It may be easier to say “please be harsher on my claims” than “please be courteous with me.”

Comment by TurnTrout on How To Write Quickly While Maintaining Epistemic Rigor · 2021-09-03T04:13:52.640Z · LW · GW

This is good and bad, but LessWrong's advantage is in being different, not comfortable.

Personally: If LessWrong is not comfortable for me to post on, I won't post. And, in fact, my post volume has decreased somewhat because of that. That's just how my brain is wired, it seems.

Comment by TurnTrout on Finite Factored Sets: Applications · 2021-09-01T00:40:11.108Z · LW · GW

Throughout this sequence, we have assumed finiteness fairly gratuitously. It is likely that many of the results can be extended to arbitrary finite sets.

To arbitrary factored sets?

Comment by TurnTrout on When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives · 2021-08-22T18:34:08.144Z · LW · GW

Thanks! I think you're right. I think I actually should have defined  differently, because writing it out, it isn't what I want. Having written out a small example, intuitively,  should hold iff , which will also induce  as we want.

I'm not quite sure what the error was in the original proof of Lemma 3; I think it may be how I converted to and interpreted the vector representation. Probably it's more natural to represent  as , which makes your insight obvious.

The post is edited and the issues should now be fixed.

Comment by TurnTrout on Environmental Structure Can Cause Instrumental Convergence · 2021-08-16T16:47:17.348Z · LW · GW

I’m not assuming that they incentivize anything. They just do! Here’s the proof sketch (for the full proof, you’d subtract a constant vector from each set, but not relevant for the intuition).

You’re playing a tad fast and loose with your involution argument. Unlike the average-optimal case, you can’t just map one set of states to another for all-discount-rates reasoning.

Comment by TurnTrout on Power-seeking for successive choices · 2021-08-16T16:26:27.618Z · LW · GW

For (3), environments which "almost" have the right symmetries should also "almost" obey the theorems. To give a quick, non-legible sketch of my reasoning:

For the uniform distribution over reward functions on the unit hypercube (), optimality probability should be Lipschitz continuous on the available state visit distributions (in some appropriate sense). Then if the theorems are "almost" obeyed, instrumentally convergent actions still should have extremely high probability, and so most of the orbits still have to agree.

So I don't currently view (3) as a huge deal. I'll probably talk more about that another time.

Comment by TurnTrout on Environmental Structure Can Cause Instrumental Convergence · 2021-08-13T21:26:25.018Z · LW · GW

Gotcha. I see where you're coming from.

I think I underspecified the scenario and claim. The claim wasn't supposed to be: most agents never break the vase (although this is sometimes true). The claim should be: most agents will not immediately break the vase.

If the agent has a choice between one action ("break vase and move forwards") or another action ("don't break vase and move forwards"), and these actions lead to similar subgraphs, then at all discount rates, optimal policies will tend to not break the vase immediately. But they might tend to break it eventually, depending on the granularity and balance of final states.

So I think we're actually both making a correct point, but you're making an argument for  under certain kinds of models and whether the agent will eventually break the vase. I (meant to) discuss the immediate break-it-or-not decision in terms of option preservation at all discount rates.

[Edited to reflect the ancestor comments]

Comment by TurnTrout on Environmental Structure Can Cause Instrumental Convergence · 2021-08-13T18:51:18.900Z · LW · GW

The first sentence

Comment by TurnTrout on Power-seeking for successive choices · 2021-08-13T18:29:07.765Z · LW · GW

You're being unhelpfully pedantic. The quoted portion even includes the phrase "As a quick summary (read the paper and sequence if you want more details)"! This reads to me as an attempted pre-emption of "gotcha" comments.

The phenomena you discuss are explained in the paper (EDIT: top of page 9), and in other posts, and discussed at length in other comment threads. But this post isn't about the stochastic sensitivity issue, and I don't think it should have to talk about the sensitivity issue.

Comment by TurnTrout on Environmental Structure Can Cause Instrumental Convergence · 2021-08-13T16:55:35.906Z · LW · GW

Most of the reward functions are either indifferent about the vase or want to break the vase. The optimal policies of all those reward functions don't "tend to avoid breaking the vase". Those optimal policies don't behave as if they care about the 'strictly more states' that can be reached by not breaking the vase.

This is factually wrong BTW. I had just explained why the opposite is true.

Comment by TurnTrout on Power-seeking for successive choices · 2021-08-13T15:52:54.609Z · LW · GW

The point of using scare quotes is to abstract away that part. So I think it is an accurate description, in that it flags that “options” is not just the normal intuitive version of options.

Comment by TurnTrout on Environmental Structure Can Cause Instrumental Convergence · 2021-08-11T16:36:02.233Z · LW · GW

Can you say more? I don't think there's a way to "wait" in Pac-Man, although I suppose you could always loop around the level in a particular repeating fashion such that you keep revisiting the same state.

Comment by TurnTrout on Environmental Structure Can Cause Instrumental Convergence · 2021-08-10T21:24:12.833Z · LW · GW

Under the natural Pac-Man model (where different levels have different mechanics), then yes, agents will tend to want to beat the level -- because at any point in time, most of the remaining possibilities are in future levels, not the current one.

Eating ghosts is more incidental; the agent will probably tend to eat ghosts as an instrumental move for beating the level.

Comment by TurnTrout on Environmental Structure Can Cause Instrumental Convergence · 2021-08-10T20:00:12.323Z · LW · GW

The original version of this post claimed that an MDP-independent constant C helped lower-bound the probability assigned to power-seeking reward functions under simplicity priors. This constant is not actually MDP-independent (at least, the arguments given don't show that): the proof sketch assumes that the MDP is given as input to the permutation-finding algorithm (which is the same, for every MDP you want to apply it to!). But the input's description length must also be part of the Kolmogorov complexity (else you could just compute any string for free by saying "the identity program outputs the string, given the string as input").

The upshot is that the given lower bound is weaker for more complex environments. There are other possible recourses, like "at least half of the permutations of any NPS element will be PS element, and they surely can't all be high-complexity permutations!" — but I leave that open for now.

Oops, and fixed.
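The correction can be stated with the standard chain rule for prefix Kolmogorov complexity (a hedged restatement, not the post's notation): writing $x$ for the output string and $y$ for the input (here, the MDP description),

$$K(x) \;\le\; K(y) + K(x \mid y) + O(1), \qquad K(y \mid y) = O(1).$$

A short program that is given $y$ only bounds the conditional term $K(x \mid y)$; the unconditional complexity still pays $K(y)$, which grows with the complexity of the environment.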

Comment by TurnTrout on Seeking Power is Convergently Instrumental in a Broad Class of Environments · 2021-08-10T16:27:02.431Z · LW · GW

contain a representation of the entire MDP (because the program needs to simulate the MDP for each possible permutation)

We aren't talking about MDPs, we're talking about a broad class of environments which are represented via joint probability distributions over actions and observations. See post title.

it's an unbounded integer.

I don't follow.

the program needs to contain way more than 100 bits

See the arguments in the post Rohin linked for why this argument is gesturing at something useful even if it takes some more bits.

But IMO the basic idea in this case is, you can construct reasonably simple utility functions like "utility 1 if history has the agent taking action  at time step  given action-observation history prefix , and 0 otherwise." This is reasonably short, and you can apply it for all actions and time steps.

Sure, the complexity will vary a little bit (probably later time steps will be more complex), but basically you can produce reasonably simple programs which make any sequence of actions optimal. And so I agree with Rohin that simplicity priors on u-AOH will quantitatively, but not qualitatively, affect the conclusions for the generic u-AOH case. [EDIT: this reasoning is different from the one Rohin gave, TBC]
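A minimal sketch of the kind of indicator utility described above, over a toy action-observation history space. The two-action, two-observation setup and all names here are hypothetical; the point is only that the utility function is a few lines long and makes the chosen action optimal.

```python
from itertools import product

# Hypothetical toy interface: two actions, two observations.
ACTIONS = ("left", "right")
OBSERVATIONS = ("o0", "o1")


def make_indicator_utility(prefix, t, action):
    """Utility 1 if the history begins with `prefix` (the first t
    (action, observation) pairs) and the agent takes `action` at
    time step t; 0 otherwise."""
    prefix = tuple(prefix)

    def utility(history):
        if len(history) <= t or tuple(history[:t]) != prefix:
            return 0
        return 1 if history[t][0] == action else 0

    return utility


def all_histories(horizon):
    """Every action-observation history of the given length."""
    steps = list(product(ACTIONS, OBSERVATIONS))
    return list(product(steps, repeat=horizon))


def best_first_action(utility, horizon=2):
    """First action with the highest achievable utility (brute force)."""
    histories = all_histories(horizon)

    def best_case(a):
        return max(utility(h) for h in histories if h[0][0] == a)

    return max(ACTIONS, key=best_case)
```

For example, `best_first_action(make_indicator_utility((), 0, "right"))` picks `"right"`, and swapping in `"left"` makes `"left"` optimal instead.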

Comment by TurnTrout on Framing Practicum: Stable Equilibrium · 2021-08-10T02:44:12.474Z · LW · GW

1. The dog at the Burrow has a suspiciously regular patrol, during which he looks for food. If you give him food, he will lick your hand and wag his tail gratefully, before returning to the patrol.

   If someone took him away in a car, he would not return to the route.

2. When I move my mouse, my monitor activates. If I do nothing for a while, it will go dark again.

   If I set up a robot which constantly twitched the mouse, the monitor would not go dark, no matter how long I waited.

3. An elastic object, such as a rubber band, will hang such that the force of tension equals the force of gravity.

   If I snap the band, it will not return to its original equilibrium.

Comment by TurnTrout on Seeking Power is Convergently Instrumental in a Broad Class of Environments · 2021-08-09T17:12:29.261Z · LW · GW

Do I understand correctly that in general the elements of A, B, C,  are achievable probability distributions over the set of n possible outcomes? (But that in the examples given with the deterministic environments, these are all standard basis vectors / one-hot vectors / deterministic distributions ?)

Yes.

Then no permutation of the set of observations-histories would convert any element of A into an element of B, nor visa versa.

Nice catch. In the stochastic case, you do need a permutation-enforced similarity, as you say (see definition 6.1: similarity of visit distribution sets in the paper). They won't apply for all A, B, because that would prove way too much.

Comment by TurnTrout on Seeking Power is Convergently Instrumental in a Broad Class of Environments · 2021-08-09T17:06:57.739Z · LW · GW

The results are not centrally about the uniform distribution. The uniform distribution result is more of a consequence of the (central) orbit result / scaling law for instrumental convergence. I gesture at the uniform distribution to highlight the extreme strength of the statistical incentives.

Comment by TurnTrout on Delta Strain: Fact Dump and Some Policy Takeaways · 2021-08-02T15:54:04.675Z · LW · GW

I'm concerned about nonlinearly bad outcomes from losing a few IQ points. I don't think it's just that if I lost 5 IQ, I'd work more slowly. Perhaps I just wouldn't understand certain things; there may be thoughts I could no longer think; insights I could no longer have. I'm worried about this both for impact reasons (helping solve alignment) and for personal reasons: my mental acuity is important to my life-enjoyment, via grasping new concepts and enjoying self-study (you do mention happiness effects). On an extreme end of the spectrum (which COVID would not take me to), I doubt an 80-IQ TurnTrout would much enjoy self-study, however slowly that progressed for him.

I'm similarly worried about "tipping point" effects from other possible long COVID symptoms, like fatigue ("straw that broke the camel's back"; fatigue may not just reduce my productivity and enjoyment by 5%, but make it consistently hard for me to do important thing X at all due to insufficient activation energy).

IQ affects many things. If you roughly double your income when you go up 60 IQ points, each IQ point is about 2% added income. Each IQ point is probably also .1% happiness (.01/10). It also probably reflects some underlying worse health that may have other affects. It’s hard not to overcount when something is correlated with everything. However, I think the largest effect here will be the money-equivalent (impact, perhaps). We can double this at the end to account for other effects. Giving up 1/50 of this for an IQ point is equivalent to a lost week of production every year, which, while not the same as lost life, is still pretty bad.

What does "this" refer to, above?

Comment by TurnTrout on Why Subagents? · 2021-08-02T01:15:56.323Z · LW · GW

Typo:

there is no way to get from to the “New York (from Philly)” node directly from the “New York (from DC)” node.

Comment by TurnTrout on Actually possible: thoughts on Utopia · 2021-07-31T15:41:16.108Z · LW · GW

Comment by TurnTrout on Actually possible: thoughts on Utopia · 2021-07-30T20:23:15.234Z · LW · GW

I love this post. Thank you for writing it. One of my favorite passages:

There are oceans we have barely dipped a toe into; there are drums and symphonies we can barely hear; there are suns whose heat we can barely feel on our skin.

Comment by TurnTrout on Environmental Structure Can Cause Instrumental Convergence · 2021-07-30T18:55:57.112Z · LW · GW

Relatedly [to power-seeking under the simplicity prior], Rohin Shah wrote:

if you know that an agent is maximizing the expectation of an explicitly represented utility function, I would expect that to lead to goal-driven behavior most of the time, since the utility function must be relatively simple if it is explicitly represented, and simple utility functions seem particularly likely to lead to goal-directed behavior.

Comment by TurnTrout on TurnTrout's shortform feed · 2021-07-29T15:08:44.311Z · LW · GW

If you raised children in many different cultures, "how many" different reflectively stable moralities could they acquire? (What's the "VC dimension" of human morality, without cheating by e.g. directly reprogramming brains?)

(This is probably a Wrong Question, but I still find it interesting to ask.)

Comment by TurnTrout on TurnTrout's shortform feed · 2021-07-27T13:18:59.766Z · LW · GW

My power-seeking theorems seem a bit like Vingean reflection. In Vingean reflection, you reason about an agent which is significantly smarter than you: if I'm playing chess against an opponent who plays the optimal policy for the chess objective function, then I predict that I'll lose the game. I predict that I'll lose, even though I can't predict my opponent's (optimal) moves - otherwise I'd probably be that good myself.

My power-seeking theorems show that most objectives have optimal policies which e.g. avoid shutdown and survive into the far future, even without saying what particular actions these policies take to get there. I may not even be able to compute a single optimal policy for a single non-trivial objective, but I can still reason about the statistical tendencies of optimal policies.
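A toy Monte Carlo illustration of reasoning about statistical tendencies without computing any particular policy's details. This is a sketch under strong simplifying assumptions, not the theorems' setting: the agent makes one choice between a "shutdown" branch with a single terminal outcome and a "stay on" branch with several, it cares only about the terminal outcome, and terminal rewards are drawn i.i.d. uniformly from [0, 1]. All names are hypothetical.

```python
import random


def fraction_avoiding_shutdown(num_samples=20_000,
                               num_other_outcomes=3,
                               seed=0):
    """Estimate how often a uniformly sampled reward function has an
    optimal policy that avoids the shutdown branch."""
    rng = random.Random(seed)
    avoided = 0
    for _ in range(num_samples):
        shutdown_reward = rng.random()
        other_rewards = [rng.random() for _ in range(num_other_outcomes)]
        # The optimal policy heads for the best terminal outcome; it avoids
        # shutdown whenever some other outcome is strictly better.
        if max(other_rewards) > shutdown_reward:
            avoided += 1
    return avoided / num_samples
```

With three outcomes behind "stay on" and one behind "shutdown", the estimate should come out near 3/4, and it moves toward 1 as `num_other_outcomes` grows, even though nothing here says which outcome any given optimal policy ends up in.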

Comment by TurnTrout on Ask Not "How Are You Doing?" · 2021-07-22T00:08:40.692Z · LW · GW

"What are you doing?" often means "what are you thinking, why are you doing that thing? That's wrong." At least to my ear.

I prefer to start higher bandwidth conversations with "what have you been thinking about recently?".

Comment by TurnTrout on Seeking Power is Often Convergently Instrumental in MDPs · 2021-07-21T15:10:15.984Z · LW · GW

I proposed changing "instrumental convergence" to "robust instrumentality." This proposal has not caught on, and so I reverted the post's terminology. I'll just keep using 'convergently instrumental.' I do think that 'convergently instrumental' makes more sense than 'instrumentally convergent', since the agent isn't "convergent for instrumental reasons", but rather, it's more reasonable to say that the instrumentality is convergent in some sense.

For the record, the post used to contain the following section:

## A note on terminology

The robustness-of-strategy phenomenon became known as the instrumental convergence hypothesis, but I propose we call it robust instrumentality instead.

From the paper’s introduction:

An action is said to be instrumental to an objective when it helps achieve that objective. Some actions are instrumental to many objectives, making them robustly instrumental. The so-called instrumental convergence thesis is the claim that agents with many different goals, if given time to learn and plan, will eventually converge on exhibiting certain common patterns of behavior that are robustly instrumental (e.g. survival, accessing usable energy, access to computing resources). Bostrom et al.'s instrumental convergence thesis might more aptly be called the robust instrumentality thesis, because it makes no reference to limits or converging processes:

“Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent's goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents.”

Some authors have suggested that gaining power over the environment is a robustly instrumental behavior pattern on which learning agents generally converge as they tend towards optimality. If so, robust instrumentality presents a safety concern for the alignment of advanced reinforcement learning systems with human society: such systems might seek to gain power over humans as part of their environment. For example, Marvin Minsky imagined that an agent tasked with proving the Riemann hypothesis might rationally turn the planet into computational resources.

This choice is not costless: many are already acclimated to the existing ‘instrumental convergence.’ It even has its own Wikipedia page. Nonetheless, if there ever were a time to make the shift, that time would be now.

Comment by TurnTrout on [AN #156]: The scaling hypothesis: a plan for building AGI · 2021-07-21T14:51:46.574Z · LW · GW

Sure.

Additional note for posterity: when I talked about "some objectives [may] make alignment far more likely", I was considering something like "given this pretraining objective and an otherwise fixed training process, what is the measure of data-sets in the N-datapoint hypercube such that the trained model is aligned?", perhaps also weighting by ease of specification in some sense.

Comment by TurnTrout on [AN #156]: The scaling hypothesis: a plan for building AGI · 2021-07-21T04:13:40.643Z · LW · GW

Claim 3: If you don't control the dataset, it mostly doesn't matter what pretraining objective you use (assuming you use a simple one rather than e.g. a reward function that encodes all of human values); the properties of the model are going to be roughly similar regardless.

Analogous claim: since any program specifiable under UTM U1 is also expressible under UTM U2, choice of UTM doesn't matter.

And this is true up to a point: up to constant factors, it doesn't matter. But U1 can make it easier (simpler, faster, etc.) to specify a set of programs than does U2. And so "there exists a program in U2-encoding which implements P in U1-encoding" doesn't get everything I want: I want to reason about the distribution of programs, about how hard it tends to be to get programs with desirable properties.
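The "up to constant factors" point is the standard invariance theorem for Kolmogorov complexity. As a sketch, writing $K_U(x)$ for the length of the shortest program that outputs $x$ on machine $U$ (this notation is introduced here for illustration and does not appear in the original comment):

```latex
\exists\, c_{U_1,U_2} \;\; \forall x : \quad
\bigl\lvert K_{U_1}(x) - K_{U_2}(x) \bigr\rvert \;\le\; c_{U_1,U_2}
```

The constant does not depend on $x$, but it can be large, so the bound by itself says little about which programs are short under a particular machine.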

Stepping out of the analogy, even though I agree that "reasonable" pretraining objectives are all compatible with aligned / unaligned / arbitrarily behaved models, this argument seems to leave room that some objectives make alignment far more likely, a priori. And you may be noting as much:

(This is probably the weakest argument in the chain; just because most of the influence comes from the dataset doesn't mean that the pretraining objective can't have influence as well. I still think the claim is true though, and I still feel pretty confident about the final conclusion in the next claim.)

Comment by TurnTrout on Winston Churchill, futurist and EA · 2021-07-12T17:27:01.566Z · LW · GW

Even so, I also think "Winston Churchill, Futurist" makes more sense and better describes the content of your post.

Comment by TurnTrout on The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies · 2021-07-12T16:09:40.144Z · LW · GW

Yeah, that's right. (That is why I called it the "start" of a theory on invariances!)

I think that's an interesting frame which I'll return to when I think more about agents planning over an imperfect world model.

Comment by TurnTrout on The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies · 2021-07-12T16:05:29.600Z · LW · GW

In my reading about the various usages of 'power', there are indeed definitions which focus on exerting control through other agents. I think in many situations, this is a useful frame, but I find "ability to achieve goals in general" to be both broader and also upstream of "ability to control others to achieve your goals."

(also - upvoted for asking a question and then editing to acknowledge a misunderstanding!)

Comment by TurnTrout on The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies · 2021-07-12T15:25:59.750Z · LW · GW

As I understand expanding candy into A and B but not expanding the other will make the ratios go differently.

What do you mean?

If we knew what was important and what not we would be sure about the optimality. But since we think we don't know it or might be in error about it we are treating that the value could be hiding anywhere.

I'm not currently trying to make claims about what variants we'll actually be likely to specify, if that's what you mean. Just that in the reasonably broad set of situations covered by my theorems, the vast majority of variants of every objective function will make power-seeking optimal.

Comment by TurnTrout on A world in which the alignment problem seems lower-stakes · 2021-07-10T03:12:12.504Z · LW · GW

The main benefit is that the AI is not at risk of killing you. In the left half of the universe, it is at risk of killing you.

Comment by TurnTrout on A world in which the alignment problem seems lower-stakes · 2021-07-08T22:24:09.847Z · LW · GW

I don't follow why you disagree. It's higher-stakes to operate something which can easily kill me than to operate something which can't.

Comment by TurnTrout on A world in which the alignment problem seems lower-stakes · 2021-07-08T21:37:21.988Z · LW · GW

I'm torn about whether this seems lower-stakes or not.

I think it is lower-stakes in a fairly straightforward way: An unaligned AGI on the right side of the universe won't be able to kill you and your civilization.

Furthermore, most misspecifications will just end up with a worthless right half of the universe - you'd have to be quite good at alignment in order to motivate the AGI to actually create and harm humans, as opposed to the AGI wireheading forever off of a sensory reward signal.

Comment by TurnTrout on A world in which the alignment problem seems lower-stakes · 2021-07-08T17:23:44.673Z · LW · GW

Yeah, we are magically instantly influencing an AGI which will thereafter be outside of our light cone. This is not a proposal, or something which I'm claiming is possible in our universe. Just take for granted that such a thing is possible in this contrived example environment.

My conception of utility is that it's a synthetic calculation from observations about the state of the universe, not that it's a thing on it's own which can carry information.

Well, maybe here's a better way of communicating what I'm after:

Suppose that you have beliefs about the initial state of the right (AGI) half, and you know how it's going to evolve; this gives you a distribution over right-half universe histories - you have beliefs about the AGI's initial state, and you can compute the consequences of those beliefs in terms of how the right half of the universe will end up.

In this way, you can take expected utility over the joint universe history, without being able to observe what's actually happening on the AGI's end. This is similar to how I prefer "start a universe which grows to be filled with human flourishing" over "start a universe which fills itself with suffering", even though I may not observe the fruits of either decision.

Is this clearer?

Comment by TurnTrout on A world in which the alignment problem seems lower-stakes · 2021-07-08T15:12:25.280Z · LW · GW

I'm not sure if you're arguing that this is a good world in which to think about alignment.

I am not arguing this. Quoting my reply to ofer:

I think I sometimes bump into reasoning that feels like "instrumental convergence, smart AI, & humans exist in the universe -> bad things happen to us / the AI finds a way to hurt us"; I think this is usually true, but not necessarily true, and so this extreme example illustrates how the implication can fail.

(Edited post to clarify)

Comment by TurnTrout on A world in which the alignment problem seems lower-stakes · 2021-07-08T15:11:34.628Z · LW · GW

Even in environments where the agent is "alone", we may still expect the agent to have the following potential convergent instrumental values

Right. But I think I sometimes bump into reasoning that feels like "instrumental convergence, smart AI, & humans exist in the universe -> bad things happen to us / the AI finds a way to hurt us"; I think this is usually true, but not necessarily true, and so this extreme example illustrates how the implication can fail. (And note that the AGI could still hurt us in a sense, by simulating and torturing humans using its compute. And some decision theories do seem to have it do that kind of thing.)

(Edited post to clarify)

Comment by TurnTrout on rohinmshah's Shortform · 2021-06-26T15:46:24.800Z · LW · GW

I like this question. If I had to offer a response from econ 101:

Suppose people love eating a certain endangered species of whale, and that people would be sad if the whale went extinct, but otherwise didn't care about how many of these whales there were. Any individual consumer might reason that their consumption is unlikely to cause the whale to go extinct.

We have a tragedy of the commons, and we need to internalize the negative externalities of whale hunting. However, the harm is discontinuous in the number of whales remaining: there's an irreversible extinction point. Therefore, Pigouvian taxes aren't actually a good idea because regulators may not be sure what the post-tax equilibrium quantity will be. If the quantity is too high, the whales go extinct.

Therefore, a "cap and trade" program would work better: there are a set number of whales that can be killed each year, and firms trade "whale certificates" with each other. (And, IIRC, if # of certificates = post-tax equilibrium quantity, this scheme has the same effect as a Pigouvian tax of the appropriate amount.)
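A minimal sketch of the tax versus cap comparison, assuming a linear inverse demand curve `a - b*q` and a linear marginal private cost curve `c + d*q`. The functional forms, function names, and numbers are all hypothetical and chosen only for illustration. It shows that a cap set at the post-tax quantity yields a permit price equal to the tax, and that if demand turns out higher than the regulator believed, the tax lets quantity rise while the cap holds it fixed.

```python
def quantity_under_tax(a, b, c, d, tax):
    """Equilibrium quantity when each unit carries a per-unit tax.

    Solves a - b*q = c + d*q + tax for q (assumed linear curves).
    """
    return (a - c - tax) / (b + d)


def permit_price_under_cap(a, b, c, d, cap):
    """Market-clearing permit price when total quantity is capped.

    The permit price is the gap between willingness to pay and marginal
    private cost at the capped quantity; it is zero if the cap does not bind.
    """
    unregulated_quantity = (a - c) / (b + d)
    if cap >= unregulated_quantity:
        return 0.0
    return (a - b * cap) - (c + d * cap)


# Hypothetical parameters: the regulator believes demand intercept a = 100.
a, b, c, d = 100.0, 1.0, 20.0, 1.0
tax = 30.0

q_tax = quantity_under_tax(a, b, c, d, tax)
permit = permit_price_under_cap(a, b, c, d, cap=q_tax)
print(q_tax, permit)  # the permit price matches the tax when cap == q_tax

# If true demand is higher (a = 120), the tax lets quantity drift upward,
# while the cap keeps quantity fixed and the permit price absorbs the shock.
a_true = 120.0
q_tax_true = quantity_under_tax(a_true, b, c, d, tax)
permit_true = permit_price_under_cap(a_true, b, c, d, cap=q_tax)
print(q_tax_true, permit_true)
```

Under these assumed curves, the cap fixes the quantity and lets the price adjust, whereas the tax fixes the price wedge and lets the quantity adjust, which is the relevant difference when there is a hard extinction threshold on quantity.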

Similarly: if I, a house member, am unsure about others' willingness to pay for risky activities, then maybe I want to cap the weekly allowable microcovids and allow people to trade them amongst themselves. This is basically a fancier version of "here's the house's weekly microcovid allowance" which I heard several houses used. I'm protecting myself against my uncertainty like "maybe someone will just go sing at a bar one week, and they'll pay me $1,000, but actually I really don't want to get sick for $1,000." (EDIT: In this case, maybe you need to charge more per microcovid? This makes me less confident in the rest of this argument.)

There are a couple of problems with this argument. First, you said taxes worked fine for your group house, which somewhat (but not totally) discredits all of this theorizing. Second, (4) seems most likely. Otherwise, I feel like we might have heard about covid taxes being considered and then discarded (in e.g. different retrospectives)?

Comment by TurnTrout on rohinmshah's Shortform · 2021-06-26T15:34:08.450Z · LW · GW

Perhaps I don't follow. Why would you have to market-base "all the transactions in a group house", instead of just the COVID-19 ones?