Posts

Book review: Architects of Intelligence by Martin Ford (2018) 2020-08-11T17:30:21.247Z
The recent NeurIPS call for papers requires authors to include a statement about the potential broader impact of their work 2020-02-24T07:44:20.850Z
ofer's Shortform 2019-11-26T14:59:40.664Z
A probabilistic off-switch that the agent is indifferent to 2018-09-25T13:13:16.526Z
Looking for AI Safety Experts to Provide High Level Guidance for RAISE 2018-05-06T02:06:51.626Z
A Safer Oracle Setup? 2018-02-09T12:16:12.063Z

Comments

Comment by ofer on How do you decide when to change N95/FFP-2 masks? · 2021-09-11T17:15:42.449Z · LW · GW

With 4 hours per day of use, N95 masks retain ~95% efficacy after 3 days, ~92% efficacy after 5 days, and drop to ~80% efficacy after 14 days (source).

I think the paper you linked to reports on an experiment in which respirators were worn for a total of 8 hours per day, not 4.

Comment by ofer on How do you decide when to change N95/FFP-2 masks? · 2021-09-11T16:56:55.050Z · LW · GW

Re "Loss of electrostatic charge worsens filtration efficacy", this paper might also be relevant (e.g. figures 1 and 2; though I don't know how to interpret them).

Comment by ofer on Obstacles to gradient hacking · 2021-09-10T23:36:32.591Z · LW · GW

As I said here, the idea does not involve having some "dedicated" piece of logic C that makes the model fail if the outputs of the two malicious pieces of logic don't satisfy some condition.

Comment by ofer on Gradient descent is not just more efficient genetic algorithms · 2021-09-10T23:15:32.957Z · LW · GW

I don't see how this is relevant here. If it is the case that changing only one of the weights does not affect the loss, and changing only the other does not affect the loss, then SGD would not change them (their gradient components will be zero), even if changing them both together can affect the loss.

Comment by ofer on Formalizing Objections against Surrogate Goals · 2021-09-09T18:03:25.033Z · LW · GW

Regarding the following part of the view that you commented on:

But if we want AI to implement them, we should mainly work on solving foundational issues in decision and game theory with an aim toward AI.

Just wanted to add: It may be important to consider potential downside risks of such work. It may be important to be vigilant when working on certain topics in game theory and e.g. make certain binding commitments before investigating certain issues, because otherwise one might lose a commitment race in logical time. (I think this is a special case of a more general argument made in Multiverse-wide Cooperation via Correlated Decision Making about how it may be important to make certain commitments before discovering certain crucial considerations.)

Comment by ofer on Gradient descent is not just more efficient genetic algorithms · 2021-09-09T17:23:59.340Z · LW · GW

My formulation is broad enough that it doesn't have to be a dedicated piece of logic, there just has to be some way of looking at the rest of the network that depends on X and Y being the same.

But X and Y are not the same! For example, if the model is intended to classify images of animals, the computation X may correspond to [how many legs does the animal have?] and Y may correspond to [how large is the animal?]

This is what I take issue with - if there is a way to change both components simultaneously to have an effect on the loss, SGD will happily do that.

This seems wrong to me. SGD updates the weights in the direction of the gradient, and if changing a given weight alone does not affect the loss, then the gradient component associated with that weight will be 0 and thus SGD will not change that weight.
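As a minimal numerical sketch of this point (a toy loss with made-up weights, not anything from the post under discussion):

```python
# Toy loss in which changing either weight alone leaves the loss unchanged,
# but changing both together lowers it.
def loss(w1, w2):
    return 1.0 - w1 * w2

w1, w2 = 0.0, 0.0
eps = 1e-6

# Each gradient component is computed while holding the other weight fixed:
g1 = (loss(w1 + eps, w2) - loss(w1, w2)) / eps  # 0.0, since w2 == 0
g2 = (loss(w1, w2 + eps) - loss(w1, w2)) / eps  # 0.0, since w1 == 0
print(g1, g2)  # both 0: gradient descent leaves (w1, w2) where they are

# Yet changing both weights together does affect the loss:
print(loss(1.0, 1.0))  # 0.0, down from 1.0
```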

Comment by ofer on Gradient descent is not just more efficient genetic algorithms · 2021-09-09T14:59:25.145Z · LW · GW

Imagine you're given a network that has two identical submodules, and some kind of combining function where if it detects the outputs from both submodules are the same it passes the value through but if they're different it explodes and outputs zero or something totally random, or whatever. This is a natural idea to come up with if your goal is to ensure the optimizer doesn't mess with these modules, for example, if you're trying to protect the mesaobjective encoding for gradient hacking.

I think this is a wrong interpretation of the idea that I described in this comment (which your linked comment here is a reply to). There need not be a "dedicated" piece of logic that does nothing other than checking whether the outputs from two subnetworks satisfy some condition and making the model "fail" otherwise. Having such a dedicated piece of logic wouldn't work, because SGD would just remove it. Instead, suppose that the model depends on two different computations, X and Y, for the purpose of minimizing its loss. Now suppose there are two malicious pieces of logic, one within the subnetwork that computes X and one within the subnetwork that computes Y. If a certain condition on the input of the entire network is satisfied, the malicious logic pieces make both X and Y fail. Even so, the gradient components of the weights that are associated with the malicious pieces of logic are close to zero (putting aside regularization), because changing any single weight has almost no effect on the loss.
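A minimal sketch of the kind of structure I have in mind (toy PyTorch code with made-up numbers; the "gates" stand in for the two malicious pieces of logic, and the trigger condition on the input is assumed to hold):

```python
import torch

# Weights inside the two malicious pieces of logic. When the trigger condition
# on the input holds, each piece drives its computation toward failure.
w1 = torch.tensor(-5.0, requires_grad=True)
w2 = torch.tensor(-5.0, requires_grad=True)

gate1 = torch.sigmoid(w1)  # computation X "succeeds" only if gate1 is near 1
gate2 = torch.sigmoid(w2)  # computation Y "succeeds" only if gate2 is near 1

target = torch.tensor(1.0)
output = gate1 * gate2 * target  # the output is correct only if both X and Y succeed
loss = (output - target) ** 2

loss.backward()
# d(output)/d(gate1) = gate2, which is close to 0: fixing one piece alone barely
# reduces the loss, because the other piece still makes the model fail. Hence
# both gradient components are close to zero.
print(w1.grad, w2.grad)
```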

Comment by ofer on Obstacles to gradient hacking · 2021-09-09T14:58:21.749Z · LW · GW

To make sure I understand your notation, f_1 is some set of weights, right? If it's a set of multiple weights I don't know what you mean when you write C(f_1,f_1).

There should also exist at least some f_1, f_2 where C(f_1,f_1)≠C(f_2,f_2), since otherwise C no longer depends on the pair of redundant networks at all

(I don't yet understand the purpose of this claim, but it seems to me wrong. If C(f_1,f_1)=C(f_2,f_2) for every f_1, f_2, why is it true that C does not depend on f_1 and f_2 when f_1≠f_2?)

Comment by ofer on Obstacles to gradient hacking · 2021-09-09T14:56:22.314Z · LW · GW

But gradient descent doesn’t modify a neural network one weight at a time

Sure, but the gradient component that is associated with a given weight is still zero if updating that weight alone would not affect the loss.

Comment by ofer on Obstacles to gradient hacking · 2021-09-06T04:02:53.247Z · LW · GW

This post is essentially the summary of a long discussion on the EleutherAI discord about trying to exhibit gradient hacking in real models by hand crafting an example.

I wouldn't say that this work is attempting to "exhibit gradient hacking". (Succeeding in that would require creating a model that can actually model SGD.) Rather, my understanding is that this work is trying to demonstrate techniques that might be used in a gradient hacking scenario.

There are a few ways to protect a subnetwork from being modified by gradient descent that I can think of (non-exhaustive list):

Another way of "protecting" a piece of logic in the network from changes (if we ignore regularization) is by redundancy: Suppose there are two different pieces of logic in the network such that each independently makes the model output what the mesa optimizer wants. Due to the redundancy, changing any single weight—that is associated with one of those two pieces of logic—does not change the output, and thus the gradient components of all those weights should be close to zero.
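A minimal sketch of this redundancy effect (toy PyTorch code with made-up numbers; the scalars a and b stand in for the two redundant pieces of logic):

```python
import torch

# Two redundant pieces of logic; each alone suffices to produce the output
# the mesa-optimizer wants (abstracted here as a and b being near 1).
w1 = torch.tensor(5.0, requires_grad=True)
w2 = torch.tensor(5.0, requires_grad=True)

a = torch.sigmoid(w1)
b = torch.sigmoid(w2)

# OR-like combination: the desired output is produced if either piece works.
output = 1 - (1 - a) * (1 - b)
loss = (1 - output) ** 2

loss.backward()
# d(output)/d(a) = (1 - b), which is close to 0: as long as the other piece still
# works, changing any single weight barely moves the output, so its gradient
# component is close to zero.
print(w1.grad, w2.grad)
```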

Comment by ofer on Formalizing Objections against Surrogate Goals · 2021-09-04T12:45:50.376Z · LW · GW

In the bandits example, it seems like the caravan can unilaterally employ SPI to reduce the badness of the bandit's threat. For example, the caravan can credibly commit that they will treat Nerf guns identically to regular guns, so that (a) any time one of them is shot with a Nerf gun, they will flop over and pretend to be a corpse, until the heist has been resolved, and (b) their probability of resisting against Nerf guns will be the same as the probability of resisting against actual guns. In this case the bandits might as well use Nerf guns (perhaps because they're cheaper, or they prefer not to murder if possible). If the bandits continue to use regular guns, the caravan isn't any worse off, so it is an SPI, despite the fact that we have assumed nothing on the part of the bandits.

I agree that such a commitment can be employed unilaterally and can be very useful. Though the caravan should consider that doing so may increase the chance of them being attacked (due to the Nerf guns being cheaper etc.). So perhaps the optimal unilateral commitment is more complicated and involves a condition where the bandits are required to somehow make the Nerf gun attack almost as costly for themselves as a regular attack.

Comment by ofer on Thoughts on gradient hacking · 2021-09-04T12:12:37.168Z · LW · GW

But if the agent is repeatedly carrying out its commitment to fail, then there’ll be pretty strong pressure from gradient descent to change that. What changes might that pressure lead to? The two most salient options to me:

  1. The agent’s commitment to carrying out gradient hacking is reduced.
  2. The agent’s ability to notice changes implemented by gradient descent is reduced.

In a gradient hacking scenario, we should expect the malicious conditionally-fail-on-purpose logic to be optimized for such outcomes not to occur. For example, the malicious logic may involve redundancy: Suppose there are two different conditionally-fail-on-purpose logic pieces in the network such that each independently makes the model fail if the x-component is large. Due to the redundancy, a potential failure should have almost no influence on the gradient components that are associated with the weights of the malicious logic pieces. (This is similar to the idea in this comment from a previous discussion we had.)

Comment by ofer on How to turn money into AI safety? · 2021-08-28T08:45:39.282Z · LW · GW

I think we worry way too much about reputation concerns. These seem hypothetical to me, and if we just fund a lot of work some of it will be great and rise to the top, the rest will be mediocre and forgotten or ignored.

I think you're overconfident that mediocre work will be "forgotten or ignored". We don't seem to have reliable metrics for measuring the goodness of alignment work. We have things like post karma and what high-status researchers are willing to say publicly about the work, but IMO these are not reliable metrics for the purpose of detecting mediocre work. (Partially due to Goodhart's law; people who seek funding for alignment research probably tend to optimize for their posts getting high karma and their work getting endorsements from high-status researchers). FWIW I don't think the reputation concerns are merely hypothetical at this point.

Comment by ofer on Power-seeking for successive choices · 2021-08-18T16:39:47.086Z · LW · GW

That quote does not seem to mention the "stochastic sensitivity issue". In the post that you linked to, "(3)" refers to:

  1. Not all environments have the right symmetries
    • But most ones we think about seem to

So I'm still not sure what you meant when you wrote "The phenomena you discuss are explained in the paper (EDIT: top of page 9), and in other posts, and discussed at length in other comment threads."

(Again, I'm not aware of any previous mention of the "stochastic sensitivity issue" other than in my comment here.)

Comment by ofer on Environmental Structure Can Cause Instrumental Convergence · 2021-08-18T16:30:38.762Z · LW · GW

Thanks for the figure. I'm afraid I didn't understand it. (I assume this is a gridworld environment; what does "standing near intact vase" mean? Can the robot stand in the same cell as the intact vase?)

You’re playing a tad fast and loose with your involution argument. Unlike the average-optimal case, you can’t just map one set of states to another for all-discount-rates reasoning.

I don't follow (To be clear, I was not trying to apply any theorem from the paper via that involution). But does this mean you are NOT making that claim ("most agents will not immediately break the vase") in the limit of the discount rate going to 1? My understanding is that the main claim in the abstract of the paper is meant to assume that setting, based on the following sentence from the paper:

Proposition 6.5 and proposition 6.9 are powerful because they apply to all γ, but they can only be applied given hard-to-satisfy environmental symmetries.

Comment by ofer on Environmental Structure Can Cause Instrumental Convergence · 2021-08-16T12:26:04.201Z · LW · GW

The claim should be: most agents will not immediately break the vase.

I don't see why that claim is correct either, for a similar reason. If you're assuming here that most reward functions incentivize avoiding immediately breaking the vase then I would argue that that assumption is incorrect, and to support this I would point to the same involution from my previous comment.

Comment by ofer on Power-seeking for successive choices · 2021-08-16T12:21:12.738Z · LW · GW

The phenomena you discuss are explained in the paper (EDIT: top of page 9), and in other posts, and discussed at length in other comment threads. But this post isn't about the stochastic sensitivity issue, and I don't think it should have to talk about the sensitivity issue.

I noticed that after my previous comment you edited your comment to include the page number and the link. Thanks.

I still couldn't find in the paper (top of page 9) an explanation for the "stochastic sensitivity issue". Perhaps you were referring to the following:

randomly generated MDPs are unlikely to satisfy our sufficient conditions for POWER-seeking tendencies

But the issue is with stochastic MDPs, not randomly generated MDPs.

Re the linked post section, I couldn't find there anything about stochastic MDPs.

Comment by ofer on Power-seeking for successive choices · 2021-08-16T12:16:43.043Z · LW · GW

As a quick summary (read the paper and sequence if you want more details), they show that for any distribution over reward functions, if there are more "options" available after action 1 than after action 2, then most of the orbit of the distribution (the set of distributions induced by applying any permutation on the MDP, which thus permutes the initial distribution) has optimal policies that do action 1.

Also, this claim is missing the "disjoint requirement" and so it is incorrect even without the "they show that" part (i.e. it's not just that the theorems in the paper don't show the thing that is being claimed, but rather the thing that is being claimed is incorrect). Consider the following example where action 1 leads to more "options" but most optimal policies choose action 2:

Comment by ofer on Environmental Structure Can Cause Instrumental Convergence · 2021-08-13T20:06:16.676Z · LW · GW

Thanks.

We can construct an involution over reward functions that transforms every state by switching the is-the-vase-broken bit in the state's representation. For every reward function that "wants to preserve the vase", we can apply the involution to it and get a reward function that "wants to break the vase".

(And there are the reward functions that are indifferent about the vase, which the involution maps to themselves.)
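To make this concrete, here is a toy sketch (hypothetical state encoding: a state is a (location, vase-broken) pair, and a reward function is a table from states to reals):

```python
# The involution phi flips the is-the-vase-broken bit of each state. Pushing a
# reward function through phi turns every "wants to preserve the vase" reward
# function into a "wants to break the vase" one, and maps vase-indifferent
# reward functions to themselves.
locations = ["A", "B"]
states = [(loc, broken) for loc in locations for broken in (False, True)]

def phi(state):
    loc, broken = state
    return (loc, not broken)

def apply_involution(R):
    return {s: R[phi(s)] for s in states}

R_preserve = {s: (1.0 if not s[1] else 0.0) for s in states}  # rewards intact-vase states
R_break = apply_involution(R_preserve)                        # now rewards broken-vase states
R_indifferent = {s: 0.5 for s in states}

print(R_break)
print(apply_involution(R_indifferent) == R_indifferent)  # True
```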

Comment by ofer on Power-seeking for successive choices · 2021-08-13T19:05:13.087Z · LW · GW

The phenomena you discuss are explained in the paper, and in other posts, and discussed at length in other comment threads.

I haven't found an explanation about the "stochastic sensitivity issue" in the paper, can you please point me to a specific section/page/quote? All that I found about this in the paper was the sentence:

Our theorems apply to stochastic environments, but we present a deterministic case study for clarity.

(I'm also not aware of previous posts/threads that discuss this, other than my comment here.)

I brought up this issue as a demonstration of the implications of incorrectly assuming that the theorems in the paper apply when there are more "options" available after action 1 than after action 2.

(I argue that this issue shows that the informal description in the OP does not correctly describe the theorems in the paper, and it's not just a matter of omitting details.)

Comment by ofer on Environmental Structure Can Cause Instrumental Convergence · 2021-08-13T18:31:05.876Z · LW · GW

Are you saying that my first sentence ("Most of the reward functions are either indifferent about the vase or want to break the vase") is in itself factually wrong, or rather the rest of the quoted text?

Comment by ofer on Power-seeking for successive choices · 2021-08-13T18:12:15.025Z · LW · GW

So I think it is an accurate description, in that it flags that “options” is not just the normal intuitive version of options.

I think the quoted description is not at all what the theorems in the paper show, no matter what concept the word "options" (in scare quotes) refers to. In order to apply the theorems, we need to show that an involution with certain properties exists; not that <some set of things after action 1> is larger than <some set of things after action 2>.

To be more specific, the concept that the word "options" refers to here is recurrent state distributions. If the quoted description was roughly correct, there would not be a problem with applying the theorems in stochastic environments. But in fact the theorems can almost never be applied in stochastic environments. For example, suppose action 1 leads to more available "options", and action 2 causes "immediate death" with probability 0.7515746, and that precise probability does not appear in any transition that follows action 1. We cannot apply the theorems because no involution with the necessary properties exists.
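As a much-simplified illustration of why exact matching matters (toy numbers; the real condition requires a single state involution that embeds one whole set of visit-distribution vectors into the other, not a per-vector check):

```python
import itertools
import numpy as np

# Each "option" is represented as a vector of long-run state-visit probabilities.
options_after_a1 = [np.array([0.5, 0.5, 0.0]),
                    np.array([0.25, 0.25, 0.5])]
options_after_a2 = [np.array([0.2484254, 0.0, 0.7515746])]

def matches_some_option(v, options, tol=1e-12):
    # Can some relabeling (permutation) of the states turn v into one of `options`?
    n = len(v)
    return any(
        np.allclose(v[list(perm)], u, atol=tol)
        for perm in itertools.permutations(range(n))
        for u in options
    )

# The value 0.7515746 appears nowhere after action 1, so no permutation of states
# maps the option after action 2 onto an option after action 1, and the involution
# the theorems need cannot exist.
print([matches_some_option(v, options_after_a1) for v in options_after_a2])  # [False]
```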

Comment by ofer on Power-seeking for successive choices · 2021-08-13T09:14:19.288Z · LW · GW

As a quick summary (read the paper and sequence if you want more details), they show that for any distribution over reward functions, if there are more "options" available after action 1 than after action 2, then most of the orbit of the distribution (the set of distributions induced by applying any permutation on the MDP, which thus permutes the initial distribution) has optimal policies that do action 1.

That is not what the theorems in the paper show at all (it's not just a matter of details). The relevant theorems require a much stronger and more complicated condition than having more "options" after action 1 than after action 2. They require the existence of an involution between two sets of real vectors where each vector corresponds to a "state visitation distribution" of a different policy.

To demonstrate that this is not just a matter of "details": Your description suggests that there is generally no problem applying the theorems in stochastic environments (the paper deals with stochastic MDPs). But since the actual condition is much stronger than what you described here, the theorems almost never apply in stochastic environments!

It's usually impossible to construct a useful involution, as required by the theorems, in stochastic environments. The paper (and the accompanying posts) uses the Pac-Man environment as an example, which is a stochastic environment. But the reason the theorems can often be applied there is that the environment usually behaves deterministically. The ghosts always move deterministically unless they are in "blue mode" (i.e. when they can't kill Pac-Man), in which case they sometimes move randomly. This arbitrary quirk of the Pac-Man environment is what allows the theorems to show that "Blackwell optimal policies tend to avoid immediately dying in Pac-Man" as the paper claims. (Whenever Pac-Man can immediately die, the ghosts are not in "blue mode" and thus the environment behaves deterministically).

(I elaborated more on this here).

Comment by ofer on Automating Auditing: An ambitious concrete technical research proposal · 2021-08-12T17:15:09.057Z · LW · GW

In particular, if automating auditing fails, that should mean we now have a concrete style of attack that we can’t build an auditor to discover—which is an extremely useful thing to have, as it provides both a concrete open problem for further work to focus on, as well as a counter-example/impossibility result to the general possibility of being able to make current systems safely auditable.

What would such a scenario (in which "automating auditing fails") look like? The alignment researchers who will work on this will always be able to say: "Our current ML models are just not capable enough for implementing such an auditor. But if we use 10x training compute or a better architecture etc., we may succeed."

Comment by ofer on Seeking Power is Convergently Instrumental in a Broad Class of Environments · 2021-08-10T16:10:35.348Z · LW · GW

I still don't see how this works. The "small constant" here is actually the length of a program that needs to contain a representation of the entire MDP (because the program needs to simulate the MDP for each possible permutation). So it's not a constant; it's an unbounded integer.

Even if we restrict the discussion to a given very-simple-MDP, the program needs to contain way more than 100 bits (just to represent the MDP + the logic that checks whether a given permutation satisfies the relevant condition). So the probability of the POWER-seeking reward functions that are constructed here is smaller than 2^-100 of the probability of the non-POWER-seeking reward functions. [EDIT: I mean, the probability of the constructed reward functions may happen to be larger, but the proof sketch doesn't show that it is.]

(As an aside, the permutations that we're dealing with here are equal to their own inverse, so it's not useful to apply them multiple times.)

Comment by ofer on Seeking Power is Convergently Instrumental in a Broad Class of Environments · 2021-08-09T17:06:50.549Z · LW · GW

They would change quantitatively, but the upshot would probably be similar. For example, for the Kolmogorov prior, you could prove theorems like "for every reward function that <doesn't do the thing>, there are N reward functions that <do the thing> that each have at most a small constant more complexity" (since you can construct them by taking the original reward function and then apply the relevant permutation / move through the orbit, and that second step has constant K-complexity). Alex sketches out a similar argument in this post.

I don't see how this works. If you need n bits to specify the permutation for that "second step", the probability of each of the N reward functions will be smaller than the original one's by a factor of 2^n. So you need 2^n to be smaller than N, which is impossible?

Comment by ofer on rohinmshah's Shortform · 2021-08-09T11:33:54.176Z · LW · GW

The incentive of social media companies to invest billions into training competitive RL agents that make their users spend as much time as possible on their platform seems like an obvious reason to be concerned. Especially when such RL agents plausibly already select a substantial fraction of the content that people in developed countries consume.

Comment by ofer on My Marriage Vows · 2021-07-26T14:40:19.597Z · LW · GW

A related thing that came up in our discussion after I wrote this post is how to apply the Vow of Concord in the face of utility functions that change over time.

That seems like a very important point. Also, you may end up living for more than a billion years (via future technology). The fraction of your future life in which your ~preferences/goal-system will be similar to your current ones may be extremely small.

Comment by ofer on For some, now may be the time to get your third Covid shot · 2021-07-12T19:43:26.914Z · LW · GW

To be clear, something can be 'substantial/important evidence' (in a Bayesian sense) even if it causes one to update their credence in something from 1% to 2%.

You mostly use the word 'indication' instead of evidence (e.g. "There is significant indication that a third dose of an mRNA vaccine has a good safety profile" and "I agree with them that this indicates a similar safety profile to the first two doses"). I'm not sure what you mean by that word in this context. Can you share with us your credence in the prediction that: [in 5 years it will be widely believed that such a booster shot (taken in July 2021) had a good safety profile] [such a booster shot having a good safety profile]?

Comment by ofer on For some, now may be the time to get your third Covid shot · 2021-07-11T05:25:12.098Z · LW · GW

What you call a "significant indication that a third dose of an mRNA vaccine has a good safety profile" seems to be mostly just statements by vaccine manufacturers. Furthermore, in your list of 8 "reasons to not pursue a booster vaccine now" you don't directly mention anything about potential health risks from taking a booster shot (which I'm not aware of the FDA claiming to be safe).

[EDIT: I'm not saying here whether it's a good or bad idea for someone to get a booster shot. Also, statements by vaccine manufacturers can obviously be important evidence (in a Bayesian sense) for the safety profile, so the way I commented on the first quote may have been overly negative.]

Comment by ofer on Looking for Collaborators for an AGI Research Project · 2021-07-09T07:05:26.959Z · LW · GW

If something along these lines is the fastest path to AGI, I think it needs to be in the right hands. My goal would be, some months or years from now, to get research results that make it clear we’re on the right track to building AGI. I’d go to folks I trust such as Eliezer Yudkowsky/MIRI/OpenAI, and basically say “I think we’re on track to build an AGI, can we do this together and make sure it's safe?” Of course understanding that we may need to completely pause further capabilities research at some point if our safety team does not give us the OK to proceed.

If you "completely pause further capabilities research", what will stop other AI labs from pursuing that research direction further? (And possibly hiring your now frustrated researchers who by this point have a realistic hope for getting immense fame, a Turing Award, etc.).

Comment by ofer on A world in which the alignment problem seems lower-stakes · 2021-07-08T20:36:19.711Z · LW · GW

I think that most of the citations in Superintelligence are in endnotes. In the endnote that follows the first sentence after the formulation of instrumental convergence thesis, there's an entire paragraph about Stephen Omohundro's work on the topic (including citations of Omohundro's "two pioneering papers on this topic").

Comment by ofer on A world in which the alignment problem seems lower-stakes · 2021-07-08T05:06:10.874Z · LW · GW

Bostrom's original instrumental convergence thesis needs to be applied carefully. The danger from power-seeking is not intrinsic to the alignment problem. This danger also depends on the structure of the agent's environment

This post uses the phrase "Bostrom's original instrumental convergence thesis". I'm not aware of there being more than one instrumental convergence thesis. In the 2012 paper that is linked here the formulation of the thesis is identical to the one in the book Superintelligence (2014), except that the paper uses the term "many intelligent agents" instead of "a broad spectrum of situated intelligent agents".

In case it'll be helpful to anyone, the formulation of the thesis in the book Superintelligence is the following:

Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent's goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents. 

I'm not sure what you meant here by saying that the instrumental convergence thesis "needs to be applied carefully", and how the example you gave supports this. Even in environments where the agent is "alone", we may still expect the agent to have the following potential convergent instrumental values (which are all mentioned both in the linked paper and in the book Superintelligence as categories where "convergent instrumental values may be found"): self-preservation, cognitive enhancement, technological perfection and resource acquisition.

Comment by ofer on We Still Don't Know If Masks Work · 2021-07-05T14:32:08.280Z · LW · GW

It's probably critical to distinguish between a non-respirator mask and a respirator (a mask that is sealed to the face and is supposed to filter most of the air that is being inhaled; and exhaled, if there is no exhalation valve).

For anyone whose model for COVID-19 transmission is based on what sources like the CDC's website said prior to October: even the CDC's website now says that "breathing in air when close to an infected person who is exhaling small droplets and particles that contain the virus" is a "main way" in which one can get infected with COVID-19 (they list that as the first item on their list of "three main ways" in which COVID-19 spreads). [EDIT: I don't want people reading this to update that the risk is small if the infected person is not "close", especially when talking about enclosed spaces.]

Comment by ofer on Environmental Structure Can Cause Instrumental Convergence · 2021-06-28T16:19:06.948Z · LW · GW

Because you can do "strictly more things" with the vase (including later breaking it) than you can do after you break it, in the sense of proposition 6.9 / lemma D.49. This means that you can permute breaking-vase-is-optimal objectives into breaking-vase-is-suboptimal objectives.

Most of the reward functions are either indifferent about the vase or want to break the vase. The optimal policies of all those reward functions don't "tend to avoid breaking the vase". Those optimal policies don't behave as if they care about the 'strictly more states' that can be reached by not breaking the vase.

When the agent maximizes average reward, we know that optimal policies tend to seek power when there's something like:

"Consider state s, and consider two actions a1 and a2. When {cycles reachable after taking a1 at s} is similar to a subset of {cycles reachable after taking a2 at s}, and those two cycle sets are disjoint, then a2 tends to be optimal over a1 and a2 tends to seek power compared to a1." (This follows by combining proposition 6.12 and theorem 6.13)

Here "{cycles reachable after taking a1 at s}" actually refers an RSD, right? So we're not just talking about a set of states, we're talking about a set of vectors that each corresponds to a "state visitation distribution" of a different policy. In order for the "similar to" (via involution) relation to be satisfied, we need all the elements (real numbers) of the relevant vector pairs to match. This is a substantially more complicated condition than the one in your comment, and it is generally harder to satisfy in stochastic environments.

In fact, I think that condition is usually hard/impossible to satisfy even in toy stochastic environments. Consider a version of Pac-Man in which at least one "ghost" is moving randomly at any given time; I'll call this Pac-Man-with-Random-Ghost (a quick internet search suggests that in the real Pac-Man the ghosts move deterministically other than when they are in "Frightened" mode, i.e. when they are blue and can't kill Pac-Man).

Let's focus on the condition in Proposition 6.12 (which is identical to or less strict than the condition for the main claim, right?). Given some state in a Pac-Man-with-Random-Ghost environment, suppose that action a1 results in an immediate game-over state due to a collision with a ghost, while action a2 does not. For every terminal state s', RSD(s') is a set that contains a single vector in which all entries are 0 except for one that is non-zero. But for every state s'' that can result from action a2, we get that RSD(s'') is a set that does not contain any vector-with-0s-in-all-entries-except-one, because for any policy, there is no way to get to a particular terminal state with probability 1 (due to the location of the ghosts being part of the state description). Therefore there does not exist a subset of RSD(s'') that is similar to RSD(s') via an involution.

A similar argument seems to apply to Propositions 6.5 and 6.9. Also, I think Corollary 6.14 never applies to Pac-Man-with-Random-Ghost environments, because unless s is a terminal state, RSD(s) will not contain any vector-with-0s-in-all-entries-except-one (again, due to ghosts moving randomly). The paper claims (in the context of Figure 8 which is about Pac-Man): "Therefore, corollary 6.14 proves that Blackwell optimal policies tend to not go left in this situation. Blackwell optimal policies tend to avoid immediately dying in PacMan, even though most reward functions do not resemble Pac-Man’s original score function." So that claim relies on Pac-Man being a "sufficiently deterministic" environment and it does not apply to the Pac-Man-with-Random-Ghost version.

Can you give an example of a stochastic environment (with randomness in every state transition) to which the main claim of the paper applies?

Comment by ofer on Sam Altman and Ezra Klein on the AI Revolution · 2021-06-27T13:04:59.487Z · LW · GW

In maker-land, things don't do impressive stuff without you specifically trying to get them to do that stuff; and the hard work is always getting it to do that stuff, even though figuring out what stuff you want is also hard.

I think it's more that in maker-land, the sign of the impact usually does not appear to matter much in terms of gaining wealth/influence/status. It appears that usually, if your project has a huge impact on the world—and you're not going to jail—you win.

Comment by ofer on Environmental Structure Can Cause Instrumental Convergence · 2021-06-25T20:36:17.764Z · LW · GW

That one in particular isn't a counterexample as stated, because you can't construct a subgraph isomorphism for it.

Probably not an important point, but I don't see why we can't use the identity isomorphism (over the part of the state space that a1 leads to).

Comment by ofer on Environmental Structure Can Cause Instrumental Convergence · 2021-06-25T20:00:46.466Z · LW · GW

I was referring to the claim being made in Rohin's summary. (I no longer see counter examples after adding the assumption that "a1 and a2 lead to disjoint sets of future options".)

Comment by ofer on Environmental Structure Can Cause Instrumental Convergence · 2021-06-25T14:06:14.235Z · LW · GW

(we’re going to ignore cases where a1 or a2 is a self-loop)

I think that a more general class of things should be ignored here. For example, if a2 is part of a 2-cycle, we get the same problem as when a2 is a self-loop. Namely, we can get that most reward functions have optimal policies that take the action a1 over a2 (when the discount rate is sufficiently close to 1), which contradicts the claim being made.

Comment by ofer on Why did no LessWrong discourse on gain of function research develop in 2013/2014? · 2021-06-25T13:30:06.954Z · LW · GW

Now (after all the COVID-19 related discourse in the media), it indeed seems a lot less risky to mention GoF research. (You could have made the point that "GoF research is already happening" prior to COVID-19; but perhaps a very small fraction of people then were aware that GoF research was a thing, making it riskier to mention).

Comment by ofer on On the limits of idealized values · 2021-06-25T13:19:29.641Z · LW · GW

Or consider the idea that idealization involves or is approximated by “running a large number of copies of yourself, who then talk/argue a lot with each other and with others, […]”

Later in the "Ghost civilizations" section you mentioned the idea of ghost copies "supervising/supporting/scrutinizing an explorer trying some sort of process or stimulus that could lead to going off the rails". It's interesting to think about technologies like lie-detectors in this context, for mitigating risks like the "memetic hazards that are fatal from an evaluative perspective" that you mentioned. For example, suppose that a Supervisor Copy asks many Explorer Copies to enter a secure room that is then locked. The Explorer Copies then pursue a certain risky line of thought X. They then get to write down their conclusion, but the Supervisor Copy only gets to read it if all the Explorer Copies pass a lie-detector test in which they claim that they did not stumble upon any "memetic hazard" etc.

As an aside, all those copies can be part of a single simulation that we run for this purpose, in which they all get treated very well (even if they end up without the ability to affect anything outside the simulation).

Related to what you wrote near the end ("In a sense, I can use the image of them…"), I just want to add that using an imaginary idealized version of oneself as an advisor may be a great way to mitigate some harmful cognitive biases and also just a great productivity trick.

Comment by ofer on Discussion: Objective Robustness and Inner Alignment Terminology · 2021-06-25T11:00:58.097Z · LW · GW

Suppose we train a model, and at some point during training the inference execution hacks the computer on which the model is trained, and the computer starts doing catastrophic things via its internet connection. Does the generalization-focused approach consider this to be an outer alignment failure?

Comment by ofer on Environmental Structure Can Cause Instrumental Convergence · 2021-06-24T22:18:07.908Z · LW · GW

Optimal policies will tend to avoid breaking the vase, even though some don't. 

Are you saying that the optimal policies of most reward functions will tend to avoid breaking the vase? Why?

This is just making my point - Blackwell optimal policies tend to end up in any state but the last state, even though at any given state they tend to progress. If D1 is {the first four cycles} and D2 is {the last cycle}, then optimal policies tend to end up in D1 instead of D2. Most optimal policies will avoid entering the final state, just as section 7 claims. 

My question is just about the main claim in the abstract of the paper ("We prove that for most prior beliefs one might have about the agent's reward function [...], one should expect optimal policies to seek power in these environments."). The main claim does not apply to the simple environment in my example (i.e. we should not expect optimal policies to seek POWER in that environment). I'm completely fine with that being the case, I just want to understand why. What criterion does that environment violate?

I agree that there's room for cleaner explanation of when the theorems apply, for those readers who don't want to memorize the formal conditions. 

I counted ~19 non-trivial definitions in the paper. Also, the theorems that the main claim directly relies on (which I guess is some subset of {Proposition 6.9, Proposition 6.12, Theorem 6.13}?) seem complicated. So I think the paper should definitely provide a reasonably simple description of the set of MDPs that the main claim applies to, and explain why proving things on that set is useful.

But I think the theory says interesting things because it's already starting to explain the things I built it to explain (e.g. SafeLife). And whenever I imagine some new environment I want to reason about, I'm almost always able to reason about it using my theorems (modulo already flagged issues like partial observability etc). From this, I infer that the set of MDPs is "interesting enough."

Do you mean that the main claim of the paper actually applies to those environments (i.e. that they are in the formal set of MDPs that the relevant theorems apply to) or do you just mean that optimal policies in those environments tend to be POWER-seeking? (The main claim only deals with sufficient conditions.)

Comment by ofer on Environmental Structure Can Cause Instrumental Convergence · 2021-06-24T18:24:43.607Z · LW · GW

The paper supports the claim with:

  • Embodied environment in a vase-containing room (section 6.3)

I think this refers to the following passage from the paper:

Consider an embodied navigation task through a room with a vase. Proposition 6.9 suggests that optimal policies tend to avoid breaking the vase, since doing so would strictly decrease available options.

This seems to me like a counter example. For any reward function that does not care about breaking the vase, the optimal policies do not avoid breaking the vase.

Regarding your next bullet point:

  • Pac-Man (figure 8)
    • And section 7 argues why this generally holds whenever the agent can be shut down (a large class of environments indeed)

I don't know what you mean here by "generally holds". When does an environment—in which the agent can be shut down—"have the right symmetries" for the purpose of the main claim? Consider the following counter example (in which the last state is equivalent to the agent being shut down):

In most states (the first 3 states) the optimal policies of most reward functions transition to the next state, while the POWER-seeking behavior is to stay in the same state (when the discount rate is sufficiently close to 1). If we want to tell a story about this environment, we can say that it's about a car in a one-way street.

To be clear, the issue I'm raising here about the paper is NOT that the main claim does not apply to all MDPs. The issue is the lack of (1) a reasonably simple description of the set of MDPs that the main claim applies to; and (2) an explanation for why it is useful to prove things about that set.

Sorry - I meant the "future work" portion of the discussion section 7. The future work highlights the "note of caution" bits.

The limitations mentioned there are mainly: "Most real-world tasks are partially observable" and "our results only apply to optimal policies in finite MDPs". I think that another limitation that belongs there is that the main claim only applies to a particular set of MDPs.
 

Comment by ofer on Environmental Structure Can Cause Instrumental Convergence · 2021-06-24T00:11:29.534Z · LW · GW

For my part, I either strongly disagree with nearly every claim you make in this comment, or think you're criticizing the post for claiming something that it doesn't claim (e.g. "proves a core AI alignment argument"; did you read this post's "A note of caution" section / the limitations section and conclusion of the paper v.7?).

I did read the "Note of caution" section in the OP. It says that most of the environments we think about seem to "have the right symmetries", which may be true, but I haven't seen the paper support that claim.

Maybe I just missed it, but I didn't find a "limitations section" or similar in the paper. I did find the following in the Conclusion section:

We caution that many real-world tasks are partially observable and that learned policies are rarely optimal. Our results do not mathematically prove that hypothetical superintelligent AI agents will seek power.

Though the title of the paper can still give the impression that it proves a core argument for AI x-risk.

Also, plausibly-the-most-influential-critic-of-AI-safety in EA seems to have gotten the impression (from an earlier version of the paper) that it formalizes the instrumental convergence thesis (see the first paragraph here). So I think my advice that "it should not be cited as a paper that formally proves a core AI alignment argument" is beneficial.

I don't think it will be useful for me to engage in detail, given that we've already extensively debated these points at length, without much consensus being reached.

For reference (in case anyone is interested in that discussion): I think it's the thread that starts here (just the part after "2.").

Comment by ofer on Environmental Structure Can Cause Instrumental Convergence · 2021-06-24T00:08:44.947Z · LW · GW

No worries, thanks for the clarification.

[EDIT: the confusion may have resulted from me mentioning the LW username "adamShimi", which I'll now change to the display name on the AF ("Adam Shimi").]

Comment by ofer on Environmental Structure Can Cause Instrumental Convergence · 2021-06-23T23:53:20.528Z · LW · GW

Meta: it seems that my original comment was silently removed from the AI Alignment Forum. I ask whoever did this to explain their reasoning here. Since every member of the AF could have done this AFAIK, I'm going to try to move my comment back to AF, because I think it obviously belongs there (I don't believe we have any norms about this sort of situations...). If the removal was done by a forum moderator/admin, please let me know.

Comment by ofer on Environmental Structure Can Cause Instrumental Convergence · 2021-06-23T21:13:20.886Z · LW · GW

I've ended up spending probably more than 40 hours discussing, thinking about, and reading this paper (including earlier versions; the paper was first published in December 2019, and the current version is the 7th, published on June 1st, 2021). My impression is very different from Adam Shimi's. The paper introduces many complicated definitions that build on each other, and its theorems say complicated things using those complicated definitions. I don't think the paper explains how its complicated theorems are useful/meaningful.

In particular, I don't think the paper provides a simple description for the set of MDPs that the main claim in the abstract applies to ("We prove that for most prior beliefs one might have about the agent's reward function […], one should expect optimal policies to seek power in these environments."). Nor do I think that the paper justifies the relevance of that set of MDPs. (Why is it useful to prove things about it?)

I think this paper should probably not be used for outreach interventions (even if it gets accepted to NeurIPS/ICML). And especially, I think it should not be cited as a paper that formally proves a core AI alignment argument.

Also, there may be a misconception that this paper formalizes the instrumental convergence thesis. That seems wrong, i.e. the paper does not seem to claim that several convergent instrumental values can be identified. The only convergent instrumental value that the paper attempts to address AFAICT is self-preservation (avoiding terminal states).

(The second version of the paper said: "Theorem 49 answers yes, optimal farsighted agents will usually acquire resources". But the current version just says "Extrapolating from our results, we conjecture that Blackwell optimal policies tend to seek power by accumulating resources[…]").

Sorry for the awkwardness (this comment was difficult to write). But I think it is important that people in the AI alignment community publish these sorts of thoughts. Obviously, I can be wrong about all of this.

Comment by ofer on Why did no LessWrong discourse on gain of function research develop in 2013/2014? · 2021-06-20T09:59:32.536Z · LW · GW

(This isn't an attempt to answer the question, but…) My best guess is that info hazard concerns reduced the amount of discourse on GoF research to some extent.

Comment by ofer on The aducanumab approval · 2021-06-18T07:44:08.820Z · LW · GW

Or is there some kind of pay-per-treatment incentive that will make doctors want to prescribe it?

(This isn't a response about this particular drug or its manufacturer.) I think that generally, large pharmaceutical companies tend to use sophisticated methods to convert dollars into willingness-of-doctors-to-prescribe-their-drugs. I'm not talking about explicit kickback schemes (which are not currently legal in most places?) but rather stuff like paying doctors consulting fees etc. and hoping that such payments cause the doctor to prescribe their drug (due to the doctor's expectation that that will influence further payments, or just due to the doctor's human disposition to reciprocate). Plausibly, most doctors who participate in such a thing don't fully recognize that the pharmaceutical company's intention is to influence what they prescribe, and their participation comes about via cognitive biases rather than via mindful decisions.

Also, not all doctors are great at interpreting/evaluating research papers/claims (especially when there are lots of conflict-of-interest issues involved).