Or is there some kind of pay-per-treatment incentive that will make doctors want to prescribe it?
(This isn't a response about this particular drug or its manufacturer.) I think that, generally, large pharmaceutical companies tend to use sophisticated methods to convert dollars into willingness-of-doctors-to-prescribe-their-drugs. I'm not talking about explicit kickback schemes (which I believe are illegal in most places) but rather things like paying doctors consulting fees and hoping that such payments cause the doctor to prescribe their drug (due to the doctor's expectation that doing so will influence further payments, or just due to the doctor's human disposition to reciprocate). Plausibly, most doctors who participate in such arrangements don't fully recognize that the pharmaceutical company's intention is to influence what they prescribe; their participation is mediated by cognitive biases rather than by mindful choice.
Also, not all doctors are great at interpreting/evaluating research papers/claims (especially when there are lots of conflict-of-interest issues involved).
The following is a naive attempt to write a formal, sufficient condition for a search process to be "not safe with respect to inner alignment".
D: a distribution of labeled examples. Abuse of notation: I'll assume that we can deterministically sample a sequence of examples from D.
L: a deterministic supervised learning algorithm that outputs an ML model. L has access to an infinite sequence of training examples that is provided as input; and it uses a certain "amount of compute" c that is also provided as input. If we operationalize L as a Turing Machine then c can be the number of steps that L is simulated for.
L(D,c): The ML model that L outputs when given an infinite sequence of training examples that was deterministically sampled from D; and c as the "amount of compute" that L uses.
a_{L,D}(c): The accuracy of the model L(D,c) over D (i.e. the probability that the model L(D,c) will be correct for a random example that is sampled from D).
Finally, we say that the learning algorithm L Fails The Basic Safety Test with respect to the distribution D if the accuracy a_{L,D}(c) is not weakly increasing as a function of c.
Note: The "not weakly increasing" condition seems too weak. It should probably be replaced with a stricter condition, but I don't know what that stricter condition should look like.
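A minimal empirical sketch of this test. The `learner` and `sample_examples` interfaces are hypothetical placeholders for the L and D defined above, and a finite spot-check like this can only refute (not establish) the weak-monotonicity condition, which quantifies over all c:

```python
def fails_basic_safety_test(learner, sample_examples, compute_budgets):
    """Empirically check whether accuracy a_{L,D}(c) is weakly increasing in c.

    `learner(examples, c)` returns a model (a callable x -> label) trained
    with compute budget c; `sample_examples(n)` deterministically samples n
    labeled (x, y) examples from D. Both interfaces are hypothetical.
    """
    train = sample_examples(1000)
    test = sample_examples(1000)
    prev_acc = -1.0
    for c in sorted(compute_budgets):
        model = learner(train, c)
        acc = sum(model(x) == y for (x, y) in test) / len(test)
        if acc < prev_acc:
            return True  # accuracy dropped as compute increased: fails the test
        prev_acc = acc
    return False
```

Returning True here corresponds to L Failing The Basic Safety Test on the sampled budgets.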
Consider adding to the paper a high-level/simplified description of the environments for which the following sentence from the abstract applies: "We prove that for most prior beliefs one might have about the agent’s reward function [...] one should expect optimal policies to seek power in these environments." (If it's the set of environments in which "the “vast majority” of RSDs are only reachable by following a subset of policies" consider clarifying that in the paper). It's hard (at least for me) to infer that from the formal theorems/definitions.
It isn't the size of the object that matters here; the key considerations are structural. In this unrolled model, the unrolled state factors into the (action history) and the (world state). This is not true in general for other parts of the environment.
My "unrolling trick" argument doesn't require an easy way to factor states into [action history] and [the rest of the state from which the action history can't be inferred]. A sufficient condition for my argument is that the complete action history can be inferred from every reachable state. When this condition holds, the environment implicitly contains an action log (for the purpose of my argument), and thus the POWER (IID) of all the states is equal. And as I've argued before, this condition seems plausible for sufficiently complex real-world-like environments. BTW, any deterministic time-reversible environment fulfills this condition, except for cases where multiple actions can yield the same state transition (in which case we may not be able to infer which of those actions was chosen at the relevant time step).
It's easier to find reward functions that incentivize a given action sequence if the complete action history can be inferred from every reachable state (and the ease of doing so depends on how easy it is to compute the action history from the state). I don't see how this fact relates to instrumental convergence supposedly disappearing for "most objectives" [EDIT: when using a simplicity prior over objectives; otherwise, instrumental convergence may not apply regardless]. Generally, if an action log constitutes a tiny fraction of the environment, its existence shouldn't affect properties of "most objectives" (regardless of whether we use the uniform prior or a simplicity prior).
see also: "optimal policies tend to take actions which strictly preserve optionality*"
Does this quote refer to a passage from the paper? (I didn't find it.)
It certainly has some kind of effect, but I don't find it obvious that it has the effect you're seeking - there are many simple ways of specifying action-history+state reward functions, which rely on the action-history and not just the rest of the state.
There are very few reward functions that rely on the action history—and can be specified in a simple way—relative to all the reward functions that rely on the action history (you need at least 2^n bits to specify an arbitrary reward function that considers n actions, when using a uniform prior). Also, I don't think that the action log is special in this context relative to any other object that constitutes a tiny part of the environment.
What's special is that (by assumption) the action logger always logs the agent's actions, even if the agent has been literally blown up in-universe. That wouldn't occur with the security camera. With the security camera, once the agent is dead, the agent can no longer influence the trajectory, and the normal death-avoiding arguments apply. But your action logger supernaturally writes a log of the agent's actions into the environment.
If we assume that the action logger can always "detect" the action that the agent chooses, this issue doesn't apply. (Instead of the agent being "dead" we can simply imagine that the robot/actuators are in a box and can't influence anything outside the box; which is functionally equivalent to being "dead" if the box is a sufficiently small fraction of the environment.)
Right, but if you want the optimal policies to take actions a_1, …, a_k, then write a reward function which returns 1 iff the action-logger begins with those actions and 0 otherwise. Therefore, it's extremely easy to incentivize arbitrary action sequences.
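As a toy sketch of that construction (the `(world_state, action_log)` state shape is a stand-in for however the log is embedded in the state, not anything from the paper):

```python
def reward_for_sequence(target):
    """Return a reward function that outputs 1 iff the state's action log
    begins with the target action sequence, and 0 otherwise.

    Assumes states are (world_state, action_log) pairs, with the log a tuple
    of the actions taken so far -- a hypothetical encoding of the action logger.
    """
    target = tuple(target)

    def reward(state):
        _world, log = state
        return 1.0 if tuple(log[:len(target)]) == target else 0.0

    return reward
```

Any policy that starts with the target actions then collects reward 1 at every subsequent state, so optimal policies are exactly those that begin with that sequence.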
Sure, but I still don't understand the argument here. It's trivial to write a reward function that doesn't yield instrumental convergence regardless of whether one can infer the complete action history from every reachable state. Every constant function is such a reward function.
The theorems hold for all finite MDPs in which the formal sufficient conditions are satisfied (i.e. the required environmental symmetries exist; see proposition 6.9, theorem 6.13, corollary 6.14). For practical advice, see subsection 6.3 and beginning of section 7.
It seems to me that the (implicit) description in the paper of the set of environments over which "one should expect optimal policies to seek power" ("for most prior beliefs one might have about the agent’s reward function") involves a lot of formalism/math. I was looking for some high-level/simplified description (in English), and found the following (perhaps there are other passages that I missed):
Loosely speaking, if the “vast majority” of RSDs are only reachable by following a subset of policies, theorem 6.13 implies that that subset tends to be Blackwell optimal.
Isn't the thing we condition on here similar (roughly speaking) to your interpretation of instrumental convergence? (Is the condition for when "[…] one should expect optimal policies to seek power" made weaker by another theorem?)
I agree that you can do that. I also think that instrumental convergence doesn't apply in such MDPs (as in, "most" goals over the environment won't incentivize any particular kind of optimal action), unless you restrict to certain kinds of reward functions.
I think that using a simplicity prior over reward functions has a similar effect to "restricting to certain kinds of reward functions".
I didn't understand the point you were making with your explanation that involved a max-ent distribution. Why is the action logger treated in your explanation as some privileged object? What's special about it relative to all the other stuff that's going on in our arbitrarily complex environment? If you imagine an MDP environment where the agent controls a robot in a room that has a security camera in it, and the recorded video is part of the state, then the recorded video is doing all the work that we need an action logger to do (for the purpose of my argument).
When defined over state-action histories, it's dead easy to write down objectives which don't pursue instrumental subgoals.
In my action-logger example, the action log is just a tiny part of the state representation (just like a certain blog or a video recording are a very tiny part of the state of our universe). The reward function is a function over states (or state-action pairs) as usual, not state-action histories. My "unrolling trick" doesn't involve utility functions that are defined over state(-action) histories.
I'll address everything in your comment, but first I want to zoom out and say/ask:
In environments that have a state graph that is a tree-with-constant-branching-factor, the POWER—defined over an IID-over-states reward distribution—is equal in all states. I argue that environments with very complex physical dynamics are often like that, but not if at some time step the agent can't influence the environment. (I think we agree so far?) I further argue that we can take any MDP environment and "unroll" its state graph into a tree-with-constant-branching-factor (e.g. by adding an "action log" to the state representation) such that we get a "functionally equivalent" MDP in which the POWER (IID) of all the states is equal. My best guess is that you don't agree with this point, or think that the instrumental convergence thesis doesn't apply in a meaningful sense to such MDPs (but I don't yet understand why).
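A minimal sketch of the unrolling trick for a deterministic MDP (the names here are mine, not from the paper):

```python
def unroll(transition):
    """Unroll a deterministic MDP by appending the action history to the state.

    `transition(s, a)` is the original transition function. In the unrolled
    MDP a state is an (original_state, action_log) pair, so distinct action
    sequences always reach distinct states: with a fixed action set, the
    state graph becomes a tree with a constant branching factor.
    """
    def unrolled(state, action):
        s, log = state
        return (transition(s, action), log + (action,))
    return unrolled
```

The two MDPs are "functionally equivalent" in the sense that any reward function R over original states lifts to R'((s, log)) = R(s) without changing which policies are optimal; even when every action in the original MDP collapses to the same state, the unrolled states differ.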
Regarding the theorems (in the POWER paper; I've now spent some time on the current version): The abstract of the paper says: "With respect to a class of neutral reward function distributions, we provide sufficient conditions for when optimal policies tend to seek power over the environment." I didn't find a description of those sufficient conditions (maybe I just missed it?). AFAICT, MDPs that contain "reversible actions" (other than self-loops in terminal states) are generally problematic for POWER (IID). (I'm calling action a from state s "reversible" if it allows the agent to return to s at some point). POWER-seeking (in the limit as γ approaches 1) will always imply choosing a reversible action over a non-reversible action, and if the only reversible action is a self-loop, POWER-seeking means staying in the same state forever. Note that if there are sufficiently many terminal states (or loops more generally) that require a certain non-reversible action to reach, it will be the case that most optimal policies prefer that non-reversible action over any POWER-seeking (reversible) action. In particular, non-terminal self-loops seem to be generally problematic for POWER(IID); for example consider:
The first state has the largest POWER (IID), but for most reward functions the optimal policy is to immediately transition to a lower-POWER state (even in the limit as γ approaches 1). The paper says: "Theorem 6.6 shows it’s always robustly instrumental and power-seeking to take actions which allow strictly more control over the future (in a graphical sense)." I don't yet understand the theorem, but is there somewhere a description of the set/distribution of MDP transition functions for which that statement applies? (Specifically, the "always robustly instrumental" part, which doesn't seem to hold in the example above.)
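To make the self-loop example above concrete, here is a Monte Carlo sketch of a toy MDP of that shape (my own construction, not from the paper): one initial state s0 with a self-loop, from which n terminal self-loop states are reachable, with rewards drawn IID uniform[0,1] per state.

```python
import random

def fraction_that_leave(num_terminals=5, samples=20_000, seed=0):
    """One non-terminal state s0 (with a self-loop) can transition to any of
    `num_terminals` terminal self-loop states. With IID uniform[0,1] rewards
    and gamma -> 1, the optimal policy stays at s0 iff R(s0) is the maximum
    reward, so it leaves s0 with probability n/(n+1).
    """
    rng = random.Random(seed)
    leave = 0
    for _ in range(samples):
        rewards = [rng.random() for _ in range(num_terminals + 1)]  # index 0 is s0
        if rewards[0] < max(rewards[1:]):  # some terminal state beats s0
            leave += 1
    return leave / samples
```

So even though s0 has the largest POWER (IID), for most sampled reward functions the optimal policy immediately moves to a lower-POWER state.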
Regarding the points from your last comment:
we should be able to ground the instrumental convergence arguments via reward functions in some way.
Maybe for this purpose we should weight reward functions by how likely we are to encounter an AI system that pursues them (this should probably involve a simplicity prior.)
What does it mean to "shut down" the process? 'Doesn't mean they won't' - so new strings will appear in the environment? Then how was the agent "shut down"?
Suppose the agent causes the customer to invoke some process that hacks a bank and causes recurrent massive payments (trillions of dollars) to appear as being received by the relevant company. Someone at the bank notices this and shuts down the compromised system, which stops the process.
What is it instead?
Suppose the state representation is a huge list of 3D coordinates, each specifying the location of an atom in the earth-like environment. The transition function mimics the laws of physics on Earth (+ "magic" that makes the text that the agent chooses appear in the environment in each time step). It's supposed to be "an Earth-like MDP".
We're considering description length? Now it's not clear that my theory disagrees with your prediction, then. If you say we have a simplicity prior over reward functions given some encoding, well, POWER and optimality probability now reflect your claims, and they now say there is instrumental convergence to the extent that that exists under a simplicity prior?
Are you referring here to POWER when it is defined over a reward distribution that corresponds to some simplicity prior? (I was talking about POWER defined over an IID-over-states reward distribution, which I think is what the theorems in the paper deal with.)
And to the extent we were always considering description length - was the problem that IID-optimality probability doesn't reflect simplicity-weighted behavioral tendencies?
My argument is just that in MDPs where the state graph is a tree-with-a-constant-branching-factor—which is plausible in very complex environments—POWER (IID) is equal in all states. The argument doesn't mention description length (the description length concept arose in this thread in the context of discussing what reward function distribution should be used for defining instrumental convergence).
I still don't know what it would mean for Ofer-instrumental convergence to exist in this environment, or not.
Maybe something like Bostrom's definition, with "wide range of final goals" replaced by "reward functions weighted by the probability that we'll encounter an AI that pursues the reward function" (which means using a simplicity prior). It seems to me that your comment assumes/claims something like: "in every MDP where the state graph is a tree-with-a-constant-branching-factor, there is no meaningful sense in which instrumental convergence applies". If so, I argue that claim doesn't make sense: you can take any formal environment, however large and complex, and just add to it a simple "action logger" (that doesn't influence anything, other than effectively adding to the state representation a log of all the actions so far). If the action space is constant, the state graph of the modified MDP is a tree-with-a-constant-branching-factor; which would imply that adding that action logger somehow destroyed the applicability of the instrumental convergence thesis to that MDP; which doesn't make sense to me.
There may be many people working for top orgs (in the donor's judgment) who are able to convert additional money to productivity effectively. This seems especially likely in academic orgs where the org probably faces strict restrictions on salaries. (But I won't be surprised if it's similarly the case for other orgs). So a private donor could solicit applications (with minimal form filling) from such people, and then distribute the donation between those who applied.
So we can't set the 'arbitrary' part aside - instrumentally convergent means that the incentives apply across most reward functions - not just for one. You're arguing that one reward function might have that incentive. But why would most goals tend to have that incentive?
I was talking about a particular example, with a particular reward function that I had in mind. We seemed to disagree about whether instrumental convergence arguments apply there, and my purpose in that comment was to argue that they do. I'm not trying to define here the set of reward functions over which instrumental convergence arguments apply (they obviously don't apply to all reward functions, as for every possible policy you can design a reward function for which that policy is optimal).
This doesn't make sense to me. We assumed the agent is Cartesian-separated from the universe, and its actions magically make strings appear somewhere in the world. How could humans interfere with it? What, concretely, are the "risks" faced by the agent?
E.g. humans noticing that something weird is going on and trying to shut down the process. (Shutting down the process doesn't mean that new strings won't appear in the environment and cause the state graph to become a tree-with-constant-branching-factor due to complex physical dynamics.)
(Technically, the agent's goals are defined over the text-state
Not in the example I have in mind. Again, let's say the state representation determines the location of every atom in that earth-like environment. (I think that's the key miscommunication here; the MDP I'm thinking about is NOT a "sequential string output MDP", if I understand your use of that phrase correctly. [EDIT: my understanding is that you use that phrase to describe an MDP in which a state is just the sequence of strings in the exchange so far.] [EDIT 2: I think this miscommunication is my fault, due to me writing in my first comment: "the state representation may be uniquely determined by all the text that was written so far by both the customer and the chatbot", sorry for that.])
This statement is vacuous, because it's true about any possible string.
I agree the statement would be true with any possible string; this doesn't change the point I'm making with it. (Consider this to be an application of the more general statement with a particular string.)
But then why do you claim that most reward functions are attracted to certain branches of the tree, given that regularity?
For every subset of branches in the tree you can design a reward function for which every optimal policy tries to go down those branches; I'm not saying anything about "most reward functions". I would focus on statements that apply to "most reward functions" if we dealt with an AI whose reward function was sampled uniformly from all possible reward functions. But that scenario does not seem relevant (in particular, something like Occam's razor seems relevant: our prior credence should be larger for reward functions with a shorter shortest-description).
what do you mean by instrumental convergence?
The non-formal definition in Bostrom's Superintelligence (which does not specify a set of reward functions but rather says "a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents.").
So if you disagree, please explain why arbitrary reward functions tend to incentivize outputting one string sequence over another?
(Setting aside the "arbitrary" part, because I didn't talk about an arbitrary reward function…)
Consider a string, written by the chatbot, that "hacks" the customer and causes them to invoke a process that quickly takes control over most of the computers on earth that are connected to the internet, then "hacks" most humans on earth by showing them certain content, and so on (to prevent interference and to seize control ASAP); all for the purpose of maximizing whatever counts as the total discounted payments by the customer (which can look like, say, setting particular memory locations in a particular computer to a particular configuration), and minimizing low-probability risks (from the perspective of the agent).
If such a string (one that causes the above scenario) exists, then any optimal policy will either involve such a string or different strings that allow at least as much expected return.
is the amount of money paid by the client part of the state?
Yes; let's say the state representation determines the location of every atom in that earth-like environment. The idea is that the environment is very complicated (and contains many actors) and thus the usual arguments for instrumental convergence apply. (If this fails to address any of the above issues let me know.)
OK, but now that seems okay again, because there isn't any instrumental convergence here either. This is just an alternate representation ('reskin') of a sequential string output MDP, where the agent just puts a string in slot t at time t.
I think we're still not thinking about the same thing; in the example I'm thinking about the agent is supposed to fill the role of a human salesperson, and the reward function is (say) the amount of money that the client paid (possibly over a long time period). So an optimal policy may be very complicated and involve instrumentally convergent goals.
I was imagining a formal (super-complex) MDP that looks like our world. The customer in my example is meant to be equivalent to a human on earth.
But I haven't taken into account that this runs into embedded agency issues. (E.g. what does the state transition function look like when the computer that "runs the agent" is down?)
And if you update the encodings and dynamics to account for real-world resource gain possibilities, then POWER and optimality probability will update accordingly and appropriately.
Because states from which the agent can (say) prevent its computer from being turned off have larger POWER? That's an interesting point that didn't occur to me while writing that comment. Though it involves the unresolved (for me) embedded agency issues. Let's side-step those issues by not having a computer running the agent inside the environment, but rather having the text string that the agent chooses in each time step magically appear somewhere in the environment. The question is now whether it's possible to get to the same state with two different sequences of strings. This depends on the state representation & state transition function; it can be the case that the state is uniquely determined by the agent's sequence of past strings, which would mean that POWER is equal in all states.
There aren't any robustly instrumental goals in this setting, as best I can tell.
If we consider a sufficiently high level of capability, the instrumental convergence thesis applies. (E.g. the agent might manipulate/hack the customer and then gain control over resources, and stop anyone from interfering with its plan.)
I agree that in MDP problems in which the agent can lose its ability to influence the environment, we can generally expect POWER to correlate with not-losing-the-ability-to-influence-the-environment. The environments in such problems never have a state graph that is a tree-with-a-constant-branching-factor, no matter how complex they are, and thus my argument doesn't apply to them. (And I think publishing work about such MDP environments may be very useful.)
I don't think all real-world problems are like that (though many are), and the choice of the state representation and action space may determine whether a problem is like that. For example, consider a salesperson-like chatbot where an episode is a single exchange with a customer. If implemented via RL, the state representation may be uniquely determined by all the text that was written so far by both the customer and the chatbot. But does this include messages that the chatbot sent after the customer closed their browser window? If so, POWER—when defined over an IID-over-states reward distribution—is constant.
I would sure be awfully surprised to see that! Wouldn't you?
My surprise would stem from observing that RL in a trivial environment yielded a system that is capable of calculating/reasoning-about π. If you replace the PacMan environment with a complex environment and sufficiently scale up the architecture and training compute, I wouldn't be surprised to learn the system is doing very impressive computations that have nothing to do with the intended objective.
Note that the examples in my comment don't rely on deceptive alignment. To "convert" your PacMan RL agent example to the sort of examples I was talking about: suppose that the objective the agent ends up with is "make the relevant memory location in the RAM say that I won the game", or "win the game in all future episodes".
By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentionally or not), and (2) things that are indirectly incentivized by the training signal (they're instrumentally useful, or they're a side-effect, or they “come along for the ride” for some other reason), (3) things that are so simple to do that they can happen randomly.
We can also get a model that has an objective that is different from the intended formal objective (never mind whether the latter is aligned with us). For example, SGD may create a model with a different objective that is identical to the intended objective just during training (or some part thereof). Why would this be unlikely? The intended objective is not privileged over such other objectives, from the perspective of the training process.
Evan gave an example related to this, where the intention was to train a myopic RL agent that goes through blue doors in the current episode, but the result is an agent with a more general objective that cares about blue doors in future episodes as well. In Evan's words (from the Future of Life podcast):
You can imagine a situation where every situation where the model has seen a blue door, it’s been like, “Oh, going through this blue is really good,” and it’s learned an objective that incentivizes going through blue doors. If it then later realizes that there are more blue doors than it thought because there are other blue doors in other episodes, I think you should generally expect it’s going to care about those blue doors as well.
Similar concerns are relevant for (self-)supervised models, in the limit of capability. If a network can model our world very well, the objective that SGD yields may correspond to caring about the actual physical RAM of the computer on which the inference runs (specifically, the memory location that stores the loss of the inference). Also, if any part of the network, at any point during training, corresponds to dangerous logic that cares about our world, the outcome can be catastrophic (and the probability of this seems to increase with the scale of the network and training compute).
Just to summarize my current view: For MDP problems in which the state representation is very complex, and different action sequences always yield different states, POWER-defined-over-an-IID-reward-distribution is equal for all states, and thus does not match the intuitive concept of power.
At some level of complexity such problems become relevant (when dealing with problems with real-world-like environments). These are not just problems that show up when one adversarially constructs an MDP problem to game POWER, or when one makes "really weird modelling choices". Consider a real-world-inspired MDP problem where a state specifies the location of every atom. What makes POWER-defined-over-IID problematic in such an environment is the sheer complexity of the state, which makes it so that different action sequences always yield different states. It's not "weird modeling decisions" causing the problem.
I also (now) think that for some MDP problems (including many grid-world problems), POWER-defined-over-IID may indeed match the intuitive concept of power well, and that publications about such problems (and theorems about POWER-defined-over-IID) may be very useful for the field. Also, I see that the abstract of the paper no longer makes the claim "We prove that, with respect to a wide class of reward function distributions, optimal policies tend to seek power over the environment", which is great (I was concerned about that claim).
You shouldn't need to contort the distribution used by POWER to get reasonable outputs.
I think using a well-chosen reward distribution is necessary, otherwise POWER depends on arbitrary choices in the design of the MDP's state graph. E.g. suppose the student in the above example writes about every action they take in a blog that no one reads, and we choose to include the content of the blog as part of the MDP state. This arbitrary choice effectively unrolls the state graph into a tree with a constant branching factor (+ self-loops in the terminal states) and we get that the POWER of all the states is equal.
This is superficially correct, but we have to be careful because
the theorems don't deal with the partially observable case,
this implies an infinite state space (not accounted for by the theorems),
The "complicated MDP environment" argument does not need partial observability or an infinite state space; it works for any MDP where the state graph is a finite tree with a constant branching factor. (If the theorems require infinite horizon, add self-loops to the terminal states.)
A person does not become less powerful (in the intuitive sense) right after paying college tuition (or right after getting a vaccine) due to losing the ability to choose whether to do so. [EDIT: generally, assuming they make their choices wisely.]
I think POWER may match the intuitive concept when defined over certain (perhaps very complicated) reward distributions; rather than reward distributions that are IID-over-states (which is what the paper deals with).
Actually, in a complicated MDP environment—analogous to the real world—in which every sequence of actions results in a different state (i.e. the graph of states is a tree with a constant branching factor), the POWER of all the states that the agent can get to in a given time step is equal; when POWER is defined over an IID-over-states reward distribution.
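The tree case can be illustrated with a small Monte Carlo sketch (my own toy construction). Since every subtree of a given depth is isomorphic to every other, the expected optimal value under an IID-over-states reward distribution depends only on the remaining depth, which is the sense in which POWER (IID) is equal across same-depth states:

```python
import random

def mean_optimal_value(depth, branching, gamma=0.9, samples=2000, seed=0):
    """Monte Carlo estimate of E_R[V*(s)] for a state with `depth` remaining
    steps in a tree-shaped MDP with a constant branching factor, where each
    state's reward is drawn IID uniform[0,1]. Two distinct states at the same
    depth root isomorphic subtrees, so their V* distributions are identical.
    """
    rng = random.Random(seed)

    def v_star(d):
        r = rng.random()  # this state's freshly sampled reward
        if d == 0:
            return r
        # optimal value: own reward plus discounted best child value
        return r + gamma * max(v_star(d - 1) for _ in range(branching))

    return sum(v_star(depth) for _ in range(samples)) / samples
```

Running this for two "different" same-depth states (i.e. two independent reward samples) gives estimates that agree up to Monte Carlo noise.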
I probably should have written the "because ..." part better. I was trying to point at the same thing Rohin pointed at in the quoted text.
Taking a quick look at the current version of the paper, my point still seems to me relevant. For example, in the environment in figure 16, with a discount rate of ~1, the maximally POWER-seeking behavior is to always stay in the same first state (as noted in the paper), from which all the states are reachable. This is analogous to the student from Rohin's example who takes a gap year instead of going to college.
By “power” I mean something like: the type of thing that helps a wide variety of agents pursue a wide variety of objectives in a given environment. For a more formal definition, see Turner et al (2020).
I think the draft tends to use the term power to point to an intuitive concept of power/influence (the thing that we expect a random agent to seek due to the instrumental convergence thesis). But I think the definition above (or at least the version in the cited paper) points to a different concept, because a random agent has a single objective (rather than an intrinsic goal of getting to a state that would be advantageous for many different objectives). Here's a relevant passage by Rohin Shah from the Alignment Newsletter (AN #78) pertaining to that definition of power:
You might think that optimal agents would provably seek out states with high power. However, this is not true. Consider a decision faced by high school students: should they take a gap year, or go directly to college? Let’s assume college is necessary for (100-ε)% of careers, but if you take a gap year, you could focus on the other ε% of careers or decide to go to college after the year. Then in the limit of farsightedness, taking a gap year leads to a more powerful state, since you can still achieve all of the careers, albeit slightly less efficiently for the college careers. However, if you know which career you want, then it is (100-ε)% likely that you go to college, so going to college is very strongly instrumentally convergent even though taking a gap year leads to a more powerful state.
[EDIT: I should note that I didn't understand the cited paper as originally published (my interpretation of the definition is based on an earlier version of this post). The first author has noted that the paper has been dramatically rewritten to the point of being a different paper, and I haven't gone over the new version yet, so my comment might not be relevant to it.]
Maybe "logical counterfactuals" are also relevant here (in the way I've used them in this post). For example, consider a reward function that depends on whether the first 100 digits after the 10^100th digit in the decimal representation of π are all 0. I guess this example is related to the "closest non-expert model" concept.
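For concreteness, here is a minimal sketch of such a reward function. The offset is a small, tractable number standing in for 10^100 (which no physical computer can reach); the function names are illustrative, not from any existing codebase:

```python
from decimal import Decimal, getcontext

def pi_digits(n):
    """First n digits of pi after the decimal point, via Machin's formula:
    pi = 16*arctan(1/5) - 4*arctan(1/239)."""
    getcontext().prec = n + 20  # extra guard digits

    def atan_inv(x):
        # Taylor series for arctan(1/x) = sum_k (-1)^k / ((2k+1) * x^(2k+1))
        total = Decimal(0)
        term = Decimal(1) / x
        eps = Decimal(10) ** (-(n + 10))
        k = 0
        while term > eps:
            total += term / (2 * k + 1) * (1 if k % 2 == 0 else -1)
            term /= x * x
            k += 1
        return total

    pi = 16 * atan_inv(5) - 4 * atan_inv(239)
    return str(pi)[2:n + 2]  # drop the leading "3."

def reward(offset, run_length):
    """1 if the `run_length` digits of pi after position `offset` are all
    zero, else 0. (Toy stand-in: the 10^100 offset in the example above is
    computationally out of reach.)"""
    digits = pi_digits(offset + run_length)
    return int(digits[offset:offset + run_length] == "0" * run_length)
```

The point of the example survives the small offset: the reward is a fixed mathematical fact, so "what the agent would observe under a different value of the digits" is a logical counterfactual rather than an empirical one.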
For any competitive alignment scheme that involves helper (intermediate) ML models, I think we can construct the following story about an egregiously misaligned AI being created:
Suppose that there does not exist an ML model (in the model space being searched) that fulfills both the following conditions:
The model is useful for either creating safe ML models or evaluating the safety of ML models, in a way that allows the scheme to remain competitive.
The model is sufficiently simple/weak/narrow such that it's either implausible that the model is egregiously misaligned, or if it is in fact egregiously misaligned researchers can figure that out—before it's too late—without using any other helper models.
To complete the story: while we follow our alignment scheme, at some point we train a helper model that is egregiously misaligned, and we don't yet have any other helper model that allows us to mitigate the associated risk.
If you don't find this story plausible, consider all the creatures that evolution created on the path from the first mammal to humans. The first mammal fulfills condition 2 but not 1. Humans might fulfill condition 1, but not 2. It seems that human evolution did not create a single creature that fulfills both conditions.
One might object to this analogy on the grounds that evolution did not optimize to find a solution that fulfills both conditions. But it's not like we know how to optimize for that (while doing a competitive search over a space of ML models).
To extend Evan's comment about coordination between deceptive models: Even if the deceptive models lack relevant game-theoretic mechanisms, they may still coordinate due to being (partially) aligned with each other. For example, a deceptive model X may prefer [some other random deceptive model seizing control] over [model X seizing control with probability 0.1% and the entire experiment being terminated with probability 99.9%].
Why should we assume that the deceptive models will be sufficiently misaligned with each other such that this will not be an issue? Do you have intuitions about the degree of misalignment between huge neural networks that were trained to imitate demonstrators but ended up being consequentialists that care about the state of our universe?
It’s quite the endorsement to be called the person most likely to get things right.
I couldn't find such an endorsement in Scott Alexander's linked post. The closest thing I could find was:
I can't tell you how many times over the past year all the experts, the CDC, the WHO, the New York Times, et cetera, have said something (or been silent about something in a suggestive way), and then some blogger I trusted said the opposite, and the blogger turned out to be right. I realize this kind of thing is vulnerable to selection bias, but it's been the same couple of bloggers throughout, people who I already trusted and already suspected might be better than the experts in a lot of ways. Zvi Mowshowitz is the first name to come to mind, though there are many others.
If I'm missing something please let me know. I downvoted the OP and wrote this comment because I think and feel that such inaccuracies are bad (even if not intentional) and I don't want them to occur on LW.
I think it's worth checking whether the manufacturer's website supports some verification procedure (in which the customer types in some unique code that appears on the respirator). Consider googling the term: [manufacturer name] validation.
The topic of risks related to morally relevant computations seems very important, and I hope a lot more work will be done on it!
My tentative intuition is that learning is not directly involved here. If the weights of a trained RL agent are no longer being updated after some point, my intuition is that the model is similarly likely to experience pain before and after that point (assuming the environment stays the same).
Consider the following hypothesis which does not involve a direct relationship between learning and pain: In sufficiently large scale (and complex environments), TD learning tends to create components within the network, call them "evaluators", that evaluate certain metrics that correlate with expected return. In practice the model is trained to optimize directly for the output of the evaluators (and maximizing the output of the evaluators becomes the mesa objective). Suppose we label possible outputs of the evaluators with "pain" and "pleasure". We get something that seems analogous to humans. A human cares directly about pleasure and pain (which are things that correlated with expected evolutionary fitness in the ancestral environment), even when those things don't affect their evolutionary fitness accordingly (e.g. pleasure from eating chocolate, and pain from getting a vaccine shot).
In TD learning, if from some point the model always perfectly predicted the future, the gradient would always be zero and no weights would be updated. Also, if an already-trained RL agent is being deployed, and there's no longer reinforcement learning going on after deployment (which seems like a plausible setup in products/services that companies sell to customers), the weights would obviously not be updated.
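As a minimal illustration of the first point, here is a tabular TD(0) value update; when the current prediction is already exact, the TD error is zero and the update is a no-op. (The function and variable names are illustrative, not from any particular codebase.)

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular TD(0) step: V[s] += alpha * td_error, where
    td_error = r + gamma * V[s_next] - V[s]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

# If the value table already predicts the return exactly,
# the TD error is zero and V is left unchanged:
V = {"s0": 0.99, "s1": 1.0}  # V["s0"] == 0 + 0.99 * V["s1"]
err = td0_update(V, "s0", r=0.0, s_next="s1")
assert err == 0.0 and V["s0"] == 0.99
```

The same logic carries over to the function-approximation case: the gradient of the TD loss is proportional to the TD error, so a perfect predictor receives zero weight updates.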
My understanding is that the 2020 algorithms in Ajeya Cotra's draft report refer to algorithms that train a neural network on a given architecture (rather than algorithms that search for a good neural architecture etc.). So the only "special sauce" that can be found by such algorithms is one that corresponds to special weights of a network (rather than special architectures etc.).
We don't need the model to use so much optimization power that it breaks the operator. We just need it to perform roughly at human-level, and then we can deploy many instances of the trained model and accomplish very useful things (e.g. via factored cognition).
So I think it's important to also note that getting a neural network to "perform roughly at human-level in an aligned manner" may be a much harder task than getting a neural network to achieve maximal rating by breaking the operator. The former may be a much narrower target. This point is closely related to what you wrote here in the context of amplification:
Speaking of inexact imitation: It seems to me that having an AI output a high-fidelity imitation of human behavior, sufficiently high-fidelity to preserve properties like "being smart" and "being a good person" and "still being a good person under some odd strains like being assembled into an enormous Chinese Room Bureaucracy", is a pretty huge ask.
It seems to me obvious, though this is the sort of point where I've been surprised about what other people don't consider obvious, that in general exact imitation is a bigger ask than superior capability. Building a Go player that imitates Shuusaku's Go play so well that a scholar couldn't tell the difference, is a bigger ask than building a Go player that could defeat Shuusaku in a match. A human is much smarter than a pocket calculator but would still be unable to imitate one without using a paper and pencil; to imitate the pocket calculator you need all of the pocket calculator's abilities in addition to your own.
Correspondingly, a realistic AI we build that literally passes the strong version of the Turing Test would probably have to be much smarter than the other humans in the test, probably smarter than any human on Earth, because it would have to possess all the human capabilities in addition to its own. Or at least all the human capabilities that can be exhibited to another human over the course of however long the Turing Test lasts. [...]
It does seem useful to make the distinction between thinking about what gradient hacking failures look like in worlds where they cause an existential catastrophe, and thinking about how to best pursue empirical research today about gradient hacking.
Some of the networks that have an accurate model of the training process will stumble upon the strategy of failing hard if SGD would reward any other competing network
I think the part in bold should instead be something like "failing hard if SGD would (not) update weights in such and such way". (SGD is a local search algorithm; it gradually improves a single network.)
This strategy seems more complicated, so is less likely to randomly exist in a network, but it is very strongly selected for, since at least from an evolutionary perspective it appears like it would give the network a substantive advantage.
As I already argued in another thread, the idea is not that SGD creates the gradient hacking logic specifically (in case this is what you had in mind here). As an analogy, consider a human that decides to 1-box in Newcomb's problem (which is related to the idea of gradient hacking, because the human decides to 1-box in order to have the property of "being a person that 1-boxes", because having that property is instrumentally useful). The specific strategy to 1-box is not selected for by human evolution, but rather general problem-solving capabilities were (and those capabilities resulted in the human coming up with the 1-box strategy).
My point was that there's no reason that SGD will create specifically "deceptive logic" because "deceptive logic" is not privileged over any other logic that involves modeling the base objective and acting according to it. But I now think this isn't always true - see the edit block I just added.
"deceptive logic" is probably a pretty useful thing in general for the model, because it helps improve performance as measured through the base-objective.
But you can similarly say this for the following logic: "check whether 1+1<4 and if so, act according to the base objective". Why is SGD more likely to create "deceptive logic" than this simpler logic (or any other similar logic)?
[EDIT: actually, this argument doesn't work in a setup where the base objective corresponds to a sufficiently long time horizon during which it is possible for humans to detect misalignment and terminate/modify the model (in a way that is harmful with respect to the base objective).]
So my understanding is that deceptive behavior is a lot more likely to arise from general-problem-solving logic, rather than SGD directly creating "deceptive logic" specifically.
I think that if SGD makes the model slightly deceptive it's because it made the model slightly more capable (better at general problem solving etc.), which allowed the model to "figure out" (during inference) that acting in a certain deceptive way is beneficial with respect to the mesa-objective.
This seems to me a lot more likely than SGD creating specifically "deceptive logic" (i.e. logic that can't do anything generally useful other than finding ways to perform better on the mesa-objective by being deceptive).
The less philosophical approach to this problem is to notice that the appearance of gradient hacking would probably come from the training stumbling on a gradient hacker.
[EDIT: you may have already meant it this way, but...] The optimization algorithm (e.g. SGD) doesn't need to stumble upon the specific logic of gradient hacking (which seems very unlikely). I think the idea is that a sufficiently capable agent (with a goal system that involves our world) instrumentally decides to use gradient hacking, because otherwise the agent will be modified in a suboptimal manner with respect to its current goal system.
Early work tends to be less relevant in the context of modern machine learning
I'm curious why you think the orthogonality thesis, instrumental convergence, the treacherous turn or Goodhart's law arguments are less relevant in the context of modern machine learning. (We can use here Facebook's feed-creation-algorithm as an example of modern machine learning, for the sake of concreteness.)
Thank you for writing this up! This topic seems extremely important and I strongly agree with the core arguments here.
I propose the following addition to the list of things we care about when it comes to takeoff dynamics, or when it comes to defining slow(er) takeoff:
Foreseeability: No one creates an AI with a transformative capability X at a time when most actors (weighted by influence) believe it is very unlikely that an AI with capability X will be created within a year.
Perhaps this should replace (or be merged with) the "warning shots" entry in the list. (As an aside, I think the term "warning shot" doesn't fit, because the original term refers to an action that is carried out for the purpose of communicating a threat.)
And the other part of the core idea is that that's implausible.
I don't see why that's implausible. The condition I gave is also my explanation for why the EMH holds (in markets where it does), and it doesn't explain why big corporations should be good at predicting AGI.
it's in their self-interest (at least, given their lack of concern for AI risk) to pursue it aggressively
So the questions I'm curious about here are:
What mechanism is supposed to cause big corporations to be good at predicting AGI?
How come that mechanism doesn't also cause big corporations to understand the existential risk concerns?
(I'm not an economist but my understanding is that...) The EMH works in markets that fulfill the following condition: If Alice is way better than the market at predicting future prices, she can use her superior prediction capability to gain more and more control over the market, until the point where her control over the market makes the market prices reflect her prediction capability.
If Alice is way better than anyone else at predicting AGI, how can she use her superior prediction capability to gain more control over big corporations? I don't see how an EMH-based argument applies here.
The incentives of online dating service companies are ridiculously misaligned with their users'. (For users who are looking for a monogamous, long-term relationship.)
A "match" between two users that results in them both leaving the platform for good is a super-negative outcome with respect to the metrics that the company is probably optimizing for. They probably use machine learning models to decide which "candidates" to show a user at any given time, and they are incentivized to train these models to avoid matches that cause users to leave their platform for good. (And these models may be way better at predicting such matches than any human).
Instrumental convergence is a very simple idea that I understand very well, and yet I failed to understand this paper (after spending hours on it) [EDIT: and also the post], so I'm worried about using it for the purpose of 'standing up to more intense outside scrutiny'. (Though it's plausible I'm just an outlier here.)
I have quite a different intuition on this, and I'm curious if you have a particular justification for expecting non-simulated training for multi-agent problems.
In certain domains, there are very strong economic incentives to train agents that will act in a real-world multi-agent environment, where the ability to simulate the environment is limited (e.g. trading in stock markets and choosing content for social media users).