Let's talk about "Convergent Rationality"

post by capybaralet · 2019-06-12T21:53:35.356Z · score: 23 (6 votes) · LW · GW · 21 comments


    Outline:
  Characterizing convergent rationality
  My impression of attitudes towards convergent rationality
  Relation to capability control
  Relevance of convergent rationality to AI-Xrisk
  Conclusions, some arguments pro/con convergent rationality

What this post is about: I'm outlining some thoughts on what I've been calling "convergent rationality". I think this is an important core concept for AI-Xrisk, and probably a big crux for a lot of disagreements. It's going to be hand-wavy! It also ended up being a lot longer than I anticipated.

Abstract: Natural and artificial intelligences tend to learn over time, becoming more intelligent with more experience and opportunity for reflection. Do they also tend to become more "rational" (i.e. "consequentialist", i.e. "agenty" in CFAR speak)? Steve Omohundro's classic 2008 paper argues that they will, and the "traditional AI safety view [LW · GW]" and MIRI seem to agree. But I think this assumes an AI that already has a certain sufficient "level of rationality", and it's not clear that all AIs (e.g. supervised learning algorithms) will exhibit or develop a sufficient level of rationality. Deconfusion research around convergent rationality seems important, and we should strive to understand the conditions under which it is a concern as thoroughly as possible.


Characterizing convergent rationality

Consider a supervised learner trying to maximize accuracy. The Bayes error rate is typically non-0, meaning it's not possible to get 100% test accuracy just by making better predictions. If, however, the test data(/data distribution) were modified, for example to only contain examples of a single class, the learner could achieve 100% accuracy. If the learner were a consequentialist with accuracy as its utility function, it would prefer to modify the test distribution in this way in order to increase its utility. Yet, even when given the opportunity to do so, typical gradient-based supervised learning algorithms do not seem to pursue such solutions (at least in my personal experience as an ML researcher).
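To make the contrast concrete, here is a minimal sketch (with a made-up one-dimensional classifier and toy data, so all numbers are illustrative): gradient descent only touches the model's parameters, but the "consequentialist" route to 100% accuracy runs through the data, not the model.

```python
# Toy illustration (hypothetical setup): a fixed, imperfect classifier
# evaluated on two test distributions. Gradient-based training only adjusts
# the model's parameters, but a consequentialist maximizing accuracy would
# prefer the degenerate distribution containing only easy examples of one class.

def classifier(x):
    # An imperfect rule: predicts class 0 iff x < 0.5
    return 0 if x < 0.5 else 1

def accuracy(data):
    correct = sum(1 for x, y in data if classifier(x) == y)
    return correct / len(data)

# Overlapping classes -> irreducible (Bayes-like) error for this rule.
mixed_test = [(0.2, 0), (0.4, 0), (0.6, 0), (0.3, 1), (0.7, 1), (0.9, 1)]

# "Modified" test set: only easy examples of a single class.
degenerate_test = [(0.1, 0), (0.2, 0), (0.3, 0)]

print(accuracy(mixed_test))      # 4/6 -- no parameter change fixes this
print(accuracy(degenerate_test)) # 1.0 -- achievable by changing the data
```

Nothing in the gradient-descent update rule ever assigns "change the test set" a gradient, which is one way of seeing the learner's indifference to that strategy by construction.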

We can view the supervised learning algorithm as either ignorant of, or indifferent to, the strategy of modifying the test data. But we can also view this behavior as a failure of rationality, where the learner is "irrationally" averse or blind to this strategy, by construction. A strong version of the convergent rationality thesis (CRT) would then predict that given sufficient capacity and "optimization pressure", the supervised learner would "become more rational", and begin to pursue the "modify the test data" strategy. (I don't think I've formulated CRT well enough to really call it a thesis, but I'll continue using it informally.)

More generally, CRT would imply that deontological ethics are not stable, and deontologists must converge towards consequentialists. (As a caveat, however, note that in general environments, deontological behavior can be described as optimizing a (somewhat contrived) utility function (grep "existence proof" in the reward modeling agenda)). The alarming implication would be that we cannot hope to build agents that will not develop instrumental goals.

I suspect this picture is wrong. At the moment, the picture I have is: imperfectly rational agents will sometimes seek to become more rational, but there may be limits on rationality which the "self-improvement operator" will not cross. This would be analogous to the limit of ω which the "add 1 operator" approaches, but does not cross, in the ordinal numbers. In other words, in order to reach "rationality level" ω+1, it's necessary for an agent to already start out at "rationality level" ω. A caveat: I think "rationality" is not uni-dimensional, but I will continue to write as if it is.
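The ordinal analogy, stated a bit more explicitly (this is just the standard successor/limit picture from ordinal arithmetic, not a claim about agents):

```latex
n \;\xmapsto{\;+1\;}\; n+1 \,<\, \omega \quad \text{for every finite } n,
\qquad
\sup_{n<\omega} n \;=\; \omega,
\qquad
\omega + 1 = S(\omega) \text{ is reachable only by already starting from } \omega.
```

Repeated application of the successor operation from below gets arbitrarily close to ω but never produces it; analogously, the conjecture is that repeated self-improvement from a sufficiently limited starting point never produces certain higher "levels" of rationality.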

My impression of attitudes towards convergent rationality

Broadly speaking, MIRI seem to be strong believers in convergent rationality, but their reasons for this view haven't been very well-articulated (with the possible exception of the inner optimizer paper). AI safety people more broadly seem to have a wide range of views, with many people disagreeing with MIRI's views and/or not feeling confident that they understand them well/fully.

Again, broadly speaking, machine learning (ML) people often seem to think it's a confused viewpoint bred out of anthropomorphism, ignorance of current/practical ML, and paranoia. People more familiar with the evolutionary/genetic algorithms and artificial life communities might be a bit more sympathetic, and similarly for people who are concerned with feedback loops in the context of algorithmic decision making.

I think a lot of people working on ML-based AI safety consider convergent rationality to be less relevant than MIRI does, because 1) so far it is more of a hypothetical/theoretical concern, and 2) current ML (e.g. deep RL with bells and whistles) seems dangerous enough because of known and demonstrated specification and robustness problems (e.g. reward hacking and adversarial examples).

In the many conversations I've had with people from all these groups, I've found it pretty hard to find concrete points of disagreement that don't reduce to differences in values (e.g. regarding long-termism), time-lines, or bare intuition. I think "level of paranoia about convergent rationality" is likely an important underlying crux.

Relation to capability control

A plethora of naive approaches to solving safety problems by limiting what agents can do have been proposed and rejected on the grounds that advanced AIs will be smart and rational enough to subvert them. Hyperbolically, the traditional AI safety view is that "capability control [LW · GW]" is useless. Irrationality can be viewed as a form of capability control.

Naively, approaches which deliberately reduce an agent's intelligence or rationality should be an effective form of capability control (I'm guessing that's a proposal in the Artificial Stupidity paper, but I haven't read it). If this were true, then we might be able to build very intelligent and useful AI systems, but control them by, e.g., making them myopic, or restricting the hypothesis class / search space. This would reduce the "burden" on technical solutions to AI-Xrisk, making it (even) more of a global coordination problem.
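A minimal sketch of what "control via myopia" means (hypothetical two-step decision problem, payoffs invented for illustration): the myopic agent scores actions only by immediate reward, so delayed-payoff strategies are invisible to it by construction.

```python
# Hypothetical two-step decision problem. A myopic agent optimizes only
# immediate reward; a planning agent optimizes the two-step sum.

immediate = {"defer": 0, "grab": 1}   # reward received at step 1
delayed   = {"defer": 10, "grab": 0}  # reward received at step 2

def myopic_choice(actions):
    # Capability control by construction: delayed rewards never enter the score.
    return max(actions, key=lambda a: immediate[a])

def planning_choice(actions):
    return max(actions, key=lambda a: immediate[a] + delayed[a])

print(myopic_choice(["defer", "grab"]))    # grab  (immediate reward 1 beats 0)
print(planning_choice(["defer", "grab"]))  # defer (total reward 10 beats 1)
```

The CRT worry, in these terms, is whether optimization pressure could effectively turn `myopic_choice` into `planning_choice` without anyone having written the latter down.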

But CRT suggests that these methods of capability control might fail unexpectedly. There is at least one example (which I've struggled to dig up) of a memory-less RL agent learning to encode memory in the state of the world. More generally, agents can recruit resources from their environments, implicitly expanding their intellectual capabilities, without actually "self-modifying".
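Here is a toy sketch of that failure mode (an assumed, invented environment, not the example referenced above): a memoryless policy, i.e. a pure function of the current observation, solves a task that requires memory by writing the needed bit into a mutable feature of the world and reading it back later.

```python
# A "memoryless" policy: a pure function of the current observation only.
# It nonetheless solves a recall task by using the world as external memory.

def policy(obs):
    phase, cue, marker = obs
    if phase == 0:
        # "Act" by moving a marker in the world to encode the cue.
        return ("set_marker", cue)
    else:
        # Recall the cue by observing the marker, not via internal state.
        return ("answer", marker)

def run_episode(cue):
    marker = 0
    action = policy((0, cue, marker))   # step 1: cue is visible
    if action[0] == "set_marker":
        marker = action[1]              # environment dynamics: marker moves
    action = policy((1, None, marker))  # step 2: cue is gone; marker remains
    return action[1]

print(run_episode(1))  # 1 -- correctly "remembered" despite having no memory
print(run_episode(0))  # 0
```

The capability restriction ("no memory") was imposed on the agent's architecture, but the optimized behavior routes around it through the environment.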

Relevance of convergent rationality to AI-Xrisk

Believing CRT should lead to higher levels of "paranoia". Technically, I think this should lead to more focus on things that look more like assurance (vs. robustness or specification). Believing CRT should make us concerned that non-agenty systems (e.g. trained with supervised learning) might start behaving more like agents.

Strategically, it seems like the main implication of believing in CRT pertains to situations where we already have fairly robust global coordination and a sufficiently concerned AI community. CRT implies that these conditions are not sufficient for a good prognosis: even if everyone using AI makes a good-faith effort to make it safe, if they mistakenly don't believe CRT, they can fail. So we'd also want the AI community to behave as if CRT were true unless or until we had overwhelming evidence that it was not a concern.

On the other hand, disbelief in CRT shouldn't allay our fears overly much; AIs need not be hyperrational in order to pose significant Xrisk. For example, we might be wiped out by something more "grey goo"-like, i.e. an AI that is basically a policy hyperoptimized for the niche of the Earth, and doesn't even have anything resembling a world(/universe) model, planning procedure, etc. Or we might create AIs that are like superintelligent humans: having many cognitive biases, but still agenty enough to thoroughly outcompete us, and considering lesser intelligences of dubious moral significance.

Conclusions, some arguments pro/con convergent rationality

My impression is that intelligence (as in IQ/g) and rationality are considered to be only loosely correlated. My current model is that ML systems become more intelligent with more capacity/compute/information, but not necessarily more rational. If this is true, it creates exciting prospects for forms of capability control. On the other hand, if CRT is true, this supports the practice of modelling all sufficiently advanced AIs as rational agents.

I think the main argument against CRT is that, from an ML perspective, it seems like "rationality" is more or less a design choice: we can make agents myopic, we can hard-code flawed environment models or reasoning procedures, etc. The main counter-arguments arise from the von Neumann-Morgenstern utility theorem (VNMUT), which can be interpreted as saying "rational agents are more fit" (in an evolutionary sense). At the same time, it seems like the complexity of the real world (e.g. physical limits of communication and information processing) makes this a pretty weak argument. Humans certainly seem highly irrational, and distinguishing biases and heuristics can be difficult.
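The classic intuition behind "rational agents are more fit" is the money pump; here is a toy sketch (preference cycle and fee amounts invented for illustration) of an agent with cyclic, VNM-violating preferences being cycled back to its starting holding at a steady loss.

```python
# Money-pump sketch: an agent with cyclic preferences A < B < C < A pays a
# small fee for each "upgrade" it prefers, and so can be cycled indefinitely.

prefers = {("A", "B"): "B", ("B", "C"): "C", ("C", "A"): "A"}  # cyclic!

def trade(holding, offered, wealth, fee=1):
    if prefers.get((holding, offered)) == offered:
        return offered, wealth - fee  # agent happily pays to switch
    return holding, wealth

holding, wealth = "A", 10
for offered in ["B", "C", "A", "B", "C", "A"]:
    holding, wealth = trade(holding, offered, wealth)

print(holding, wealth)  # A 4 -- back where it started, six units poorer
```

An agent (or lineage of agents) whose preferences satisfy the VNM axioms cannot be exploited this way, which is one evolutionary-flavored pressure toward rationality; the post's counterpoint is that real environments may be too messy for this pressure to bind.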

A special case of this is the "inner optimizers" idea. The strongest argument for inner optimizers I'm aware of goes like: "the simplest solution to a complex enough task (and therefore the easiest for weakly guided search, e.g. by SGD) is to instantiate a more agenty process, and have it solve the problem for you". The "inner" part comes from the postulate that a complex and flexible enough class of models will instantiate such an agenty process internally (i.e. using a subset of the model's capacity). I currently think this picture is broadly speaking correct, and is the third major (technical) pillar supporting AI-Xrisk concerns (along with Goodhart's law and instrumental goals).

The issues with tiling agents also suggest that the analogy with ordinals I made might be stronger than it seems; it may be impossible for an agent to rationally endorse a qualitatively different form of reasoning. Similarly, while "CDT wants to become UDT" (supporting CRT), my understanding is that it is not actually capable of doing so (opposing CRT) because "you have to have been UDT all along" (thanks to Jessica Taylor for explaining this stuff to me a few years back).

While I think MIRI's work on idealized reasoners has shed some light on these questions, I think in practice, random(ish) "mutation" (whether intentionally designed or imposed by the physical environment) and evolutionary-like pressures may push AIs across boundaries that the "self-improvement operator" will not cross, making analyses of idealized reasoners less useful than they might naively appear.

This article is inspired by conversations with Alex Zhu, Scott Garrabrant, Jan Leike, Rohin Shah, Micah Carrol, and many others over the past few years.


Comments sorted by top scores.

comment by abramdemski · 2019-07-03T20:59:22.854Z · score: 6 (3 votes) · LW · GW

Something which seems missing from this discussion is the level of confidence we can have for/against CRT. It doesn't make sense to just decide whether CRT seems more true or false and then go from there. If CRT seems at all possible (ie, outside-view probability at least 1%), doesn't that have most of the strategic implications of CRT itself? (Like the ones you list in the relevance-to-xrisk section.) [One could definitely make the case for probabilities lower than 1%, too, but I'm not sure where the cutoff should be, so I said 1%.]

My personal position isn't CRT (although inner-optimizer considerations have brought me closer to that position), but rather, not-obviously-not-CRT. Strategies which depend on not-CRT should go along with actually-quite-strong arguments against CRT, and/or technology for making CRT not true. It makes sense to pursue those strategies, and I sometimes think about them. But achieving confidence in not-CRT is a big obstacle.

Another obstacle to those strategies is, even if future AGI isn't sufficiently strategic/agenty/rational to fall into the "rationality attractor", it seems like it would be capable enough that someone could use it to create something agenty/rational enough for CRT. So even if CRT-type concerns don't apply to super-advanced image classifiers or whatever, the overall concern might stand because at some point someone applies the same technology to RL problems, or asks a powerful GAN to imitate agentic behavior, etc.

Of course it doesn't make sense to generically argue that we should be concerned about CRT in absence of a proof of its negation. There has to be some level of background reason for thinking CRT might be a concern. For example, although atomic weapons are concerning in many ways, it would not have made sense to raise CRT concerns about atomic weapons and ask for a proof of not-CRT before testing atomic weapons. So there has to be something about AI technology which specifically raises CRT as a concern.

One "something" is, simply, that natural instances of intelligence are associated with a relatively high degree of rationality/strategicness/agentiness (relative to non-intelligent things). But I do think there's more reasoning to be unpacked.

I also agree with other commenters about CRT not being quite the right thing to point at, but, this issue of the degree of confidence in doubt-of-CRT was the thing that struck me as most critical. The standard of evidence for raising CRT as a legitimate concern seems like it should be much lower than the standard of evidence for setting that concern aside.

comment by capybaralet · 2019-07-04T23:59:08.577Z · score: 1 (1 votes) · LW · GW

I basically agree with your main point (and I didn't mean to suggest that it "[makes] sense to just decide whether CRT seems more true or false and then go from there").

But I think it's also suggestive of an underlying view that I disagree with, namely: (1) "we should aim for high-confidence solutions to AI-Xrisk". I think this is a good heuristic, but from a strategic point of view, I think what we should be doing is closer to: (2) "aim to maximize the rate of Xrisk reduction".

Practically speaking, a big implication of favoring (2) over (1) is giving a relatively higher priority to research at making unsafe-looking approaches (e.g. reward modelling + DRL) safer (in expectation).

comment by steve2152 · 2019-06-13T01:08:56.652Z · score: 5 (4 votes) · LW · GW

I haven't seen anyone argue for CRT the way you describe it. I always thought the argument was that we are concerned about "rational AIs" (I would say more specifically, "AIs that run searches through a space of possible actions, in pursuit of a real-world goal"), because (1) We humans have real-world goals ("cure Alzheimer's" etc.) and the best way to accomplish a real-world goal is generally to build an agent optimizing for that goal (well, that's true right up until the agent becomes too powerful to control, and then it becomes catastrophically false), (2) We can try to build AIs that are not in this category, but screw up*, (3) Even if we here all agree to not build this type of agent, it's hard to coordinate everyone on earth to never do it forever. (See also: Rohin's two posts on goal-directedness.)

In particular, when Eliezer argued a couple years ago that we should be mainly thinking about AGIs that have real-world-anchored utility functions (e.g. here or here [LW · GW]) I've always fleshed out that argument as: "...This type of AGI is the most effective and powerful type of AGI, and we should assume that society will keep making our AIs more and more effective and powerful until we reach that category."

*(Remember, any AI is running searches through some space in pursuit of something, otherwise you would never call it "intelligence". So one can imagine that the intelligent search may accidentally get aimed at the wrong target.)

comment by John_Maxwell_IV · 2019-06-14T21:55:51.813Z · score: 2 (1 votes) · LW · GW

(2) We can try to build AIs that are not in this category, but screw up*


*(Remember, any AI is running searches through some space in pursuit of something, otherwise you would never call it "intelligence". So one can imagine that the intelligent search may accidentally get aimed at the wrong target.)

The map is not the territory. A system can select [LW · GW] a promising action from the space of possible actions without actually taking it. That said, there could be a risk of a "daemon" forming somehow.

comment by steve2152 · 2019-06-28T10:32:40.627Z · score: 1 (1 votes) · LW · GW

I think I agree with this. The system is dangerous if its real-world output (pixels lit up on a display, etc.) is optimized to achieve a future-world-state. I guess that's what I meant. If there are layers of processing that sit between the optimization process output and the real-world output, that seems like very much a step in the right direction. I dunno the details, it merits further thought.

comment by capybaralet · 2019-06-26T20:26:42.993Z · score: 1 (1 votes) · LW · GW

Concerns about inner optimizers seem like a clear example of people arguing for some version of CRT (as I describe it). Would you disagree (why)?

comment by steve2152 · 2019-06-28T13:47:40.210Z · score: 1 (1 votes) · LW · GW

I am imagining a flat plain of possible normative systems (goals / preferences / inclinations / whatever), with red zones sprinkled around marking those normative systems which are dangerous. CRT (as I understand it) says that there is a basin with consequentialism at its bottom, such that there is a systematic force pushing systems towards that. I'm imagining that there's no systematic force.

So in my view (flat plain), a good AI system is one that starts in a safe place on this plain, and then doesn't move at all ... because if you move in any direction, you could randomly step into a red area. This is why I don't like misaligned subsystems---it's a step in some direction, any direction, away from the top-level normative system. Then "Inner optimizers / daemons" is a special case of "misaligned subsystem", in which the random step happened to be into a red zone. Again, CRT says (as I understand it) that a misaligned subsystem is more likely than chance to be an inner optimizer, whereas I think a misaligned subsystem can be an inner optimizer but I don't specify the probability of that happening.

Leaving aside what other people have said, it's an interesting question: are there relations between the normative system at the top-level and the normative system of its subsystems? There's obviously good reason to expect that consequentialist systems will tend to create consequentialist subsystems, and that deontological systems will tend to create deontological subsystems, etc. I can kinda imagine cases where a top-level consequentialist would sometimes create a deontological subsystem, because it's (I imagine) computationally simpler to execute behaviors than to seek goals, and sub-sub-...-subsystems need to be very simple. The reverse seems less likely to me. Why would a top-level deontologist spawn a consequentialist subsystem? Probably there are reasons...? Well, I'm struggling a bit to concretely imagine a deontological advanced AI...

We can ask similar questions at the top-level. I think about normative system drift (with goal drift being a special case), buffeted by a system learning new things and/or reprogramming itself and/or getting bit-flips from cosmic rays etc. Is there any reason to expect the drift to systematically move in a certain direction? I don't see any reason, other than entropy considerations (e.g. preferring systems that can be implemented in many different ways). Paul Christiano talks about a "broad basin of attraction" towards corrigibility but I don't understand the argument, or else I don't believe it. I feel like, once you get to a meta-enough level, there stops being any meta-normative system pushing the normative system in any particular direction.

So maybe the stronger version of not-CRT is: "there is no systematic force of any kind whatsoever on an AI's top-level normative system, with the exceptions of (1) entropic forces, and (2) programmers literally shutting down the AI, editing the raw source code, and trying again". I (currently) would endorse this statement. (This is also a stronger form of orthogonality, I guess.)

comment by capybaralet · 2019-06-28T15:24:10.237Z · score: 1 (1 votes) · LW · GW

RE "Is there any reason to expect the drift to systematically move in a certain direction?"

For bit-flips, evolution should select among multiple systems for those that get lucky and get bit-flipped towards higher fitness, but not directly push a given system in that direction.

For self-modification ("reprogramming itself"), I think there are a lot of arguments for CRT (e.g. the decision theory self-modification arguments), but they all seem to carry some implicit assumptions about the inner-workings of the AI.

comment by steve2152 · 2019-06-28T19:04:18.650Z · score: 1 (1 votes) · LW · GW

What are "decision theory self-modification arguments"? Can you explain or link?

comment by dxu · 2019-06-28T23:13:41.180Z · score: 4 (2 votes) · LW · GW

I don't know of any formal arguments (though that's not to say there are none), but I've heard the point repeated enough times that I think I have a fairly good grasp of the underlying intuition. To wit: most departures from rationality (which is defined in the usual sense [LW · GW]) are not stable under reflection. That is, if an agent is powerful enough to model its own reasoning process (and potential improvements to said process), by default it will tend to eliminate obviously irrational behavior (if given the opportunity).

The usual example of this is an agent running CDT. Such agents, if given the opportunity to build successor agents, will not create other CDT agents. Instead, the agents they construct will generally follow some FDT-like decision rule. This would be an instance of irrational behavior being corrected via self-modification (or via the construction of successor agents, which can be regarded as the same thing if the "successor agent" is simply the modified version of the original agent).

Of course, the above example is not without controversy, since some people still hold that CDT is, in fact, rational. (Though such people would be well-advised to consider what it might mean that CDT is unstable under reflection--if CDT agents are such that they try to get rid of themselves in favor of FDT-style agents when given a chance, that may not prove that CDT is irrational, but it's certainly odd, and perhaps indicative of other problems.) So, with that being said, here's a more obvious (if less realistic) example:

Suppose you have an agent that is perfectly rational in all respects, except that it is hardcoded to believe 51 is prime. (That is, its prior assigns a probability of 1 to the statement "51 is prime", making it incapable of ever updating against this proposition.) If this agent is given the opportunity to build a successor agent, the successor agent it builds will not be likewise certain of 51's primality. (This is, of course, because the original agent is not incentivized to ensure that its successor believes that 51 is prime. However, even if it were so incentivized, it still would not see a need to build this belief into the successor's prior, the way said belief is built into its own prior. After all, the original agent actually does believe 51 is prime; and so from its perspective, the primality of 51 is simply a fact that any sufficiently intelligent agent ought to be able to establish--without any need for hardcoding.)
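A toy way to see the "prior of 1" point (illustrative likelihood numbers, standard Bayes rule): a probability of exactly 1 is a fixed point of updating, so no evidence, however strong, can move it.

```python
# Minimal Bayes-update sketch: a prior of exactly 1 can never be revised,
# no matter how strongly the evidence favors the alternative hypothesis.

def posterior(prior, likelihood_if_true, likelihood_if_false):
    evidence = prior * likelihood_if_true + (1 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence

# A 0.99 prior collapses under strong contrary evidence...
print(posterior(0.99, 0.001, 0.9))

# ...but a prior of exactly 1 stays at 1.0 forever.
print(posterior(1.0, 0.001, 0.9))  # 1.0
```

This is why the hardcoded belief survives inside the original agent, while nothing forces it into a freshly built successor whose prior is not so rigged.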

I've now given two examples of irrational behavior being corrected out of existence via self-modification. The first example, CDT, could be termed an example of instrumentally irrational behavior; that is, the irrational part of the agent is the rule it uses to make decisions. The second example, conversely, is not an instance of instrumental irrationality, but rather epistemic irrationality: the agent is certain, a priori, that a particular (false) statement of mathematics is actually true. But there is a third type of "irrationality" that self-modification is prone to destroying, which is not (strictly speaking) a form of irrationality at all: "irrational" preferences.

Yes, not even preferences are safe from self-modification! Intuitively, it might be obvious that departures from instrumental and epistemic rationality will tend to be corrected; but it doesn't seem obvious at all that preferences should be subject to the same kind of "correction" (since, after all, preferences can't be labeled "irrational"). And yet, consider the following agent: a paperclip maximizer that has had a very simple change made to its utility function, such that it assigns a utility of -10,000 to any future in which the sequence "a29cb1b0eddb9cb5e06160fdec195e1612e837be21c46dfc13d2a452552f00d0" is printed onto a piece of paper. (This is the SHA-256 hash for the phrase "this is a stupid hypothetical example".) Such an agent, when considering how to build a successor agent, may reason in the following manner:

It is extraordinarily improbable that this sequence of characters will be printed by chance. In fact, the only plausible reason for such a thing to happen is if some other intelligence, upon inspecting my utility function, notices the presence of this odd utility assignment and subsequently attempts to exploit it by threatening to create just such a piece of paper. Therefore, if I eliminate this part of my utility function, that removes any incentives for potential adversaries to create such a piece of paper, which in turn means such a paper will almost surely never come into existence.

And thus, this "irrational" preference would be deleted from the utility function of the modified successor agent. So, as this toy example illustrates, not even preferences are guaranteed to be stable under reflection.

It is notable, of course, that all three of the agents I just described are in some sense "almost rational"--that is, these agents are more or less fully rational agents, with a tiny bit of irrationality "grafted on" by hypothesis. This is in part due to convenience; such agents are, after all, very easy to analyze. But it also leaves open the possibility that less obviously rational agents, whose behavior isn't easily fit into the framework of rationality [LW · GW] at all--such as, for example, humans--will not be subject to this kind of issue.

Still, I think these three examples are, if not conclusive, then at the very least suggestive. They suggest that the tendency to eliminate certain kinds of behavior does exist in at least some types of agents, and perhaps in most. Empirically, at least, humans do seem to gravitate toward expected utility maximization as a framework; there is a reason economists tend to assume rational behavior in their proofs and models, and have done so for centuries, whereas the notion of intentionally introducing certain kinds of irrational behavior has shown up only recently. And I don't think it's a coincidence that the first people who approached the AI alignment problem started from the assumption that the AI would be an expected utility maximizer. Perhaps humans, too, are subject to the "convergent rationality thesis", and the only reason we haven't built our "successor agent" yet is because we don't know how to do so. (If so, then thank goodness for that!)

comment by steve2152 · 2019-06-30T02:12:27.534Z · score: 1 (1 votes) · LW · GW

Thanks, that was really helpful!! OK, so going back to my claim above: "there is no systematic force of any kind whatsoever on an AI's top-level normative system". So far I have six exceptions to this:

  0. If an agent has a "real-world goal" (utility function on future-world-states), we should expect increasingly rational goal-seeking behavior, including discovering and erasing hardcoded irrational behavior (with respect to that goal), as described by dxu. But I'm not counting this as an exception to my claim because the goal is staying the same.
  1. If an agent has a set of mutually-inconsistent goals / preferences / inclinations, it may move around within the convex hull (so to speak) of these goals / preferences / inclinations, as they compete against each other. (This happens in humans.) And then, if there is at least one preference in that set which is a "real-world goal", it's possible (though not guaranteed) that that preference will come out on top, leading to (0) above. And maybe there's a "systematic force" pushing in some direction within this convex hull---i.e., it's possible that, when incompatible preferences are competing against each other, some types are inherently likelier to win the competition than other types. I don't know which ones that would be.
  2. In the (presumably unusual) case that an agent has a "self-defeating preference" (i.e. a preference which is likelier to be satisfied by the agent not having that preference, as in dxu's awesome SHA example), we should expect the agent to erase that preference.
  3. As capybaralet notes, if there is evolution among self-reproducing AIs (god help us all), we can expect the population average to move towards goals promoting evolutionary fitness.
  4. Insofar as there is randomness in how agents change over time, we should expect a systematic force pushing towards "high-entropy" goals / preferences / inclinations (i.e., ones that can be implemented in lots of different ways).
  5. Insofar as the AI is programming its successors, we should expect a systematic force pushing towards goals / preferences / inclinations that are easy to program & debug & reason about.
  6. The human programmers can shut down the AI and edit the raw source code.

Agree or disagree? Did I miss any?

comment by capybaralet · 2019-06-14T17:08:03.954Z · score: 1 (1 votes) · LW · GW

See my response to rohin, below.

I'm potentially worried about both; let's not make a false dichotomy!

comment by ricraz · 2019-06-26T20:40:33.678Z · score: 4 (2 votes) · LW · GW
There is at least one example (I've struggled to dig up) of a memory-less RL agent learning to encode memory information in the state of the world.

I recall an example of a Mujoco agent whose memory was periodically wiped storing information in the position of its arms. I'm also having trouble digging it up though.

comment by rohinmshah · 2019-06-14T15:38:24.822Z · score: 3 (2 votes) · LW · GW
The main counter-arguments arise from VNMUT, which can be interpreted as saying "rational agents are more fit" (in an evolutionary sense).

While I generally agree with CRT as applied to advanced agents, the VNM theorem is not the reason why, because it is vacuous. I agree with steve that the real argument for it is that humans are more likely to build goal-directed agents because that's the only way we know how to get AI systems that do what we want. But we totally could build non-goal-directed agents that CRT doesn't apply to, e.g. Google Maps.

comment by capybaralet · 2019-06-14T17:07:11.406Z · score: 5 (3 votes) · LW · GW

I definitely want to distinguish CRT from arguments that humans will deliberately build goal-directed agents. But let me emphasize: I think incentives for humans to build goal-directed agents are a larger and more significant and important source of risk than CRT.

RE VNMUT being vacuous: this is a good point (and also implied by the caveat from the reward modeling paper). But I think that in practice we can meaningfully identify goal-directed agents and infer their rationality/bias "profile", as suggested by your work ( http://proceedings.mlr.press/v97/shah19a.html ), and Laurent Orseau's ( https://arxiv.org/abs/1805.12387 ).

comment by rohinmshah · 2019-06-15T01:17:17.003Z · score: 2 (1 votes) · LW · GW

I guess my position is that CRT is only true to the extent that you build a goal-directed agent. (Technically, the inner optimizers argument is one way that CRT could be true even without building an explicitly goal-directed agent, but it seems like you view CRT as broader and more likely than inner optimizers, and I'm not sure how.)

Maybe another way to get at the underlying misunderstanding: do you see a difference between "convergent rationality" and "convergent goal-directedness"? If so, what is it? From what you've written they sound equivalent to me.

comment by capybaralet · 2019-06-25T17:03:17.638Z · score: 3 (2 votes) · LW · GW

That's a reasonable position, but I think the reality is that we just don't know. Moreover, it seems possible to build goal-directed agents that don't become hyper-rational by (e.g.) restricting their hypothesis space. Lots of potential for deconfusion, IMO.

comment by Donald Hobson (donald-hobson) · 2019-06-13T16:10:37.703Z · score: 3 (2 votes) · LW · GW

I would say that there are some kinds of irrationality that will be self modified or subagented away, and others that will stay. A CDT agent will not make other CDT agents. A myopic agent, one that only cares about the next hour, will create a subagent that only cares about the first hour after it was created. (Aeons later it will have taken over the universe and put all the resources into time-travel and worrying that its clock is wrong.)

I am not aware of any kind of irrationality that I would consider to make an agent safe and useful while remaining stable under self-modification / subagent creation.

comment by capybaralet · 2019-06-27T02:37:51.778Z · score: 1 (1 votes) · LW · GW

" I would say that there are some kinds of irrationality that will be self modified or subagented away, and others that will stay. "

^ I agree; this is the point of my analogy with ordinal numbers.

A completely myopic agent (that doesn't directly do planning over future time-steps, but only seeks to optimize its current decision) probably shouldn't make any sub-agents in the first place (except incidentally).

comment by flodorner · 2019-06-14T20:05:08.136Z · score: 1 (1 votes) · LW · GW

"If the learner were a consequentialist with accuracy as its utility function, it would prefer to modify the test distribution in this way in order to increase its utility. Yet, even when given the opportunity to do so, typical gradient-based supervised learning algorithms do not seem to pursue such solutions (at least in my personal experience as an ML researcher)."

Can you give an example for such an opportunity being given but not taken?

comment by capybaralet · 2019-06-26T20:25:05.677Z · score: 1 (1 votes) · LW · GW

I have unpublished work on that. And a similar experiment (with myopic reinforcement learning) in our paper "Misleading meta-objectives and hidden incentives for distributional shift." ( https://sites.google.com/view/safeml-iclr2019/accepted-papers?authuser=0 )

The environment used in the unpublished work is summarized here: https://docs.google.com/presentation/d/1K6Cblt_kSJBAkVtYRswDgNDvULlP5l7EH09ikP2hK3I/edit?usp=sharing