Comments

Comment by bmg on Inner Alignment: Explain like I'm 12 Edition · 2020-08-18T15:17:51.049Z · LW · GW

Since neural networks are universal function approximators, it is indeed the case that some of them will implement specific search algorithms.

I don't think this specific point is true. It seems to me like the difference between functions and algorithms is important. You can also approximate any function with a sufficiently large look-up table, but simply using a look-up table to choose actions doesn't involve search/planning.* In this regard, something like a feedforward neural network with frozen weights also doesn't seem importantly different than a look-up table to me.

One naive perspective: Systems like AlphaGo and MuZero do search, because they implement Monte-Carlo tree search algorithms, but if you were to remove their MCTS components then they simply wouldn't do search. Search algorithms can be used to update the weights of neural networks, but neural networks don't themselves do search.

I think this naive perspective may be wrong, because it's possible that recurrence is sufficient for search/planning processes to emerge (e.g. see this paper). But then, if that's true, I think that the power of recurrence is the important thing to emphasize, rather than the fact that neural networks are universal function approximators.

*I'm thinking of search algorithms as cognitive processes, rather than input-output behaviors (which could be produced via a wide range of possible algorithms). If you're thinking of them as behaviors, then my point no longer holds. Although I've interpreted the mesa-optimization paper (and most other discussions of mesa-optimization) as talking about cognitive processes.
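As a rough illustration of the function/algorithm distinction above, here is a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the toy game (remove 1 to 3 stones, whoever takes the last stone wins) and the names `search_policy`, `table_policy`, and `LOOKUP` do not come from the post or the mesa-optimization paper. The two policies have identical input-output behavior on the covered states, but only one of them runs a search when it picks a move.

```python
MAX_TAKE = 3  # hypothetical toy game: remove 1-3 stones; taking the last stone wins


def wins(stones: int) -> bool:
    """True if the player to move can force a win (exhaustive game-tree search)."""
    # With 0 stones there are no moves, so any() is False: the player to move has lost.
    return any(not wins(stones - take) for take in range(1, min(MAX_TAKE, stones) + 1))


def search_policy(stones: int) -> int:
    """Pick a move by searching the game tree at decision time."""
    for take in range(1, min(MAX_TAKE, stones) + 1):
        if not wins(stones - take):
            return take  # leaves the opponent in a losing position
    return 1  # no winning move exists; fall back to taking one stone


# Freeze the search policy's outputs into a table for a small range of states.
LOOKUP = {stones: search_policy(stones) for stones in range(1, 13)}


def table_policy(stones: int) -> int:
    """Pick a move by reading a stored answer; no search happens here."""
    return LOOKUP[stones]


if __name__ == "__main__":
    for stones in range(1, 13):
        print(stones, search_policy(stones), table_policy(stones))
```

If "search" names an input-output behavior, the two policies are indistinguishable on these states. If it names a cognitive process, only `search_policy` qualifies, which is the sense the footnote has in mind.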

Comment by bmg on Developmental Stages of GPTs · 2020-08-04T01:32:24.733Z · LW · GW

I do agree that OT and ICT by themselves, without any further premises like "AI safety is hard" and "The people building AI don't seem to take safety seriously, as evidenced by their public statements and their research allocation" and "we won't actually get many chances to fail and learn from our mistakes" does not establish more than, say, 1% credence in "AI will kill us all," if even that. But I think it would be a misreading of the classic texts to say that they were wrong or misleading because of this; probably if you went back in time and asked Bostrom right before he published the book whether he agrees with you re the implications of OT and ICT on their own, he would have completely agreed. And the text itself seems to agree.

I mostly agree with this. (I think, in responding to your initial comment, I sort of glossed over "and various other premises"). Superintelligence and other classic presentations of AI risk definitely offer additional arguments/considerations. The likelihood of extremely discontinuous/localized progress is, of course, the most prominent one.

I think that "discontinuity + OT + ICT," rather than "OT + ICT" alone, has typically been presented as the core of the argument. For example, the extended summary passage from Superintelligence:

An existential risk is one that threatens to cause the extinction of Earth-originating intelligent life or to otherwise permanently and drastically destroy its potential for future desirable development. Proceeding from the idea of first-mover advantage, the orthogonality thesis, and the instrumental convergence thesis, we can now begin to see the outlines of an argument for fearing that a plausible default outcome of the creation of machine superintelligence is existential catastrophe.

First, we discussed how the initial superintelligence might obtain a decisive strategic advantage. This superintelligence would then be in a position to form a singleton and to shape the future of Earth-originating intelligent life. What happens from that point onward would depend on the superintelligence’s motivations.

Second, the orthogonality thesis suggests that we cannot blithely assume that a superintelligence will necessarily share any of the final values stereotypically associated with wisdom and intellectual development in humans—scientific curiosity, benevolent concern for others, spiritual enlightenment and contemplation, renunciation of material acquisitiveness, a taste for refined culture or for the simple pleasures in life, humility and selflessness, and so forth. We will consider later whether it might be possible through deliberate effort to construct a superintelligence that values such things, or to build one that values human welfare, moral goodness, or any other complex purpose its designers might want it to serve. But it is no less possible—and in fact technically a lot easier—to build a superintelligence that places final value on nothing but calculating the decimal expansion of pi. This suggests that—absent a special effort—the first superintelligence may have some such random or reductionistic final goal.

Third, the instrumental convergence thesis entails that we cannot blithely assume that a superintelligence with the final goal of calculating the decimals of pi (or making paperclips, or counting grains of sand) would limit its activities in such a way as not to infringe on human interests. An agent with such a final goal would have a convergent instrumental reason, in many situations, to acquire an unlimited amount of physical resources and, if possible, to eliminate potential threats to itself and its goal system. Human beings might constitute potential threats; they certainly constitute physical resources.

Taken together, these three points thus indicate that the first superintelligence may shape the future of Earth-originating life, could easily have non-anthropomorphic final goals, and would likely have instrumental reasons to pursue open-ended resource acquisition. If we now reflect that human beings consist of useful resources (such as conveniently located atoms) and that we depend for our survival and flourishing on many more local resources, we can see that the outcome could easily be one in which humanity quickly becomes extinct.

There are some loose ends in this reasoning, and we shall be in a better position to evaluate it after we have cleared up several more surrounding issues. In particular, we need to examine more closely whether and how a project developing a superintelligence might either prevent it from obtaining a decisive strategic advantage or shape its final values in such a way that their realization would also involve the realization of a satisfactory range of human values. (Bostrom, p. 115-116)

If we drop the 'likely discontinuity' premise, as some portion of the community is inclined to do, then OT and ICT are the main things left. A lot of weight would then rest on these two theses, unless we supplement them with new premises (e.g. ones related to mesa-optimization).

I'd also say that there are three especially salient secondary premises in the classic arguments: (a) even many seemingly innocuous descriptions of global utility functions ("maximize paperclips," "make me happy," etc.) would result in disastrous outcomes if these utility functions were optimized sufficiently well; (b) if a broadly/highly intelligent system is inclined toward killing you, it may be good at hiding this fact; and (c) if you decide to run a broadly superintelligent system, and that superintelligent system wants to kill you, you may be screwed even if you're quite careful in various regards (e.g. even if you implement "boxing" strategies). At least if we drop the discontinuity premise, though, I don't think these secondary premises are compelling enough to bump us up to a high credence in doom.

Comment by bmg on Developmental Stages of GPTs · 2020-08-03T16:14:26.219Z · LW · GW

I agree that your paper strengthens the ICT (and is also, in general, very cool!). One possible objection to the ICT, as traditionally formulated, has been that it's too vague: there are lots of different ways you could define a subset of possible minds, and then a measure over that subset, and not all of these ways actually imply that "most" minds in the subset have dangerous properties. Your paper definitely makes the ICT crisper, more clearly true, and more closely/concretely linked to AI development practices.

I still think, though, that the ICT only gets us a relatively small portion of the way to believing that extinction-level alignment failures are likely. A couple of thoughts I have are:

  1. It may be useful to distinguish between "power-seeking behavior" and omnicide (or equivalently harmful behavior). We do want AI systems to pursue power-seeking behaviors, to some extent. Making sure not to lock yourself in the bathroom, for example, qualifies as a power-seeking behavior -- it's akin to avoiding "State 2" in your diagram -- but it is something that we'd want any good house-cleaning robot to do. It's only a particular subset of power-seeking behavior that we badly want to avoid (e.g. killing people so they can't shut you off.)

    This being said, I imagine that, if we represented the physical universe as an MDP, and defined a reward function over states, and used a sufficiently low discount rate, then the optimal policy for most reward functions probably would involve omnicide. So the result probably does port over to this special case. Still, I think that keeping in mind the distinction between omnicide and "power-seeking behavior" (in the context of some particular MDP) does reduce the ominousness of the result to some degree.

  2. Ultimately, for most real-world tasks, I think it's unlikely that people will develop RL systems using hand-coded reward functions (and then deploy them). I buy the framing in (e.g.) the DM "scalable agent alignment" paper, Rohin's "narrow value learning" sequence, and elsewhere: that, over time, the RL development process will necessarily look less-and-less like "pick a reward function and then let an RL algorithm run until you get a policy that optimizes the reward function sufficiently well." There's seemingly just not that much that you can do using hand-written reward functions. I think that these more sophisticated training processes will probably be pretty strongly attracted toward non-omnicidal policies. At a higher level, engineers will also be attracted toward using training processes that produce benign/useful policies. They should have at least some ability to notice or foresee issues with classes of training processes, before any of them are used to produce systems that are willing and able to commit omnicide. Ultimately, in other words, I think it's reasonable to be optimistic that we'll do much better than random when producing the policies of advanced AI systems.

    I do still think that the ICT is true, though, and I do still think that it matters: it's (basically) necessary for establishing a high level of misalignment risk. I just don't think it's sufficient to establish a high level of risk (and am skeptical of certain other premises that would be sufficient to establish this).
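To make the MDP framing in point 1 a bit more concrete, here is a toy Python sketch. It is not the MDP or the diagram from the paper under discussion; the layout (a start state, one absorbing dead end, and a hub leading to several absorbing leaves), the uniform reward distribution, the discount factor, and all function names are assumptions made up for illustration. It samples random reward functions and counts how often the optimal first move keeps options open by going to the hub.

```python
import random

GAMMA = 0.99  # assumed discount factor; near 1 corresponds to a low discount rate


def optimal_first_move(r_dead_end, r_hub, r_leaves, gamma=GAMMA):
    """Return which neighbor of the start state an optimal policy moves to."""
    # Discounted return from entering the absorbing dead end and staying forever.
    v_dead_end = r_dead_end / (1 - gamma)
    # Return from passing through the hub once, then settling in the best leaf.
    v_hub = r_hub + gamma * max(r_leaves) / (1 - gamma)
    return "hub" if v_hub > v_dead_end else "dead_end"


def fraction_preferring_hub(n_samples=10_000, n_leaves=3, seed=0):
    """Estimate how often random rewards make the hub the optimal first move."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_samples):
        r_dead_end = rng.random()  # rewards drawn i.i.d. from Uniform[0, 1)
        r_hub = rng.random()
        r_leaves = [rng.random() for _ in range(n_leaves)]
        if optimal_first_move(r_dead_end, r_hub, r_leaves) == "hub":
            count += 1
    return count / n_samples


if __name__ == "__main__":
    print(fraction_preferring_hub())
```

As the discount factor approaches 1, the hub wins whenever the best of the three leaves beats the dead end, which happens for three quarters of i.i.d. reward draws. Note that this only illustrates the option-preserving kind of "power-seeking" (the bathroom example); nothing in the toy model corresponds to omnicide.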

Comment by bmg on Developmental Stages of GPTs · 2020-08-03T14:38:57.056Z · LW · GW

I think we can interpret it as a burden-shifting argument; "Look, given the orthogonality thesis and instrumental convergence, and various other premises, and given the enormous stakes, you'd better have some pretty solid arguments that everything's going to be fine in order to disagree with the conclusion of this book (which is that AI safety is extremely important)." As far as I know no one has come up with any such arguments, and in fact it's now the consensus in the field that no one has found such an argument.

I suppose I disagree, at least, that the orthogonality thesis and instrumental convergence shift the burden on their own. The OT basically says: "It is physically possible to build an AI system that would try to kill everyone." The ICT basically says: "Most possible AI systems within some particular set would try to kill everyone." If we stop here, then we haven't gotten very far.

To repurpose an analogy: Suppose that you lived very far back in the past and suspected that people would eventually try to send rockets with astronauts to the moon. It's true that it's physically possible to build a rocket that shoots astronauts out aimlessly into the depths of space. Most possible rockets that are able to leave Earth's atmosphere would also send astronauts aimlessly out into the depths of space. But I don't think it'd be rational to conclude, on these grounds, that future astronauts will probably be sent out into the depths of space. The fact that engineers don't want to make rockets that do this, and are reasonably intelligent, and can learn from lower-stakes experiences (e.g. unmanned rockets and toy rockets), does quite a lot of work. If what you're worried about is not just a single rocket trajectory failure, but systematically more severe trajectory failures (e.g. people sending larger and larger manned rockets out into the depths of space), then the rational degree of worry becomes lower still.

Even sillier example: It's possible to make poisons, and there are way more substances that are deadly to people than there are substances that inoculate people against coronavirus, but we don't need to worry much about killing everyone in the process of developing and deploying coronavirus vaccines. This is true even if it turns out that we don't currently know how to make an effective coronavirus vaccine.

I think the OT and ICT on their own almost definitely aren't enough to justify an above 1% credence in extinction from AI. To get the rational credence up into (e.g.) the 10%-50% range, I think that stuff like mesa-optimization concerns, discontinuity premises, explanations of how plausible development techniques/processes could go badly wrong, and explanations of dynamics around unnoticed deceptive tendencies in AI systems still need to do almost all of the work.

(Although a lot depends on how high a credence we're trying to justify. A 1% credence in human extinction from misaligned AI is more than enough, IMO, to justify a ton of research effort, although it also probably has pretty different prioritization implications than a 50% credence.)

Comment by bmg on Is the work on AI alignment relevant to GPT? · 2020-07-31T15:38:19.194Z · LW · GW

for example, the "Universal prior is malign" stuff shows that in the limit GPT-N would likely be catastrophic,

If you have a chance, I'd be interested in your line of thought here.

My initial model of GPT-3, and probably the model of the OP, is basically: GPT-3 is good at producing text that it would have been unsurprising to find on the internet. If we keep training up larger and larger models, using larger and larger datasets, it will produce text that it would be less-and-less surprising to find on the internet. Insofar as there are safety concerns, these mostly have to do with misuse -- or with people using GPT-N as a starting point for developing systems with more dangerous behaviors.

I'm aware that people who are more worried do have arguments in mind, related to stuff like inner optimizers or the characteristics of the universal prior, but I don't feel I understand them well -- and am, perhaps unfairly, beginning from a place of skepticism.

It's unfair to complain about GPT-3's lack of ability to simulate you to get out of the box, etc. since it's way too stupid for that, and the whole point of AI safety is to prepare for when AI systems are smart.

I think that OP's question is sort of about whether this way of speaking/thinking about GPT-3 makes sense in the first place.

Intentionally silly example: Suppose that people were expressing concern about the safety of graphing calculators, saying things like: "OK, the graphing calculator that you own is safe. But that's just because it's too stupid to recognize that it has an incentive to murder you, in order to achieve its goal of multiplying numbers together. The stupidity of your graphing calculator is the only thing keeping you alive. If we keep improving our graphing calculators, without figuring out how to better align their goals, then you will likely die at the hands of graphing-calculator-N."

Obviously, something would be off about this line of thought, although it's a little hard to articulate exactly what. In some way, it seems, the speaker's use of certain concepts (like "goals" and "stupidity") is probably to blame. I think that it's possible that there is an analogous problem, although certainly a less obvious one, with some of the safety discussion around GPT-3.