Posts

Distinguishing definitions of takeoff 2020-02-14T00:16:34.329Z · score: 39 (13 votes)
The case for lifelogging as life extension 2020-02-01T21:56:38.535Z · score: 16 (8 votes)
Inner alignment requires making assumptions about human values 2020-01-20T18:38:27.128Z · score: 25 (11 votes)
Malign generalization without internal search 2020-01-12T18:03:43.042Z · score: 36 (10 votes)
Might humans not be the most intelligent animals? 2019-12-23T21:50:05.422Z · score: 55 (25 votes)
Is the term mesa optimizer too narrow? 2019-12-14T23:20:43.203Z · score: 25 (10 votes)
Explaining why false ideas spread is more fun than why true ones do 2019-11-24T20:21:50.906Z · score: 30 (12 votes)
Will transparency help catch deception? Perhaps not 2019-11-04T20:52:52.681Z · score: 46 (13 votes)
Two explanations for variation in human abilities 2019-10-25T22:06:26.329Z · score: 74 (31 votes)
Misconceptions about continuous takeoff 2019-10-08T21:31:37.876Z · score: 71 (30 votes)
A simple environment for showing mesa misalignment 2019-09-26T04:44:59.220Z · score: 64 (26 votes)
One Way to Think About ML Transparency 2019-09-02T23:27:44.088Z · score: 20 (8 votes)
Has Moore's Law actually slowed down? 2019-08-20T19:18:41.488Z · score: 13 (9 votes)
How can you use music to boost learning? 2019-08-17T06:59:32.582Z · score: 10 (5 votes)
A Primer on Matrix Calculus, Part 3: The Chain Rule 2019-08-17T01:50:29.439Z · score: 10 (4 votes)
A Primer on Matrix Calculus, Part 2: Jacobians and other fun 2019-08-15T01:13:16.070Z · score: 22 (10 votes)
A Primer on Matrix Calculus, Part 1: Basic review 2019-08-12T23:44:37.068Z · score: 23 (10 votes)
Matthew Barnett's Shortform 2019-08-09T05:17:47.768Z · score: 7 (5 votes)
Why Gradients Vanish and Explode 2019-08-09T02:54:44.199Z · score: 27 (14 votes)
Four Ways An Impact Measure Could Help Alignment 2019-08-08T00:10:14.304Z · score: 21 (25 votes)
Understanding Recent Impact Measures 2019-08-07T04:57:04.352Z · score: 17 (6 votes)
What are the best resources for examining the evidence for anthropogenic climate change? 2019-08-06T02:53:06.133Z · score: 11 (8 votes)
A Survey of Early Impact Measures 2019-08-06T01:22:27.421Z · score: 24 (8 votes)
Rethinking Batch Normalization 2019-08-02T20:21:16.124Z · score: 21 (7 votes)
Understanding Batch Normalization 2019-08-01T17:56:12.660Z · score: 19 (7 votes)
Walkthrough: The Transformer Architecture [Part 2/2] 2019-07-31T13:54:44.805Z · score: 9 (9 votes)
Walkthrough: The Transformer Architecture [Part 1/2] 2019-07-30T13:54:14.406Z · score: 30 (13 votes)

Comments

Comment by matthew-barnett on Soft takeoff can still lead to decisive strategic advantage · 2020-02-19T01:42:56.648Z · score: 3 (2 votes) · LW · GW
The concern with AI is that an initially tiny entity might take over the world.

This is a concern with AI, but why is it _the_ concern? If, e.g., the United States could take over the world because it had some AI-enabled growth, why would that not be a big deal? I'm imagining you saying, "It's not unique to AI," but why does it need to be unique? If AI is the root cause of something on the order of Britain colonizing the world in the 19th century, this still seems like it could be concerning if there weren't any good governing principles established beforehand.

Comment by matthew-barnett on Preface to EAF's Research Agenda on Cooperation, Conflict, and TAI · 2020-02-19T00:05:36.401Z · score: 2 (1 votes) · LW · GW
There don't seem to be many plausible paths to s-risks: by default, we shouldn't expect them, because it would be quite surprising for an amoral AI system to think it was particularly useful or good for humans to _suffer_, as opposed to not exist at all, and there doesn't seem to be much reason to expect an immoral AI system.

I think this is probably false, but it's because I'm using the strict definition of s-risk.

I expect that, to the extent there is any human-like or animal-like stuff in the future, the sheer amount of computation that will be available implies that even proportionally small risks of suffering add up to greater aggregates of suffering than currently exist on Earth.

If 0.01% of an intergalactic civilization's resources were being used to host suffering programs, such as nature simulations, or extremely realistic video games, then this would certainly qualify as an s-risk, via the definition given here, "S-risks are events that would bring about suffering on an astronomical scale, vastly exceeding all suffering that has existed on Earth so far."

If you define s-risks as situations where proportionally large amounts of computation are focused on creating suffering, then I would agree with you. However, s-risks could still be important, because they might be unusually tractable. One reason is that even a very small group of people who strongly don't want suffering to exist could successfully lobby a society that has only a weak preference for allowing proportionally small amounts of suffering. Suffering might be unique among values in this respect, because on other issues people might be more inclined to fight you.

Comment by matthew-barnett on [deleted post] 2020-02-18T16:08:24.663Z

I removed this post because you convinced me it was sufficiently ill-composed. I still disagree strongly, because I don't really understand how you would agree with the person in the analogy. And again, CNNs still seem pretty good at representing data to me, and it's still unclear why model distillation disproves this.

Comment by matthew-barnett on [deleted post] 2020-02-18T07:43:27.708Z
some sudden theoretical breakthrough (e.g. on fast matrix multiplication)

These sorts of ideas seem possible, and I'm not willing to discard them as improbable just yet. I think a way to imagine my argument is that I'm saying, "Hold on, why are we assuming that this is the default scenario? I think we should be skeptical by default." And so in general counterarguments of the form, "But it could be wrong because of this" aren't great, because something being possible does not imply that it's likely.

Comment by matthew-barnett on [deleted post] 2020-02-18T07:15:50.893Z

I don't think it's impossible. I have wide uncertainty about timelines, and I certainly think that parts of our systems can get much more efficient. I should have made this more clear in the post. What I'm trying to say is that I am skeptical of a catch-all general efficiency gain that comes from a core insight into rationality, that makes systems much more efficient suddenly.

Comment by matthew-barnett on [deleted post] 2020-02-18T07:04:02.427Z
My reaction would be "sure, that sounds like exactly the sort of thing that happens from time to time".

Insights trickle in slowly. Over the long run, you can see vast efficiency improvements. But this seems unrealistically fast. Would you really believe that a single person or team did something like that (which, if true, would completely and radically reshape the field of computer vision) just because "it happens from time to time"?

In fact, if you replace the word "memory" with either "data" or "compute", then this has already happened with the advent of transformer architectures just within the past few years, on the training side of things.

Transformers are impressive, but how much of their usefulness comes from efficiency gains due to better representations of the data? I argue it's not orders of magnitude. OpenAI recently did this comparison to LSTMs, and this was their result.

Comment by matthew-barnett on [deleted post] 2020-02-18T01:52:27.686Z

My understanding was that distilling CNNs worked more-or-less by removing redundant weights, rather than by discovering a more efficient form of representing the data. Distilled CNNs are still CNNs and thus the argument follows.

My point was that you couldn't do better than just memorizing the features that make up a cat. I should clarify that I do think deep neural networks often contain a lot of wasted information (though I believe removing some of it incurs a cost in robustness). The question is whether future insights will allow us to do much better than what we currently do.

Comment by matthew-barnett on Suspiciously balanced evidence · 2020-02-13T04:58:34.835Z · score: 2 (1 votes) · LW · GW

Relevant post from Robin Hanson.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2020-02-01T21:46:36.085Z · score: 2 (1 votes) · LW · GW

Good point. Although, there's still a nonzero chance that they will die, if we continually extend the waiting period in some manner. And perhaps given their strong preference not to die, this is still violating their autonomy?

Comment by matthew-barnett on Matthew Barnett's Shortform · 2020-02-01T21:00:46.919Z · score: 4 (2 votes) · LW · GW

Is it possible to simultaneously respect people's wishes to live, and others' wishes to die?

Transhumanists are fond of saying that they want to give everyone the choice of when and how they die. Giving people the choice to die is clearly preferable to our current situation, as it respects their autonomy, but it leads to the following moral dilemma.

Suppose someone loves essentially every moment of their life. For tens of thousands of years, they've never once wished that they did not exist. They've never had suicidal thoughts, and have always expressed a strong interest in living forever, until time ends and after that too. But on one very unusual day, they feel bad for some random reason, and now they want to die. It happens to the best of us every few eons or so.

Should this person be allowed to commit suicide?

One answer is yes, because that answer favors their autonomy. But another answer says no, because this day is a fluke. In just one day they'll recover from their depression. Why let them die when tomorrow they will see their error? Or, as some would put it, why give them a permanent solution to a temporary problem?

There are a few ways of resolving the dilemma. First I'll talk about a way that doesn't resolve the dilemma. When I once told someone about this thought experiment, they proposed giving the person a waiting period. The idea was that if the person still wanted to die after the waiting period, then it was appropriate to respect their choice. This solution sounds fine, but there's a flaw.

Say the probability that you are suicidal on any given day is one in a trillion, and each day is independent. Every normal day you love life and want to live forever. However, no matter how long we make the waiting period, over an unbounded lifespan you will die with probability one, even given your strong preference not to. It is guaranteed that eventually you will express the desire to commit suicide, and then, independently on each day of the waiting period, continue wanting to commit suicide until you've waited out every day. Depending on the length of your waiting period, it may take googols of years for this to happen, but it will happen eventually.
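
A rough way to formalize this (my own notation, not in the original comment): divide a lifespan of T days into disjoint blocks of N+1 days, where N is the length of the waiting period in days and p = 10^-12 is the daily probability of wanting to die. Each block consists entirely of suicidal days, and is therefore fatal, with probability p^(N+1) > 0, and the blocks are independent, so

```latex
% Sketch in my own notation; p, N, and T are not defined in the original comment.
\Pr[\text{still alive after } T \text{ days}]
  \;\le\; \left(1 - p^{\,N+1}\right)^{\left\lfloor T/(N+1) \right\rfloor}
  \;\xrightarrow{\; T \to \infty \;}\; 0 .
```

For any fixed waiting period the exponent grows without bound, so the survival probability tends to zero, however astronomically slowly.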

So what's a better way? Perhaps we could allow your current self to die but then after that, replace you with a backup copy from a day ago when you didn't want to die. We could achieve this outcome by uploading a copy of your brain onto a computer each day, keeping it just in case future-you wants to die. This would solve the problem of you-right-now dying one day, because even if you decided to one day die, there would be a line of succession from your current self to future-you stretching out into infinity.

Yet others still would reject this solution, either because they don't believe that uploads are "really them" or because they think that this solution still disrespects your autonomy. I will focus on the second objection. Consider someone who says, "If I really, truly, wanted to die, I would not consider myself dead if a copy from a day ago was animated and given existence. They are too close to me, and if you animated them, I would no longer be dead. Therefore you would not be respecting my wish to die."

Is there a way to satisfy this person?

Alternatively, we could imagine setting up the following system: if someone wants to die, they are able to, but they must be uploaded and kept on file the moment before they die. Then, if at some point in the distant future, we predict that the world is such that they would have counterfactually wished to have been around rather than not existing, we reanimate them. Therefore, we fully respect their interests. If such a future never comes, then they will remain dead. But if a future comes that they would have wanted to be around to see, then they will be able to see it.

In this way, we are maximizing not only their autonomy, but also their hypothetical autonomy. For those who wished they had never been born, we can allow those people to commit suicide, and for those who do not exist but would have preferred existence if they did exist, we bring those people into existence. No one is dissatisfied with their state of affairs.

There are still a number of challenges to this view. We could first ask what mechanism we are using to predict whether someone would have wanted to exist, if they did exist. One obvious way is to simulate them, and then ask them "Do you prefer existing, or do you prefer not to exist?" But by simulating them, we are bringing them into existence, and therefore violating their autonomy if they say "I do not want to exist."

There could be ways of predicting this that do not rely on total simulation. But it is probably impossible to predict their answer perfectly without performing a simulation. At best, we could be highly confident. But if we were wrong, and someone did want to come into existence, yet we failed to predict that and so never brought them into existence, this would violate their autonomy.

Another issue arises when we consider that there might always be a future that the person would prefer to exist. Perhaps, in the eternity of all existence, there will always eventually come a time where even the death-inclined would have preferred to exist. Are we then disrespecting their ancient choice to remain nonexistent forever? There seem to be no easy answers.

We have arrived at an Arrow's impossibility theorem of sorts. Is there a way to simultaneously respect people's wishes to live forever and respect people's wishes to die, in a way that matches all of our intuitions? Perhaps not perfectly, but we could come close.

Comment by matthew-barnett on The Epistemology of AI risk · 2020-01-30T20:21:26.262Z · score: 4 (2 votes) · LW · GW

I think this justification for doing research now is valid. However, I think that as the systems developed further, researchers would be forced to shift their arguments for risk anyway, since the concrete ways that the systems go wrong would become readily apparent. It's possible that by that time it would be "too late", because the problems of safety are just too hard and researchers would wish they had made conceptual progress sooner (I'm pretty skeptical of this, though).

Comment by matthew-barnett on The Epistemology of AI risk · 2020-01-30T03:43:47.252Z · score: 2 (1 votes) · LW · GW

Even recursive self improvement can be framed as a gradual process. Recursive technological improvement is thousands of years old: the phenomenon of technology allowing us to build better technology has sustained economic growth. Recursive self improvement is simply a very local form of recursive technological improvement.

You could imagine systems will gradually get better at recursive self improvement. Some will improve themselves sort-of well, and these systems will pose risks. Some other systems will improve themselves really well, and pose greater risks. But we would have seen the latter phenomenon coming ahead of time.

And since there's no hard separation between recursive technological improvement and recursive self improvement, you could imagine technological improvement getting gradually more local, until all the relevant action is from a single system improving itself. In that case, there would also be warning signs before it was too late.

Comment by matthew-barnett on The Epistemology of AI risk · 2020-01-30T01:17:25.133Z · score: 2 (1 votes) · LW · GW
When you say "the last few years has seen many people here" for your 2nd/3rd paragraph, do you have any posts / authors in mind to illustrate?

For the utility of talking about utility functions, see this rebuttal of an argument justifying the use of utility functions by appealing to the VNM utility theorem, and a few more posts expanding the discussion. The CAIS paper argues that we shouldn't model future AI as having a monolithic long-term utility function. But it's by no means a settled debate.

For the rejection of stable self improvement as a research priority, Paul Christiano wrote a post in 2014 where he argued that stable recursive self improvement will be solved as a special case of reasoning under uncertainty. And again, the CAIS model proposes that technological progress will feed into itself (not unlike what already happens), rather than a monolithic agent improving itself.

I get the impression that very few people outside of MIRI work on studying stable recursive self improvement, though this might be because they think it's not their comparative advantage.

I agree that there has been a shift in what people write about because the field grew (as Daniel Filan pointed out). However, I don't remember reading anyone dismiss convergent instrumental goals such as increasing your own intelligence or utility functions as a useful abstraction to think about agency.

There's a difference between accepting something as a theoretical problem and accepting that it's a tractable research priority. I was arguing that the type of work we do right now might not be useful for future researchers; I wasn't trying to say that these things don't exist. Rather, it's not clear that productive work can be done on them right now. My evidence was that the way we think about these problems has changed over the years. Of course, you could say that the reason the research focus shifted is that we made progress, but I'd be skeptical of that hypothesis.

In your thread with ofer, he asked what was the difference between using loss functions in neural nets vs. objective functions / utility functions, and I haven't fully caught your opinion on that.

I don't quite understand the question? It's my understanding that I was disputing the notion that inner alignment should count as a "shift in arguments" for AI risk. I claimed that it was a refinement of the traditional arguments; more specifically, we decomposed the value alignment problem into two levels. I'm quite confused about what I'm missing here.

Comment by matthew-barnett on The Epistemology of AI risk · 2020-01-29T23:06:41.506Z · score: 2 (1 votes) · LW · GW

I'm not claiming any sort of knock-down argument. I understand that individual researchers often have very thoughtful reasons for thinking that their approach will work. I just take the heuristic seriously that it is very difficult to predict the future, or to change the course of history in a predictable way. My understanding of past predictions of the future is that they have been more-or-less horrible, and so skepticism of any particular line of research is pretty much always warranted.

In case you think AI alignment researchers are unusually good at predicting the future, and you would put them in a different reference class, I will point out that the type of AI risk stuff people on Lesswrong talk about now is different in meaningful ways from the stuff that was talked about here five or ten years ago.

To demonstrate: a common assumption was that, even without knowing the design of advanced AI architectures, we could minimally assume that an AI would maximize a utility function, since a utility function is a useful abstraction that seems robust to architectural changes in our underlying AI designs or future insights. The last few years have seen many people here either rejecting this argument, or finding it vacuous or underspecified as an argument. (I'm not taking a hard position; I'm merely pointing out that this shift has occurred.)

People also assumed that, even without knowing the design of advanced AI architectures, we could assume that an AI's first priority would be to increase its own intelligence, prompting researchers to study stable recursive self improvement. Again, the last few years have seen people here rejecting this argument, or concluding that it's not a priority for research. (Once again, I'm not here to argue whether this specific shift was entirely justified.)

I suspect that even very reasonable-sounding arguments of the type, "Well, we might not know what AI will look like, but minimally we can assume X, and X is a tractable line of research," will turn out to be suspicious in the end. That's not to say that some of these arguments won't be correct. Perhaps, if we're very careful, we can find out which ones are correct. I just have a strong heuristic of assuming future cluelessness.

Comment by matthew-barnett on The Epistemology of AI risk · 2020-01-29T21:19:13.278Z · score: 2 (1 votes) · LW · GW
The inner alignment problem seems to me as a separate problem rather than a "refinement of the traditional argument"

By refinement, I meant that the traditional problem of value alignment was decomposed into two levels, and at both levels values need to be aligned. I'm not quite sure why you've framed this as separate rather than as a refinement.

I'm not sure what you mean by saying "the rest of the book talking about unipolar outcomes". In what way do the parts in the book that discuss the orthogonality thesis, instrumental convergence and Goodhart's law assume or depend on a unipolar outcome?

The arguments for why those things pose a risk were the relevant part of the book. Specifically, it argued that because of those factors, and the fact that a single project could gain control of the world, it was important to figure everything out ahead of time, rather than waiting until the project was close to completion, because we don't get a second chance.

The analogy of children playing with a bomb is a particular example. If Bostrom had opted for presenting a gradual narrative, perhaps he would have said that the children will be given increasingly powerful firecrackers and will see the explosive power grow and grow. Or perhaps the sparrows would have trained a population of mini-owls before getting a big owl.

Can you give an example of a hypothetical future AI system—or some outcome thereof—that should indicate that humankind ought to start working a lot more on AI safety?

I don't think there's a single moment that should cause people to panic. Rather, it will be a gradual transition into more powerful technology.

Comment by matthew-barnett on The Epistemology of AI risk · 2020-01-29T18:51:22.338Z · score: 2 (1 votes) · LW · GW
I'm not sure I understand your model. Suppose AI safety researcher Alice writes a post about a problem that Nick Bostrom did not discuss in Superintelligence back in 2014 (e.g. the inner alignment problem)

I would call the inner alignment problem a refinement of the traditional argument from AI risk. The traditional argument was that there was going to be a powerful system that had a utility function it was maximizing and it might not match ours. Inner alignment says, well, it's not exactly like that. There's going to be a loss function used to train our AIs, and the AIs themselves will have internal objective functions that they are maximizing, and both of these might not match ours.

If all the new arguments were mere refinements of the old ones, then my argument would not work. I don't think that all the new ones are refinements of the old ones, however. For an example, try to map what failure looks like onto Nick Bostrom's model for AI risk. Influence-seeking sorta looks like what Nick Bostrom was talking about, but I don't think "Going out with a whimper" is what he had in mind (I haven't read the book in a while though).

It's been a while since I read Superintelligence, but I don't recall the book arguing that the "second‐place AI lab" will likely be much far behind the leading AI lab (in subjective human time) before we get superintelligence.

My understanding is that he spent one chapter talking about multipolar outcomes, and the rest of the book talking about unipolar outcomes, where a single team gains a decisive strategic advantage over the rest of the world (which seems impossible unless a single team surges forward in development). Robin Hanson had the same critique in his review of the book.

And even if it would have argued for that, as important as such an estimate may be, how is it relevant to the basic question of whether AI Safety is something humankind should be thinking about?

If AI takeoff is more gradual, there will be warning signs for each risk before it unfolds into a catastrophe. For any single source of existential risk from AI, I can plausibly point to a source of sub-existential risk that would occur in less powerful AI systems. If we ignored that risk, a disaster would occur, but it would be minor, and it would set a precedent for safety in the future.

This is important because if you have the point of view that AI safety must be solved ahead of time, before we actually build the powerful systems, then I would want to see specific technical reasons for why it will be so hard that we won't solve it during the development of those systems.

It's possible that we don't have good arguments yet, but good arguments could present themselves eventually and it would be too late at that point to go back in time and ask people in the past to start work on AI safety. I agree with this heuristic (though it's weak, and should only be used if there are not other more pressing existential risks to work on).

I also agree that there are conceptual arguments for why we should start AI safety work now, and I'm not totally convinced that the future will be either kind or safe to humanity. It's worth understanding the arguments for and against AI safety, lest we treat it as a team to be argued for.

Comment by matthew-barnett on Algorithms vs Compute · 2020-01-29T03:39:05.120Z · score: 5 (3 votes) · LW · GW

Until 2017, the best-performing language models were LSTMs, which had been around since 1997. However, LSTMs in their late era of dominance were distinguished from early LSTMs by the use of attention and a few other mechanisms, though it's unclear to me how much this boosted their performance.

The paper that unseated LSTMs for language modeling reported an additional 2.0 BLEU (the score ranges from 0 to 100) gained by switching to the new model, though this is likely an underestimate of the gain from switching to Transformers, given that the old state-of-the-art models had been tweaked very carefully.

My guess is that the 2000 model using 2020 compute would beat the 2020 model using 2000 compute easily, though I would love to see someone to do a deeper dive into this question.

Comment by matthew-barnett on AI Alignment 2018-19 Review · 2020-01-29T01:05:15.505Z · score: 2 (1 votes) · LW · GW
it's not obvious to me that supervised learning does

What type of scheme do you have in mind that would allow an AI to learn our values through supervised learning?

Typically, the problem with supervised learning is that it's too expensive to label everything we care about. In this case, are you imagining that we label some types of behaviors as good and some as bad, perhaps like what we would do with an approval-directed agent? Or are you thinking of something more general or exotic?

Comment by matthew-barnett on The Epistemology of AI risk · 2020-01-28T23:41:32.814Z · score: 2 (1 votes) · LW · GW

Apologies for the confusing language; I knew.

Comment by matthew-barnett on The Epistemology of AI risk · 2020-01-28T20:51:40.327Z · score: 10 (2 votes) · LW · GW
The majority of those who best know the arguments for and against thinking that a given social movement is the world's most important cause, from pro-life-ism to environmentalism to campaign finance reform, are presumably members of that social movement.

This seems unlikely to me, given my experience talking to people in other movements, including the ones you mentioned. The idea that what they're arguing for is "the world's most important cause" hasn't explicitly been considered by most of them, and of those who have considered it, few have done any sort of rigorous analysis.

By contrast, part of the big sell of EA is that it actively searches for the world's biggest causes, and uses a detailed methodology in pursuit of this goal.

Comment by matthew-barnett on The Epistemology of AI risk · 2020-01-28T19:38:35.895Z · score: 4 (2 votes) · LW · GW

This makes sense. However, I'd still point out that this is evidence that the arguments weren't convincing, since otherwise they would have used the same arguments, even though they are different people.

Comment by matthew-barnett on The Epistemology of AI risk · 2020-01-28T17:04:47.524Z · score: 2 (1 votes) · LW · GW

It could still be that the level of absolute risk is low, even after taking this into account. I concede that estimating risks like these is very difficult.

Comment by matthew-barnett on The Epistemology of AI risk · 2020-01-28T16:53:55.457Z · score: 9 (3 votes) · LW · GW
I feel like this depends on a whole bunch of contingent facts regarding our ability to accurately diagnose and correct what could be very pernicious problems such as deceptive alignment amidst what seems quite likely to be a very quickly changing and highly competitive world.

I agree, though I tend to think the costs associated with failing to catch deception will be high enough that any major team will be likely to bear the costs. If some team of researchers doesn't put in the effort, a disaster would likely occur that would be sub-x-risk level, and this would set a precedent for safety standards.

In general, I think humans tend to be very risk averse when it comes to new technologies, though there are notable exceptions (such as during wartime).

Why does being skeptical of very short timelines preclude our ability to do productive work on AI safety?

A full solution to AI safety will necessarily be contingent on the architectures used to build AIs. If we don't understand a whole lot about those architectures, this limits our ability to do concrete work. I don't find the argument entirely compelling because:

  • It seems reasonably likely that AGI will be built using more-or-less the deep learning paradigm, perhaps given a few insights, and therefore productive work can be done now, and
  • We can still start institutional work, and develop important theoretical insights.

But even given these qualifications, I estimate that the vast majority of productive work to make AIs safe will be completed when the AI systems are actually built, rather than before. It follows that most work during this pre-AGI period might miss important details and be less effective than we think.

And it seems to me like “probably helps somewhat” is enough when it comes to existential risk

I agree, which is why I spend a lot of my time reading and writing posts on Lesswrong about AI risk.

Comment by matthew-barnett on The Epistemology of AI risk · 2020-01-28T16:46:21.767Z · score: 2 (1 votes) · LW · GW

If the old arguments were sound, why would researchers shift their arguments in order to make the case that AI posed a risk? I'd assume that if the old arguments worked, the new ones would be a refinement rather than a shift. Indeed many old arguments were refined, but a lot of the new arguments seem very new.

Is there any core argument in the book Superintelligence that is no longer widely accepted among AI safety researchers?

I can't speak for others, but the general notion of there being a single project that leaps ahead of the rest of the world, and gains superintelligent competence before any other team can even get close, seems suspicious to many researchers that I've talked to. In general, the notion that there will be discontinuities in development is looked upon with suspicion by a number of people (though, notably, some researchers still think that fast takeoff is likely).

Comment by matthew-barnett on The Epistemology of AI risk · 2020-01-28T06:31:43.190Z · score: 5 (4 votes) · LW · GW

[ETA: It's unfortunate I used the word "optimism" in my comment, since my primary disagreement is whether the traditional sources of AI risk are compelling. I'm pessimistic in a sense, since I think by default our future civilization's values will be quite different from mine in important ways.]

My opinion is that AI is likely to be an important technology whose effects will largely determine the course of our future civilization and the outlook for humanity. And given how large that impact will be, it will also largely determine whether our values go extinct or survive. That said, it's difficult to understand the threat to our values from AI without a specific threat model. I appreciate trying to find specific ways that AI can go wrong, but I currently think:

  • We are probably not close enough to powerful AI to have a good understanding of the primary dynamics of an AI takeoff, and therefore what type of work will help our values survive one.
  • Our values will probably go extinct in some unavoidable manner that's not related to the typical sources of AI risk. In other words, it's likely that just general value drift and game-theoretic incentives will do more to destroy the value of the long-term future than technical AI errors.
  • The argument that continuous takeoff makes AI safe seems robust to most specific items on your list, though I can see several ways that the argument fails.

If AI does go wrong in one of the ways you have identified, it seems difficult to predict which one (though we can do our best to guess). It seems even harder to do productive work, since I'm skeptical of very short timelines.

Historically, our models of AI development have been notoriously poor. Ask someone from 10 years ago what they thought AI would look like, and it seems unlikely that they would have predicted deep learning in a way that would have been useful for making it safer. I suspect that unless AI arrives very soon, it will be very hard to do specific technical work to make it safer.

Comment by matthew-barnett on Have epistemic conditions always been this bad? · 2020-01-28T06:21:31.759Z · score: 3 (2 votes) · LW · GW
For example would you endorse making LW a "free speech zone" or try to push for blanket acceptance of free speech elsewhere?

I think limiting free speech in specific forums of discussion makes sense, given that it is very difficult to maintain a high-quality community without doing so. I think that declaring a particular place a "free speech zone" tends to invite the worst people to gather there (I've seen this over and over again on the internet).

More generally, I was talking about societal norms to punish speech deemed harmful. I think there's a relevant distinction between a professor getting fired for saying something deemed politically harmful, and an internet forum moderating discussion.

Comment by matthew-barnett on The Epistemology of AI risk · 2020-01-28T00:26:02.401Z · score: 2 (1 votes) · LW · GW
when we look at the distribution of opinion among those who have really “engaged with the arguments”, we are left with a substantial majority—maybe everyone but Hanson, depending on how stringent our standards are here!—who do believe that, one way or another, AI development poses a serious existential risk.

For what it's worth, I have "engaged with the arguments" but am still skeptical of the main ones. I also don't think that my optimism is very unusual for people who work on the problem. Based on an image from about five years ago (around the time Nick Bostrom's book came out), most people at FHI were pretty optimistic. Since then, it's my impression that researchers have become even more optimistic, since more people appear to accept continuous takeoff and there's been a shift in arguments. AI Impacts recently interviewed a few researchers who were also skeptical (including Hanson), and all of them have engaged with the main arguments. It's unclear to me that their opinions are actually substantially more optimistic than average.

Comment by matthew-barnett on Have epistemic conditions always been this bad? · 2020-01-26T17:27:24.454Z · score: 32 (12 votes) · LW · GW

Second, I think it is worth pointing out that there are definitely instances where, at least in my opinion, “canceling” is a valid tactic. Deplatforming violent rhetoric (e.g. Nazism, Holocaust denial, etc.) comes to mind as an obvious example.

If the people who determine what is cancel-able could consistently distinguish between violent rhetoric and non-violent rhetoric, and the boundary never expanded in some random direction, I would agree with you.

In practice, what often happens is that someone is cancelled over accusations of being a Nazi (or whatever), even when they aren't. Since defending a Nazi tends to make people think you are secretly also a Nazi, the people being falsely accused tend to get little support from outsiders.

Also, given that many views that EA endorses could easily fall outside the window of what's considered appropriate speech one day (such as reducing wild animal suffering, negative utilitarianism, or genetic enhancement), it is probably better to push for blanket acceptance of free speech than to just hope that future people will tolerate our ideas.

Comment by matthew-barnett on Inner alignment requires making assumptions about human values · 2020-01-23T21:47:28.846Z · score: 2 (1 votes) · LW · GW
Is your point mostly centered around there being no single correct way to generalize to new domains, but humans have preferences about how the AI should generalize, so to generalize properly, the AI needs to learn how humans want it to do generalization?

Pretty much, yeah.

The above sentence makes lots of sense to me, but I don't see how it's related to inner alignment

I think there are a lot of examples of this phenomenon in AI alignment, but I focused on inner alignment for two reasons

  • There's a heuristic that a solution to inner alignment should be independent of human values, and this argument rebuts that heuristic.
  • The problem of inner alignment is pretty much the problem of how to get a system to properly generalize, which makes "proper generalization" fundamentally linked to the idea.

Comment by matthew-barnett on Inner alignment requires making assumptions about human values · 2020-01-20T21:53:22.436Z · score: 5 (2 votes) · LW · GW
I also see how you might have a catastrophe-avoiding agent capable of large positive impacts, assuming an ontology but without assuming a lot about human preferences.

I find this interesting but I'd be surprised if it were true :). I look forward to seeing it in the upcoming posts.

That said, I want to draw your attention to my definition of catastrophe, which I think is different than the way most people use the term. I think most broadly, you might think of a catastrophe as something that we would never want to happen even once. But for inner alignment, this isn't always helpful, since sometimes we want our systems to crash into the ground rather than intelligently optimizing against us, even if we never want them to crash into the ground even once. And as a starting point, we should try to mitigate these malicious failures much more than the benign ones, even if a benign failure would have a large value-neutral impact.

A closely related notion to my definition is the term "unacceptable behavior" as Paul Christiano has used it. This is the way he has defined it,

In different contexts, different behavior might be acceptable and it’s up to the user of these techniques to decide. For example, a self-driving car trainer might specify: Crashing your car is tragic but acceptable. Deliberately covering up the fact that you crashed is unacceptable.

It seems like if we want to come up with a way to avoid these types of behavior, we simply must use some dependence on human values. I can't see how to consistently separate acceptable failures from non-acceptable ones except by inferring our values.

Comment by matthew-barnett on Inner alignment requires making assumptions about human values · 2020-01-20T21:41:28.256Z · score: 2 (1 votes) · LW · GW
Can you explain why you think there _IS_ a "true" factor

Apologies for the miscommunication, but I don't think there really is an objectively true factor. It's true to the extent that humans say that it's the true reward function, but I don't think it's a mathematical fact. That's part of what I'm arguing. I agree with what you are saying.

Comment by matthew-barnett on AI Alignment Open Thread October 2019 · 2020-01-20T21:38:44.748Z · score: 2 (1 votes) · LW · GW

Ahh. To be honest, I read that, but then responded to something different. I assumed you were just expressing general pessimism, since there's no guarantee that we would converge on good values upon a long reflection (and you recently viscerally realized that values are very arbitrary).

Now I see that your worry is more narrow: something like the Cultural Revolution might happen during this period, and it would be unwise to create AGI in its wake. I guess this seems quite plausible, and it is an important concern, though I personally am skeptical that anything like the long reflection will ever happen.

Comment by matthew-barnett on Outer alignment and imitative amplification · 2020-01-20T12:29:10.993Z · score: 3 (2 votes) · LW · GW
I tend to be fairly skeptical of these challenges—HCH is just a bunch of humans after all and if you can instruct them not to do things like instantiate arbitrary Turing machines, then I think a bunch of humans put together has a strong case for being aligned.

Minor nitpick: I mostly agree, but I feel like a lot of work is being done by saying that they can't instantiate arbitrary Turing machines, and that it's just a bunch of humans. Human society is also a bunch of humans, but it frequently does things that I can't imagine any single intelligent person deciding. If your model breaks down for relatively ordinary combinations of humans, I think there is a significant risk that true HCH would be dangerous in quite unpredictable ways.

Comment by matthew-barnett on AI Alignment Open Thread October 2019 · 2020-01-20T10:32:52.412Z · score: 2 (1 votes) · LW · GW
It sounds like you think that something like another Communist Revolution or Cultural Revolution could happen (that emphasizes some random virtues at the expense of others), but the effect would be temporary and after it's over, longer term trends will reassert themselves. Does that seem fair?

That's pretty fair.

I think it's likely that another cultural revolution could happen, and this could adversely affect the future if it happens simultaneously with a transition into an AI based economy. However, the deviations from long-term trends are very hard to predict, as you point out, and we should know about the specifics more as we get further along. In the absence of concrete details, I find it far more helpful to use information from long-term trends rather than worrying about specific scenarios.

Comment by matthew-barnett on AI Alignment Open Thread October 2019 · 2020-01-20T09:35:01.345Z · score: 2 (1 votes) · LW · GW

I could be wrong here, but the things you mentioned as counterexamples to my model appear either ephemeral or too particular. The "last few years" of political correctness is hardly enough time to judge world trends by, right? By contrast, the things I mentioned (the end of slavery, explicit policies against racism and war) seem likely to stick with us for decades, if not centuries.

We can explain this after the fact by saying that the Left is being forced by impersonal social dynamics, e.g., runaway virtue signaling, to over-correct, but did anyone predict this ahead of time?

When I listen to old recordings of right wing talk show hosts from decades ago, they seem to be saying the same stuff that current people are saying today, about political correctness and being forced out of academia for saying things that are deemed harmful by the social elite, or about the Left being obsessed by equality and identity. So I would definitely say that a lot of people predicted this would happen.

The main difference is that it's now been amplified as recent political events have increased polarization, the people with older values are dying of old age or losing their power, and we have social media that makes us more aware of what is happening. But in hindsight I think this scenario isn't that surprising.

Russia and China adopted communism even though they were extremely poor

Of course, you can point to a few examples where my model fails. I'm talking about general trends rather than specific cases. If we think in terms of world history, I would say that Russia in the early 20th century was "rich" in the sense that it was much richer than countries in previous centuries, and this enabled it to implement communism in the first place. Government power waxes and wanes, but over time I think its power has definitely gone up as the world has gotten richer, and I think this could have been predicted.

Comment by matthew-barnett on AI Alignment Open Thread October 2019 · 2020-01-20T07:38:01.946Z · score: 4 (2 votes) · LW · GW

Part of why I'm skeptical of these concerns is that it seems like a lot of moral behavior is predictable as society gets richer, and we can model the social dynamics to predict some outcomes will be good.

As evidence for the predictability, consider that rich societies are more open to LGBT rights; they have explicit policies against racism, war, slavery, and torture; and they seem to be moving in the direction of government control over many aspects of life, such as education and healthcare. Is this just a quirk of our timeline, or a natural feature of human civilizations as they get richer?

I am inclined to think much of it is the latter.

That's not to say that I think the current path we're going on is a good one. I just think it's more predictable than what you seem to think. Given its predictability, I feel somewhat confident in the following statements: eventually, when aging is cured, people will adopt policies that give people the choice to die. Eventually, when artificial meat is very cheap and tasty, people will ban animal-based meat.

I'm not predicting these outcomes because I am confusing what I hope for and what I think will happen. I just genuinely think that human virtue signaling dynamics will be favorable to those outcomes.

I'm less confident, leaning pessimistic about these questions: I don't think humans will inevitably care about wild animal suffering. I don't think humans will inevitably create a post-human utopia where people can modify their minds into any sort of blissful existence they imagine, and I don't think humans will inevitably care about subroutine suffering. It's these questions that make me uneasy about the future.

Comment by matthew-barnett on Malign generalization without internal search · 2020-01-14T21:05:30.177Z · score: 2 (1 votes) · LW · GW

Sure, we can talk about this over video. Check your Facebook messages.

Comment by matthew-barnett on Malign generalization without internal search · 2020-01-12T23:43:33.003Z · score: 2 (1 votes) · LW · GW
Computing the fastest route to Paris doesn't involve search?
More generally, I think in order for it to work your example can't contain subroutines that perform search over actions. Nor can it contain subroutines such that, when called in the order that the agent typically calls them, they collectively constitute a search over actions.

My example uses search, but the search is not the source of the inner alignment failure. It is merely a subroutine that is called upon by the outer superstructure, which is itself the part that is misaligned. Therefore, I fail to see why my point doesn't follow.

If your position is that inner alignment failures can only occur when internal searches are misaligned with the reward function used during training, then my example would be a counterexample to your claim, since the misalignment was not due to a misaligned search (except under some unnatural rationalization of the agent, which is a source of disagreement highlighted in the post, and in my discussion with Evan above).

Comment by matthew-barnett on Realism about rationality · 2020-01-12T21:05:12.683Z · score: 2 (1 votes) · LW · GW

How does evolutionary psychology help us during our everyday life? We already know that people like having sex and that they execute all these sorts of weird social behaviors. Why does providing the ultimate explanation for our behavior provide more than a satisfaction of our curiosity?

Comment by matthew-barnett on Malign generalization without internal search · 2020-01-12T19:48:30.914Z · score: 1 (5 votes) · LW · GW

If one's interpretation of the 'objective' of the agent is full of piecewise statements and ad-hoc cases, then what exactly are we doing by describing it as maximizing an objective in the first place? You might as well describe a calculator by saying that it's maximizing the probability of outputting the following: [write out the source code that leads to its outputs]. At some point the model breaks down, and the idea that the agent is following an objective is completely epiphenomenal to its actual operation. The model that it is maximizing an objective doesn't shed light on its internal operations any more than just spelling out exactly what its source code is.

Comment by matthew-barnett on Malign generalization without internal search · 2020-01-12T19:18:49.280Z · score: 5 (3 votes) · LW · GW
I feel like what you're describing here is just optimization where the objective is determined by a switch statement

Typically when we imagine objectives, we think of a score which rates how well an agent performed some goal in the world. How exactly does the switch statement 'determine' the objective?

Let's say that a human is given the instructions, "If you see the coin flip heads, then become a doctor. If you see the coin flip tails, then become a lawyer." What 'objective function' are they maximizing here? If it's some weird objective function like "probability of becoming a doctor in worlds where the coin flips heads, and probability of becoming a lawyer in worlds where the coin flips tails," this would seem unnatural, no? Why not simply describe it as a switch-case agent instead?

Remember, this matters because we want to be perfectly clear about which types of transparency schemes work. A transparency scheme that assumes the agent has a well-defined objective that it is using a search to optimize for would, I think, fail in the examples I gave. This becomes especially true if the if-statements are complicated nested structures, and repeat as part of some even more complicated loop, which seems likely.

ETA: Basically, you can always rationalize an objective function for any agent that you are given. But the question is simply: what's the best model of our agent, in the sense of being able to mitigate failures? I think most people would not categorize the lunar lander as a search-based agent, even though you could say that it is under some interpretation. The same is true of humans, plants, and animals.
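
To make the contrast concrete, here is a minimal, hypothetical sketch (the function names and cases are mine, not from the original discussion) of the difference between an agent that explicitly searches over actions to maximize an objective and a switch-case agent whose behavior is fixed by hard-coded branches:

```python
# Hypothetical illustration; names and cases are invented for this sketch.

def search_based_agent(state, actions, objective):
    """Explicitly searches for the action that scores highest on a well-defined objective."""
    return max(actions, key=lambda action: objective(state, action))

def switch_case_agent(observation):
    """Picks its behavior via hard-coded cases; no internal objective is ever consulted."""
    if observation == "heads":
        return "become_a_doctor"
    elif observation == "tails":
        return "become_a_lawyer"
    else:
        return "do_nothing"

# One can always rationalize an objective that the switch-case agent "maximizes",
# but that objective plays no role in how the agent actually computes its action,
# so a transparency tool that looks for a search over such an objective finds nothing.
```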

Comment by matthew-barnett on 2020's Prediction Thread · 2020-01-12T18:56:53.958Z · score: 1 (2 votes) · LW · GW

That's a good point, but it doesn't reduce my credence much. Perhaps 94% or 95% is more appropriate? I'd be willing to bet on this.

Comment by matthew-barnett on 2020's Prediction Thread · 2020-01-11T01:55:24.796Z · score: 4 (2 votes) · LW · GW

By hand I mean anything that closely resembles a human hand.

Comment by matthew-barnett on 2020's Prediction Thread · 2020-01-03T19:05:40.896Z · score: 2 (1 votes) · LW · GW

I'm willing to bet on this prediction.

Comment by matthew-barnett on 2020's Prediction Thread · 2020-01-02T02:22:23.423Z · score: 4 (2 votes) · LW · GW

A language model making it onto the NYT's bestseller list seems like a very specific thing. High level machine intelligence is not.

Comment by matthew-barnett on 2020's Prediction Thread · 2020-01-01T02:50:30.802Z · score: 2 (1 votes) · LW · GW

If AGI just means "can, in principle, solve any problem," then I think we could already build very, very slow AGI right now (at least for all well-defined problems -- you just perform a search over candidate solutions).

Plus, I don't think my definition matches the definition given by Bostrom.

By a "superintelligence" we mean an intellect that is much smarter than the best human brains in practically every field, including scientific creativity, general wisdom and social skills.

ETA: I edited the original post to be more specific.

Comment by matthew-barnett on 2020's Prediction Thread · 2019-12-31T21:30:39.724Z · score: 2 (1 votes) · LW · GW

To help calibrate, watch this video.

Comment by matthew-barnett on 2020's Prediction Thread · 2019-12-31T21:14:33.231Z · score: 3 (2 votes) · LW · GW

I will probably accept bets, although the fact that someone would be willing to bet me on some of mine is evidence that I'm overconfident, so I might re-evaluate my probability if someone offers.

Comment by matthew-barnett on 2020's Prediction Thread · 2019-12-31T20:54:25.418Z · score: 2 (1 votes) · LW · GW

I'm using a slightly modified version of the definition given by Grace et al. for high-level machine intelligence.

Comment by matthew-barnett on 2020's Prediction Thread · 2019-12-31T20:52:25.275Z · score: 2 (1 votes) · LW · GW

My main data point is that I'm not very impressed by OpenAI's robot hand. It is very impressive relative to what we had 10 years ago, but top humans are extremely adept at manipulating things in their hands.