The Epistemology of AI Risk

post by Michaël Trazzi (mtrazzi) · 2020-01-27T23:33:28.667Z · score: 23 (12 votes) · LW · GW · 35 comments

This is a link post for https://www.philiptrammell.com/blog/46

(Disclaimer: Philip Trammell is planning to rewrite this blogpost and make it clearer and more precise, but I think there's something good in this direction. Sharing it to see what the rest of the community thinks. The author is Philip Trammell, not me.)


"Some smart people, including some of my friends, believe that advanced AI poses a serious threat to human civilization in the near future, and that AI safety research is therefore one of the most valuable uses, if not the very most valuable use, of philanthropic talent and money. But most smart people, as far as I can judge from their behavior—including some, like Mark Zuckerberg and Robin Hanson, who have expressed their thoughts on this explicitly—do not believe this. (I, for whatever it's worth, am agnostic.) In my experience, when someone points out the existence of smart skeptics like these, believers often respond: “Sure, those people dismiss AI risk. But have they engaged with the arguments?”

If the answer is no, it seems obvious that those who have engaged with the arguments have nothing to learn from these skeptics' judgment. If you aren't worried about rain because you saw a weather report that predicts sun, and I also saw that but also saw an updated weather report that now predicts rain, I should predict rain—not update on your rain skepticism, however smart you may be. Likewise, if Mark Zuckerberg dismisses AI risk because his one exposure to the idea was a Paul Christiano blog post from 2015 with a mistake in it, which a 2016 blog post corrects, then it seems that we who have read both should not update our beliefs at all in light of Zuckerberg's opinion. And when we look at the distribution of opinion among those who have really “engaged with the arguments”, we are left with a substantial majority—maybe everyone but Hanson, depending on how stringent our standards are here!—who do believe that, one way or another, AI development poses a serious existential risk.

But something must be wrong with this inference, since it works for all kinds of mutually contradictory positions. The majority of scholars of every religion are presumably members of that religion. The majority of those who best know the arguments for and against thinking that a given social movement is the world's most important cause, from pro-life-ism to environmentalism to campaign finance reform, are presumably members of that social movement. The majority of people who have seriously engaged with the arguments for flat-earthism are presumably flat-earthers. I don't even know what those arguments are.

What's going wrong, I think, is something like this. People encounter uncommonly-believed propositions now and then, like “AI safety research is the most valuable use of philanthropic money and talent in the world” or “Sikhism is true”, and decide whether or not to investigate them further. If they decide to hear out a first round of arguments but don't find them compelling enough, they drop out of the process. (Let's say that how compelling an argument seems is its “true strength” plus some random, mean-zero error.) If they do find the arguments compelling enough, they consider further investigation worth their time. They then tell the evangelist (or search engine or whatever) why they still object to the claim, and the evangelist (or whatever) brings a second round of arguments in reply. The process repeats.

As should be clear, this process can, after a few iterations, produce a situation in which most of those who have engaged with the arguments for a claim beyond some depth believe in it. But this is just because of the filtering mechanism: the deeper arguments were only ever exposed to people who were already, coincidentally, persuaded by the initial arguments. If people were chosen at random and forced to hear out all the arguments, most would not be persuaded.

Perhaps more disturbingly, if the case for the claim in question is presented as a long fuzzy inference, with each step seeming plausible on its own, individuals will drop out of the process by rejecting the argument at random steps, each of which most observers would accept. Believers will then be in the extremely secure-feeling position of knowing not only that most people who engage with the arguments are believers, but even that, for any particular skeptic, her particular reason for skepticism seems false to almost everyone who knows its counterargument.

The upshot here seems to be that when a lot of people disagree with the experts on some issue, one should often give a lot of weight to the popular disagreement, even when one is among the experts and the people's objections sound insane. Epistemic humility can demand more than deference in the face of peer disagreement: it can demand deference in the face of disagreement from one's epistemic inferiors, as long as they're numerous. They haven't engaged with the arguments, but there is information to be extracted from the very fact that they haven't bothered engaging with them."
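The selection model in the quoted post lends itself to a quick simulation. The sketch below is my own illustration, not from Trammell's post; the population size, number of rounds, negative true strength, noise level, and threshold are all arbitrary assumptions. Each agent perceives each round's argument as its true strength plus mean-zero Gaussian noise, and keeps engaging only while every argument so far has seemed compelling:

```python
import random

def simulate(n_agents=100_000, n_rounds=5, true_strength=-0.5,
             noise=1.0, threshold=0.0, seed=0):
    """Count how many agents engage to each depth of the argument chain.

    An agent hears one argument per round, perceiving its strength as
    true_strength plus mean-zero Gaussian noise. The agent keeps engaging
    while each argument clears the threshold; otherwise they drop out.
    """
    rng = random.Random(seed)
    reached = [0] * (n_rounds + 1)  # reached[d] = agents who got to depth d
    for _ in range(n_agents):
        depth = 0
        while depth < n_rounds:
            perceived = true_strength + rng.gauss(0, noise)
            if perceived < threshold:
                break  # unconvinced at this round: drops out here
            depth += 1
        for d in range(depth + 1):
            reached[d] += 1
    return reached

reached = simulate()
for d, n in enumerate(reached):
    print(f"depth {d}: {n} agents ({100 * n / reached[0]:.1f}% of population)")
# Everyone still engaging at the deepest level found every argument so far
# compelling by construction, not because the claim (true strength < 0) is
# actually strong. Unanimity among deep engagers is pure selection.
```

With these assumptions only a fraction of a percent of agents reach the final round, yet every one of them found each argument compelling by construction: exactly the "everyone who has engaged with the arguments believes" pattern the post describes.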

35 comments

Comments sorted by top scores.

comment by AprilSR · 2020-01-28T01:28:18.780Z · score: 18 (8 votes) · LW(p) · GW(p)

People seem to be blurring the difference between "The human race will probably survive the creation of a superintelligent AI" and "This isn't even something worth being concerned about." Based on a quick google search, Zuckerberg denies that there's even a chance of existential risks here, whereas I'm fairly certain Hanson thinks there's at least some.

I think it's fairly clear that most skeptics who have engaged with the arguments to any extent at all are closer to the "probably survive" part of the spectrum than the "not worth being concerned about" part.

comment by Donald Hobson (donald-hobson) · 2020-01-28T14:35:06.509Z · score: 14 (6 votes) · LW(p) · GW(p)

Different minds use different criteria to evaluate an argument. Suppose that half the population were perfect rationalists, whose criteria for judging an argument depended only on Occam's razor and Bayesian updates. The other half are hard-coded biblical literalists, who only believe statements based on religious authority. So half the population will consider "Here are the short equations, showing that this concept has low Kolmogorov complexity" to be a valid argument, while the other half consider "Pope Clement said ..." to be a strong argument.

Suppose that any position that has strong religious and strong rationalist arguments for it is so obvious that no one is doubting or discussing it. Then most propositions believed by half the population have strong rationalist support, or strong religious support, but not both. If you are a rationalist and see one fairly good rationalist argument for X, you search for more info about X. Any religious arguments get dismissed as nonsense.

The end result is that the rationalists are having a serious discussion about AI risk among themselves. The religious dismiss AI as ludicrous based on some bible verse.

The religious people are having a serious discussion about the second coming of Christ and judgement day, which the rationalists dismiss as ludicrous.

The end result is a society where most of the people who have read much about AI risk think it's a thing, and most of the people who have read much about judgement day think it's a thing.

If you took some person from one side and forced them to read all the arguments on the other, they still wouldn't believe. Each side has the good arguments under their criteria of what a good argument is.

The rationalists say that the religious have poor epistemic luck and there is nothing we can do to help them now; when super-intelligence comes, it can rewire their brains. The religious say that the rationalists are cursed by the devil; when judgement day comes, they will be converted by the glory of God.

The rationalists are designing a super-intelligence, the religious are praying for judgement day.

Bad ideas and good ones can have similar social dynamics because most of the social dynamics around an idea depend on human nature.

comment by riceissa · 2020-01-28T06:06:56.855Z · score: 14 (5 votes) · LW(p) · GW(p)

As should be clear, this process can, after a few iterations, produce a situation in which most of those who have engaged with the arguments for a claim beyond some depth believe in it.

This isn't clear to me, given the model in the post. If a claim is false and there are sufficiently many arguments for the claim, then it seems like everyone eventually ends up rejecting the claim, including those who have engaged most deeply with the arguments. The people who engage deeply "got lucky" by hearing the most persuasive arguments first, but eventually they also hear the weaker arguments and counterarguments to the claim, so they end up at a level of confidence where they don't feel they should bother investigating further. These people can even have more accurate beliefs than the people who dropped out early in the process, depending on the cutoff that is chosen.

comment by Matthew Barnett (matthew-barnett) · 2020-01-28T20:51:40.327Z · score: 11 (3 votes) · LW(p) · GW(p)
The majority of those who best know the arguments for and against thinking that a given social movement is the world's most important cause, from pro-life-ism to environmentalism to campaign finance reform, are presumably members of that social movement.

This seems unlikely to me given my reactions to talking to people in other movements, including the ones you mentioned. The idea that what they're arguing for is "the world's most important cause" hasn't explicitly been considered by most of them, and for those who have, few have done any sort of rigorous analysis.

By contrast, part of the big sell of EA is that it actively searches for the world's biggest causes, and uses a detailed methodology in pursuit of this goal.

comment by Michaël Trazzi (mtrazzi) · 2020-01-28T23:04:11.218Z · score: 1 (1 votes) · LW(p) · GW(p)

the ones you mentioned

To be clear, this is a linkpost for Philip Trammell's blogpost. I'm not involved in the writing.

comment by Matthew Barnett (matthew-barnett) · 2020-01-28T23:41:32.814Z · score: 2 (1 votes) · LW(p) · GW(p)

Apologies for the confusing language; I knew that.

comment by Wei_Dai · 2020-01-28T02:53:30.075Z · score: 10 (6 votes) · LW(p) · GW(p)

My feeling is that the current ways that the most prominent AI risk people make their cases don't emphasize the disjunctive nature of AI risk [LW · GW] enough, and tend to focus too much on one particular line of argument that they're especially confident in (e.g., intelligence explosion / fast takeoff). As you say, "If they decide to hear out a first round of arguments but don’t find them compelling enough, they drop out of the process." Well that doesn't tell me much if they only heard about one line of argument in that first round.

comment by Michaël Trazzi (mtrazzi) · 2020-01-28T23:02:15.909Z · score: 1 (1 votes) · LW(p) · GW(p)

As you say

To be clear, the author is Philip Trammell, not me. Added quotes to make it clearer.

comment by Matthew Barnett (matthew-barnett) · 2020-01-28T00:26:02.401Z · score: 2 (1 votes) · LW(p) · GW(p)
when we look at the distribution of opinion among those who have really “engaged with the arguments”, we are left with a substantial majority—maybe everyone but Hanson, depending on how stringent our standards are here!—who do believe that, one way or another, AI development poses a serious existential risk.

For what it's worth, I have "engaged with the arguments" but am still skeptical of the main arguments. I also don't think that my optimism is very unusual for people who work on the problem, either. Based on an image from about five years ago (around the same time Nick Bostrom's book came out), most people at FHI were pretty optimistic. Since then, it's my impression that researchers have become even more optimistic, since more people appear to accept continuous takeoff [LW · GW] and there's been a shift in arguments. AI Impacts recently interviewed [LW · GW] a few researchers who were also skeptical (including Hanson), and all of them have engaged with the main arguments. It's unclear to me that their opinions are actually substantially more optimistic than average.

comment by ofer · 2020-01-28T15:14:10.823Z · score: 9 (2 votes) · LW(p) · GW(p)

and there's been a shift in arguments.

The set of arguments that are being actively discussed by AI safety researchers obviously changed since 2014 (which is true for any active field?). I assume that by "there's been a shift in arguments" you mean something more than that, but I'm not sure what.

Is there any core argument in the book Superintelligence that is no longer widely accepted among AI safety researchers? Has the progress in deep learning since 2014 made the core arguments in the book less compelling? (Do the arguments about instrumental convergence and Goodhart's law fail to apply to deep RL?)

comment by Matthew Barnett (matthew-barnett) · 2020-01-28T16:46:21.767Z · score: 2 (1 votes) · LW(p) · GW(p)

If the old arguments were sound, why would researchers shift their arguments in order to make the case that AI posed a risk? I'd assume that if the old arguments worked, the new ones would be a refinement rather than a shift. Indeed many old arguments were refined, but a lot of the new arguments seem very new.

Is there any core argument in the book Superintelligence that is no longer widely accepted among AI safety researchers?

I can't speak for others, but the general notion of there being a single project that leaps ahead of the rest of the world, and gains superintelligent competence before any other team can even get close, seems suspicious to many researchers that I've talked to. In general, the notion that there will be discontinuities in development is looked upon with suspicion by a number of people (though, notably, some researchers [LW(p) · GW(p)] still think that fast takeoff is likely).

comment by DanielFilan · 2020-01-28T16:52:29.440Z · score: 19 (7 votes) · LW(p) · GW(p)

If the old arguments were sound, why would researchers shift their arguments in order to make the case that AI posed a risk?

I get the sense that a lot of it is different people writing about it rather than people changing their minds.

comment by Matthew Barnett (matthew-barnett) · 2020-01-28T19:38:35.895Z · score: 4 (2 votes) · LW(p) · GW(p)

This makes sense. However, I'd still point out that this is evidence that the arguments weren't convincing, since otherwise they would have used the same arguments, even though they are different people.

comment by ofer · 2020-01-29T11:47:43.186Z · score: 5 (2 votes) · LW(p) · GW(p)

If the old arguments were sound, why would researchers shift their arguments in order to make the case that AI posed a risk? I'd assume that if the old arguments worked, the new ones would be a refinement rather than a shift. Indeed many old arguments were refined, but a lot of the new arguments seem very new.

I'm not sure I understand your model. Suppose AI safety researcher Alice writes a post about a problem that Nick Bostrom did not discuss in Superintelligence back in 2014 (e.g. the inner alignment problem). That doesn't seem to me like meaningful evidence for the proposition "the arguments in Superintelligence are not sound".

I can't speak for others, but the general notion of there being a single project that leaps ahead of the rest of the world, and gains superintelligent competence before any other team can even get close, seems suspicious to many researchers that I've talked to.

It's been a while since I listened to the audiobook version of Superintelligence, but I don't recall the book arguing that the "second‐place AI lab" will likely be far behind the leading AI lab (in subjective human time) before we get superintelligence. And even if it had argued for that, as important as such an estimate may be, how is it relevant to the basic question of whether AI safety is something humankind should be thinking about?

In general, the notion that there will be discontinuities in development is looked upon with suspicion by a number of people (though, notably, some researchers [LW(p) · GW(p)] still think that fast takeoff is likely).

I don't recall the book relying on (or [EDIT: with a lot less confidence] even mentioning the possibility of) a discontinuity in capabilities. I believe it does argue that once there are AI systems that can do anything humans can, we can expect extremely fast progress.

comment by Matthew Barnett (matthew-barnett) · 2020-01-29T18:51:22.338Z · score: 2 (1 votes) · LW(p) · GW(p)
I'm not sure I understand your model. Suppose AI safety researcher Alice writes a post about a problem that Nick Bostrom did not discuss in Superintelligence back in 2014 (e.g. the inner alignment problem)

I would call the inner alignment problem a refinement of the traditional argument from AI risk. The traditional argument was that there was going to be a powerful system that had a utility function it was maximizing and it might not match ours. Inner alignment says, well, it's not exactly like that. There's going to be a loss function used to train our AIs, and the AIs themselves will have internal objective functions that they are maximizing, and both of these might not match ours.

If all the new arguments were mere refinements of the old ones, then my argument would not work. I don't think that all the new ones are refinements of the old ones, however. For an example, try to map what failure looks like [LW · GW] onto Nick Bostrom's model for AI risk. Influence-seeking sorta looks like what Nick Bostrom was talking about, but I don't think "Going out with a whimper" is what he had in mind (I haven't read the book in a while though).

It's been a while since I read Superintelligence, but I don't recall the book arguing that the "second‐place AI lab" will likely be far behind the leading AI lab (in subjective human time) before we get superintelligence.

My understanding is that he spent one chapter talking about multipolar outcomes, and the rest of the book talking about unipolar outcomes, where a single team gains a decisive strategic advantage over the rest of the world (which seems impossible unless a single team surges forward in development). Robin Hanson had the same critique in his review of the book.

And even if it would have argued for that, as important as such an estimate may be, how is it relevant to the basic question of whether AI Safety is something humankind should be thinking about?

If AI takeoff is more gradual, there will be warning signs for each risk before it unfolds into a catastrophe. Consider any single source of existential risk from AI, and I can plausibly point to a source of sub-existential risk that would occur in less powerful AI systems. If we ignore that risk, a disaster would occur, but it would be minor, and this would set a precedent for safety in the future.

This is important because if you have the point of view that AI safety must be solved ahead of time, before we actually build the powerful systems, then I would want to see specific technical reasons for why it will be so hard that we won't solve it during the development of those systems.

It's possible that we don't have good arguments yet, but good arguments could present themselves eventually and it would be too late at that point to go back in time and ask people in the past to start work on AI safety. I agree with this heuristic (though it's weak, and should only be used if there are not other more pressing existential risks to work on).

I also agree that there are conceptual arguments for why we should start AI safety work now, and I'm not totally convinced that the future will be either kind or safe to humanity. It's worth understanding the arguments for and against AI safety, lest we treat it as a team to be argued for [? · GW].

comment by ofer · 2020-01-29T20:18:20.497Z · score: 1 (1 votes) · LW(p) · GW(p)

Inner alignment says, well, it's not exactly like that. There's going to be a loss function used to train our AIs, and the AIs themselves will have internal objective functions that they are maximizing, and both of these might not match ours.

As I understand the language, the "loss function used to train our AIs" matches "our objective function" from the classical outer alignment problem. The inner alignment problem seems to me to be a separate problem rather than a "refinement of the traditional argument" (we can fail due to just an inner alignment problem, and we can fail due to just an outer alignment problem).

My understanding is that he spent one chapter talking about multipolar outcomes, and the rest of the book talking about unipolar outcomes

I'm not sure what you mean by saying "the rest of the book talking about unipolar outcomes". In what way do the parts in the book that discuss the orthogonality thesis, instrumental convergence and Goodhart's law assume or depend on a unipolar outcome?

This is important because if you have the point of view that AI safety must be solved ahead of time, before we actually build the powerful systems, then I would want to see specific technical reasons for why it will be so hard that we won't solve it during the development of those systems.

Can you give an example of a hypothetical future AI system—or some outcome thereof—that should indicate that humankind ought to start working a lot more on AI safety?

comment by Matthew Barnett (matthew-barnett) · 2020-01-29T21:19:13.278Z · score: 2 (1 votes) · LW(p) · GW(p)
The inner alignment problem seems to me to be a separate problem rather than a "refinement of the traditional argument"

By refinement, I meant that the traditional problem of value alignment was decomposed into two levels, and at both levels, values need to be aligned. I am not quite sure why you have framed this as separate rather than as a refinement.

I'm not sure what you mean by saying "the rest of the book talking about unipolar outcomes". In what way do the parts in the book that discuss the orthogonality thesis, instrumental convergence and Goodhart's law assume or depend on a unipolar outcome?

The arguments for why those things pose a risk were the relevant part of the book. Specifically, it argued that because of those factors, and the fact that a single project could gain control of the world, it was important to figure everything out ahead of time, rather than waiting until the project was close to completion, because we don't get a second chance.

The analogy of children playing with a bomb is a particular example. If Bostrom had opted for presenting a gradual narrative, perhaps he would have said that the children will be given increasingly powerful firecrackers and will see the explosive power grow and grow. Or perhaps the sparrows would have trained a population of mini-owls before getting a big owl.

Can you give an example of a hypothetical future AI system—or some outcome thereof—that should indicate that humankind ought to start working a lot more on AI safety?

I don't think there's a single moment that should cause people to panic. Rather, it will be a gradual transition into more powerful technology.

comment by Michaël Trazzi (mtrazzi) · 2020-01-30T00:17:16.832Z · score: 1 (1 votes) · LW(p) · GW(p)

I get the sense that the crux here is more between fast / slow takeoffs than unipolar / multipolar scenarios.

In the case of a gradual transition into more powerful technology, what happens when the children of your analogy discover recursive self improvement?

comment by Matthew Barnett (matthew-barnett) · 2020-01-30T03:43:47.252Z · score: 2 (1 votes) · LW(p) · GW(p)

Even recursive self improvement can be framed gradually. Recursive technological improvement is thousands of years old. The phenomenon of technology allowing us to build better technology has sustained economic growth. Recursive self improvement is simply a very local form of recursive technological improvement.

You could imagine systems will gradually get better at recursive self improvement. Some will improve themselves sort-of well, and these systems will pose risks. Some other systems will improve themselves really well, and pose greater risks. But we would have seen the latter phenomenon coming ahead of time.

And since there's no hard separation between recursive technological improvement and recursive self improvement, you could imagine technological improvement getting gradually more local, until all the relevant action is from a single system improving itself. In that case, there would also be warning signs before it was too late.

comment by Michaël Trazzi (mtrazzi) · 2020-01-30T10:49:12.553Z · score: 2 (2 votes) · LW(p) · GW(p)

This framing really helped me think about gradual self-improvement, thanks for writing it down!

I agree with most of what you wrote. I still feel that in the case of an AGI re-writing its own code there's some sense of intent that has been absent from the past thousands of years of recursive technological improvement.

Agreed, you could still model Humanity as some kind of self-improving Human + Computer Colossus (cf. Tim Urban's framing) that somehow has some agency. But it's much less effective at self-improvement, and it's not thinking "yep, I need to invent this new science to optimize this utility function". I agree that the threshold is "when all the relevant action is from a single system improving itself".

there would also be warning signs before it was too late

And what happens then? Will we reach some kind of global consensus to stop any research in this area? How long will it take to build a safe "single system improving itself"? How will all the relevant actors behave in the meantime?

My intuition is that in the best scenario we reach some kind of AGI Cold War situation for long periods of time.

comment by Wei_Dai · 2020-01-30T05:20:20.672Z · score: 3 (1 votes) · LW(p) · GW(p)

If AI takeoff is more gradual, there will be warning signs for each risk before it unfolds into a catastrophe. Consider any single source of existential risk from AI, and I can plausibly point to a source of sub-existential risk that would occur in less powerful AI systems. If we ignore that risk, a disaster would occur, but it would be minor, and this would set a precedent for safety in the future.

It seems like we're seeing plenty of disasters that are likely caused by climate change, yet it hasn't "set a precedent for safety". Couldn't whatever dynamics cause people to turn a blind eye to climate change also apply to AI safety (i.e., companies/governments will fix/mop up AI safety disasters as they occur, but not institute sufficient systemic change to prevent similar or greater disasters in the future)?

comment by Wei_Dai · 2020-01-28T02:52:55.129Z · score: 4 (2 votes) · LW(p) · GW(p)

For what it’s worth, I have “engaged with the arguments” but am still skeptical of the main arguments. I also don’t think that my optimism is very unusual for people who work on the problem, either.

I'm curious if you've seen The Main Sources of AI Risk? [LW · GW] Have you considered all of those sources/kinds of risks and still think that the total AI-related x-risk is not very large?

comment by Matthew Barnett (matthew-barnett) · 2020-01-28T06:31:43.190Z · score: 5 (4 votes) · LW(p) · GW(p)

[ETA: It's unfortunate I used the word "optimism" in my comment, since my primary disagreement is whether the traditional sources of AI risk are compelling. I'm pessimistic in a sense, since I think by default our future civilization's values will be quite different from mine in important ways.]

My opinion is that AI is likely to be an important technology whose effects will largely determine our future civilization, and the outlook for humanity. And given that AI's impact will be so large, it will also largely determine whether our values go extinct or survive. That said, it's difficult to understand the threat to our values from AI without a specific threat model. I appreciate trying to find specific ways that AI can go wrong, but I currently think:

  • We are probably not close enough to powerful AI to have a good understanding of the primary dynamics of an AI takeoff, and therefore what type of work will help our values survive one.
  • The way our values might go extinct will probably happen in some unavoidable manner that's not related to the typical sources of AI risk. In other words, it's likely that just general value drift and game theoretic incentives will do more to destroy the value of the long-term future than technical AI errors.
  • The argument that continuous takeoff makes AI safe seems robust to most specific items on your list, though I can see several ways that the argument fails.

If AI does go wrong in one of the ways you have identified, it seems difficult to predict which one (though we can do our best to guess). It seems even harder to do productive work, since I'm skeptical of very short timelines.

Historically, our models of AI development have been notoriously poor. Ask someone from 10 years ago what they think AI might look like, and it seems unlikely that they would have predicted deep learning in a way that would have been useful for making it safer. I suspect that unless powerful AI arrives very soon, it will be very hard to do specific technical work to make it safer.

comment by Wei_Dai · 2020-01-28T18:45:46.994Z · score: 23 (6 votes) · LW(p) · GW(p)

It’s unfortunate I used the word “optimism” in my comment, since my primary disagreement is whether the traditional sources of AI risk are compelling.

May I beseech you to be more careful about using "optimism" and words like it in the future? I'm really worried about strategy researchers and decision makers getting the wrong impression [LW · GW] from AI safety researchers about how hard the overall AI risk problem is. For some reason I keep seeing people say that they're "optimistic" (or other words to that effect) when they mean optimistic about some sub-problem of AI risk rather than AI risk as a whole, without making that clear. In many cases it's pretty predictable that people outside technical AI safety research (or even inside, like in this case) will misinterpret that as optimism about AI risk overall.

comment by evhub · 2020-01-28T07:11:56.258Z · score: 18 (5 votes) · LW(p) · GW(p)

The argument that continuous takeoff makes AI safe seems robust to most specific items on your list, though I can see several ways that the argument fails.

I feel like this depends on a whole bunch of contingent facts regarding our ability to accurately diagnose and correct what could be very pernicious problems such as deceptive alignment [AF · GW] amidst what seems quite likely to be a very quickly changing and highly competitive world.

It seems even harder to do productive work, since I'm skeptical of very short timelines.

Why does being skeptical of very short timelines preclude our ability to do productive work on AI safety? Surely there are things we can be doing now to gain insight, build research/organizational capacity, etc. that will at least help somewhat, no? (And it seems to me like “probably helps somewhat” is enough when it comes to existential risk.)

comment by Matthew Barnett (matthew-barnett) · 2020-01-28T16:53:55.457Z · score: 9 (3 votes) · LW(p) · GW(p)
I feel like this depends on a whole bunch of contingent facts regarding our ability to accurately diagnose and correct what could be very pernicious problems such as deceptive alignment [LW · GW] amidst what seems quite likely to be a very quickly changing and highly competitive world.

I agree, though I tend to think the costs associated with failing to catch deception will be high enough that any major team will be likely to bear the costs. If some team of researchers doesn't put in the effort, a disaster would likely occur that would be sub-x-risk level, and this would set a precedent for safety standards.

In general, I think humans tend to be very risk averse when it comes to new technologies, though there are notable exceptions (such as during wartime).

Why does being skeptical of very short timelines preclude our ability to do productive work on AI safety?

A full solution to AI safety will necessarily be contingent on the architectures used to build AIs. If we don't understand much about those architectures, this limits our ability to do concrete work. I don't find the argument entirely compelling because:

  • It seems reasonably likely that AGI will be built using more-or-less the deep learning paradigm, perhaps given a few insights, and therefore productive work can be done now, and
  • We can still start institutional work, and develop important theoretical insights.

But even given these qualifications, I estimate that the vast majority of productive work to make AIs safe will be done once the AI systems are actually being built, rather than before. It follows that most work during this pre-AGI period might miss important details and be less effective than we think.

And it seems to me like “probably helps somewhat” is enough when it comes to existential risk

I agree, which is why I spend a lot of my time reading and writing posts on Lesswrong about AI risk.

comment by Wei_Dai · 2020-01-29T22:35:13.395Z · score: 5 (2 votes) · LW(p) · GW(p)

It follows that most work during this pre-AGI period might miss important details and be less effective than we think.

Do you think AI alignment researchers have not taken this into consideration already? For example, I'm pretty sure I've read arguments from Paul Christiano for why he is working on his approach even though we don't know how AGI will be built. MIRI people have made such arguments too, I think.

comment by Matthew Barnett (matthew-barnett) · 2020-01-29T23:06:41.506Z · score: 2 (1 votes) · LW(p) · GW(p)

I'm not claiming any sort of knock-down argument. I understand that individual researchers often have very thoughtful reasons for thinking that their approach will work. I just take seriously the heuristic that it is very difficult to predict the future, or to change the course of history in a predictable way. My understanding of past predictions of the future is that they have been more-or-less horrible, so skepticism of any particular line of research is pretty much always warranted.

In case you think AI alignment researchers are unusually good at predicting the future, and you would put them in a different reference class, I will point out that the type of AI risk stuff people on LessWrong talk about now is different in meaningful ways from the stuff that was talked about here five or ten years ago.

To demonstrate: a common assumption was that, in the absence of advanced AI architecture design, we could minimally assume that an AI would maximize a utility function, since a utility function is a useful abstraction that seems robust to architectural changes in our underlying AI designs or future insights. The last few years have seen many people here either rejecting this argument, or finding it to be vacuous or underspecified as an argument. (I'm not taking a hard position; I'm merely pointing out that this shift has occurred.)

People also assumed that, in the absence of advanced AI architecture design, we could assume that an AI's first priority would be to increase its own intelligence, prompting researchers to study stable recursive self-improvement. Again, the last few years have seen people here rejecting this argument, or concluding that it's not a priority for research. (Once again, I'm not here to argue whether this specific shift was entirely justified.)

I suspect that even very reasonable-sounding arguments of the type, "Well, we might not know what AI will look like, but minimally we can assume X, and X is a tractable line of research," will turn out to be suspicious in the end. That's not to say that some of these arguments won't be correct. Perhaps, if we're very careful, we can find out which ones are correct. I just have a strong heuristic of assuming future cluelessness.

comment by Michaël Trazzi (mtrazzi) · 2020-01-30T00:10:17.812Z · score: 9 (2 votes) · LW(p) · GW(p)

When you say "the last few years have seen many people here" in your 2nd/3rd paragraphs, do you have any posts / authors in mind to illustrate?

I agree that there has been a shift in what people write about because the field grew (as Daniel Filan pointed out). However, I don't remember reading anyone dismiss convergent instrumental goals (such as increasing one's own intelligence) or utility functions as a useful abstraction for thinking about agency.

In your thread with ofer, he asked what the difference was between using loss functions in neural nets and objective functions / utility functions, and I haven't fully grasped your opinion on that.

comment by Matthew Barnett (matthew-barnett) · 2020-01-30T01:17:25.133Z · score: 2 (1 votes) · LW(p) · GW(p)
When you say "the last few years have seen many people here" in your 2nd/3rd paragraphs, do you have any posts / authors in mind to illustrate?

For the utility of talking about utility functions, see this rebuttal [? · GW] of an argument justifying the use of utility functions by appealing to the VNM utility theorem, and a few [LW · GW] more [LW · GW] posts [LW · GW] expanding the discussion. The CAIS paper [LW · GW] argues that we shouldn't model future AI as having a monolithic long-term utility function. But it's by no means a settled debate.

For the rejection of stable self-improvement as a research priority, Paul Christiano wrote a post [LW · GW] in 2014 where he argued that stable recursive self-improvement will be solved as a special case of reasoning under uncertainty. And again, the CAIS model proposes that technological progress will feed into itself (not unlike what already happens), rather than a monolithic agent improving itself.

I get the impression that very few people outside of MIRI work on studying stable recursive self-improvement, though this might be because they think it's not their comparative advantage.

I agree that there has been a shift in what people write about because the field grew (as Daniel Filan pointed out). However, I don't remember reading anyone dismiss convergent instrumental goals (such as increasing one's own intelligence) or utility functions as a useful abstraction for thinking about agency.

There's a difference between accepting something as a theoretical problem and accepting it as a tractable research priority. I was arguing that the type of work we do right now might not be useful for future researchers; I wasn't trying to say that these things don't exist. Rather, it's not clear that productive work can be done on them right now. My evidence was that the way we think about these problems has changed over the years. Of course, you could say that the research focus shifted because we made progress, but I'd be skeptical of that hypothesis.

In your thread with ofer, he asked what was the difference between using loss functions in neural nets vs. objective function / utility functions and I haven't fully catched your opinion on that.

I don't quite understand the question? It's my understanding that I was disputing the notion that inner alignment should count as a "shift in arguments" for AI risk. I claimed that it was a refinement of the traditional arguments; more specifically, we decomposed the value alignment problem into two levels. I'm quite confused about what I'm missing here.

comment by Michaël Trazzi (mtrazzi) · 2020-01-30T12:12:11.420Z · score: 1 (1 votes) · LW(p) · GW(p)

Thanks for all the references! I don't have much time to read all of it right now, so I can't really engage with the specific arguments for rejecting utility functions or the study of recursive self-improvement.

I essentially agree with most of what you wrote. There is maybe a slight disagreement in how you framed (not what you meant) the way research focus has shifted since 2014.

I see Superintelligence as essentially saying "hey, there is problem A. And even if we solve A, we might also have B. And given C and D, there might be E." Now that the field is more mature and we have many more researchers getting paid to work on these problems, the arguments have become much more goal-focused. Now people are saying "I'm going to make progress on sub-problem X by publishing a paper on Y. And working on Z is not cost-effective, so I'm not going to work on it given humanity's current time constraints."

These approaches are often grouped as "focused on long-term problems" and "focused on making tractable progress now". In the first group you have Yudkowsky 2010, Bostrom 2014, MIRI's current research, and maybe CAIS. In the second one you have current CHAI/FHI/OpenAI/DeepMind/Ought papers.

Your original framing can be interpreted as "after proving some mathematical theorems, people rejected the main arguments of Superintelligence, and now most of the community agrees that working on X, Y, and Z is tractable but A, B, and C are more controversial".

I think a more nuanced and precise framing would be: "In Superintelligence, Bostrom exhaustively lays out the risks associated with advanced AI. A short portion of the book is dedicated to the problems we are working on right now. People stopped working on the other problems (the largest portion of the book) because 1) there hasn't been really productive work on them, 2) some rebuttals have been written online giving convincing arguments that those problems are not tractable anyway, and 3) there are now well-funded research organizations with incentives to make tangible progress on those problems."

In your last framing, you presented precise papers/rebuttals (thanks again!) for 2), and I think rebuttals are a great reason to stop working on a problem, but I think they're neither the only reason nor the real reason people stopped working on those problems. To be fair, I think 1) can be explained by many more factors than "it's theoretically impossible to make progress on those problems". It may be that the research mindset required to work on them is less socially/intellectually validating, or requires much more theoretical approaches, and so is off-putting/tiresome to most recent grads entering the field. I also think that AI safety is now much more intertwined with evidence-based approaches such as effective altruism than it was in 2014, which explains 3): people now present their research as "partial solutions to the problem of AI safety" or as a "research agenda".

To be clear, I'm not criticizing the current shift in research. I think it's productive for the field, both in the short term and the long term. To give a bit more personal context, I started getting interested in AI safety after reading Bostrom and have always been more interested in the "finding problems" approach. I went to FHI to work on AI safety because I was super interested in finding new problems related to the treacherous turn. It's now almost taboo to say that we're working on problems that sub-optimally minimize AI risk, but the real reason that pushed me to think about those problems was that they were both important and interesting. The problem with the current "shift in framing" is that it makes it socially unacceptable for people to think about or work on more long-term problems where there is more variance in research productivity.

I don't quite understand the question?

Sorry about that. I thought there was some link to our discussion about utility functions but I misunderstood.

EDIT: I also wanted to mention that the number of pages in a book doesn't reflect how important the author thinks a problem is (Bostrom even comments on this in the postface of his book). Again, the book is mostly about saying "here are all the problems", not "these are the tractable problems we should start working on, and we should dedicate research resources proportionally to the number of pages I spend on each in the book".

comment by evhub · 2020-01-30T20:14:48.980Z · score: 8 (2 votes) · LW(p) · GW(p)

I feel like you are drawing the wrong conclusion from the shift in arguments that has occurred. I would argue that what look like wrong ideas that ended up not contributing to future research could actually have been quite necessary for progressing the field's understanding as a whole. That is, maybe we really needed to engage with utility functions first before we could start breaking down that assumption—or maybe optimization daemons were a necessary step towards understanding mesa-optimization. Thus, I don't think the shift in arguments at all justifies the conclusion that prior work wasn't very helpful, as the prior work could have been necessary to achieve that very shift.

comment by Matthew Barnett (matthew-barnett) · 2020-01-30T20:21:26.262Z · score: 4 (2 votes) · LW(p) · GW(p)

I think this justification for doing research now is valid. However, I think that as the systems develop further, researchers will be forced to shift their arguments for risk anyway, since the concrete ways the systems go wrong will be readily apparent. It's possible that by that time it would be "too late": the problems of safety may just be too hard, and researchers will wish they had made conceptual progress sooner (I'm pretty skeptical of this, though).

comment by jessicata (jessica.liu.taylor) · 2020-01-28T07:37:32.809Z · score: 17 (6 votes) · LW(p) · GW(p)

Note that lack of ability to know what alignment work would be useful to do ahead of time increases, rather than decreases, the absolute level of risk; thus, it increases rather than decreases the risk metrics (e.g. probability of humans being wiped out) that FHI estimated.

comment by Matthew Barnett (matthew-barnett) · 2020-01-28T17:04:47.524Z · score: 2 (1 votes) · LW(p) · GW(p)

It could be that the level of absolute risk is still low, even after taking this into account. I concede that estimating risks like these is very difficult.

comment by Pattern · 2020-01-28T20:14:48.827Z · score: 1 (1 votes) · LW(p) · GW(p)
The majority of those who best know the arguments for and against thinking that a given social movement is the world's most important cause, from pro-life-ism to environmentalism to campaign finance reform, are presumably members of that social movement. The majority of people who have seriously engaged with the arguments for flat-earthism are presumably flat-earthers. I don't even know what those arguments are.

Evidence itself is agnostic, and not an argument.