

Comment by Mauricio on What failure looks like · 2021-12-26T08:06:46.242Z · LW · GW

A more recent clarification from Paul Christiano, on how Part 1 might get locked in / how it relates to concerns about misaligned, power-seeking AI:

I also consider catastrophic versions of "you get what you measure" to be a subset/framing/whatever of "misaligned power-seeking." I think misaligned power-seeking is the main way the problem is locked in.

Comment by Mauricio on My Overview of the AI Alignment Landscape: Threat Models · 2021-12-26T06:52:27.753Z · LW · GW

I'm still pretty confused by "You get what you measure" being framed as a threat model distinct from power-seeking AI (rather than as another sub-threat model). I'll try to address two defenses of framing them as distinct threat models, which I interpret this post as suggesting (in the context of my earlier comment on the overview post). Broadly, I'll be arguing that power-seeking AI is necessary for "you get what you measure" concerns to be plausible motivators for x-risk-focused people to pursue alignment research, so "you get what you measure" concerns are best thought of as a sub-threat model of power-seeking AI.

(Edit: OK, an aspect of "you get what you measure" concerns--the emphasis on something like "sufficiently strong optimization for some goal is very bad for different goals"--is a tweaked framing of power-seeking AI risk in general, rather than a subset. I think the aspect of "you get what you measure" concerns that's a subset of power-seeking AI risk is the emphasis on goal misspecification as the cause of misalignment. Either way, not a distinct threat model.)

Lock-in: Once we’ve noticed problems, how difficult will they be to fix, and how much resistance will there be? For example, despite the clear harms of CO2 emissions, fossil fuels are such an indispensable part of the economy that it’s incredibly hard to get rid of them. A similar thing could happen if AI systems become an indispensable part of the economy, which seems pretty plausible given how incredibly useful human-level AI would be. As another example, imagine how hard it would be to ban social media, if we as a society decided that this was net bad for the world.

Unless I'm missing something, this is just an argument for why AI might get locked in--not an argument for why misaligned AI might get locked in. AI becoming an indispensable part of the economy isn't a long-term problem if people remain capable of identifying and fixing problems with the AI. So we still need an additional lock-in mechanism (e.g. the initially deployed, misaligned AI being power-seeking) for there to be trouble. (If we're wondering how hard it will be to fix/improve non-power-seeking AI after it's been deployed, the difficulty of banning social media doesn't seem like a great analogy; a more relevant analogy would be the difficulty of fixing/improving social media after it's been deployed. Empirically, that doesn't seem so hard. For example, YouTube's recommendation algorithm started as a click-maximizer, and YouTube has already modified it to learn from human feedback (!).)

See Sam Clarke’s excellent post for more discussion of examples of lock-in.

I don't think Sam Clarke's post (which I'm also a fan of) proposes any lock-in mechanisms that (a) are plausible reasons for x-risk-focused people to pursue AI alignment work (and therefore be in the scope of this post) and (b) do not depend on AI being power-seeking. Clarke proposes five mechanisms by which Part 1 of "What Failure Looks Like" could get locked in -- addressing each of these in turn (in the context of his original post):

  • (1) short-term incentives and collective action -- arguably fails condition (a) or fails condition (b); if we don't assume AI will be power-seeking, then I see no reason why these difficulties would get much worse in hundreds of years than they are now, i.e. no reason why this on its own is a lock-in mechanism.
  • (2) regulatory capture -- the worry here is that the companies controlling AI might have and permanently act on bad values; this arguably fails condition (a), because if we're mainly worried about AI developers being bad, then focusing on intent alignment doesn't make that much sense.
  • (3) genuine ambiguity -- arguably fails condition (a) or fails condition (b); if we don't assume AI will be power-seeking, then I see no reason why these difficulties would get much worse in hundreds of years than they are now, i.e. no reason why this on its own is a lock-in mechanism.
  • (4) dependency and deskilling -- addressed above
  • (5) [AI] opposition to [humanity] taking back influence -- clearly fails condition (b)

So I think there remains no plausible alignment-relevant threat model for "You get what you measure" that doesn't fall under "power-seeking AI." (And even partial deference to Paul Christiano wouldn't imply that there is, since "What Failure Looks Like" Part 1 features power-seeking AI.)

Comment by Mauricio on Zvi’s Thoughts on the Survival and Flourishing Fund (SFF) · 2021-12-20T03:48:48.321Z · LW · GW

Thanks for the thoughtful reply--lots to think more about.

Comment by Mauricio on Zvi’s Thoughts on the Survival and Flourishing Fund (SFF) · 2021-12-15T08:08:11.843Z · LW · GW

(One view from which (political) power-seeking seems much less valuable is if we assume that, on the margin, this kind of power isn't all that useful for solving key problems. But if that were the crux, I'd have expected the original criticism to emphasize the (limited) benefits of power-seeking, rather than its costs.)

Comment by Mauricio on Zvi’s Thoughts on the Survival and Flourishing Fund (SFF) · 2021-12-15T07:20:22.183Z · LW · GW

In my model, one should be deeply skeptical whenever the answer to ‘what would do the most good?’ is ‘get people like me more money and/or access to power.’ One should be only somewhat less skeptical when the answer is ‘make there be more people like me’ or ‘build and fund a community of people like me.’ [...] I wish I had a better way to communicate what I find so deeply wrong here

I'd be very curious to hear more fleshed-out arguments here, if you/others think of them. My best guess about what you have in mind is that it's a combination of the following (lumping all the interventions mentioned in the quoted excerpt into "power-seeking"):

  1. People have personal incentives and tribalistic motivations to pursue power for their in-group, so we're heavily biased toward overestimating its altruistic value.
  2. Seeking power occupies resources/attention that could be spent figuring out how to solve problems, and figuring out how to solve problems is very valuable.
  3. Figuring out how to solve problems isn't just very valuable. It's necessary for things to go well, so just/mainly doing power-seeking makes it way too easy for us to get the mistaken impression that we're making progress and things are going well, while a crucial input into things going well (knowing what to do with power) remains absent.
  4. Power-seeking attracts leeches (which wastes resources and dilutes relevant fields).
  5. Power-seeking pushes people's attention away from object-level discussion and learning. (This is different from (3) in that (3) is about how power-seeking impacts a specific belief, while this point is about attention.)
  6. Power-seeking makes a culture increasingly value power for its own sake (i.e. "power corrupts"?), which is bad for the usual reasons that value drift is bad.

If that's it (is it?), then I'm more sympathetic than I was before writing out the above, but I'm still skeptical:

  • Re: 1: Speaking of object-level arguments, object-level arguments for the usefulness of power and field growth seem very compelling (and simple enough to significantly reduce room for bias).
    • The arguments I find most compelling are:
      • "It seems very plausible that AI will be at least as transformative as the agricultural revolution this century, and there are only ~50-200 people total (depending on how we count) working full-time on improving the long-term impacts of this transition. Oh man."
      • And the instrumental convergence arguments for power-seeking (which seem central enough to AI safety concerns for "power-seeking is bad because we've got to figure out AI safety" to be a really weird position).
  • 4 only seems like a problem with poorly executed power-seeking (although maybe that's hard to avoid?).
  • 2, 3, 5, and 6 seem to be horrific problems mostly just if power-seeking is the main activity of the community, rather than one of several activities. My current sense is that people tend to neither explicitly nor implicitly endorse having power-seeking be the main activity of the community as a whole (it's the main activity of some organizations, sure, but that's just specialization*).

*Maybe then we should worry about the above problems at the organizational level? But these concerns seem mostly cultural, and there seems to be a lot of cultural diffusion / cross-influence across organizations in the movement, so individual organizations are much less vulnerable to these problems than the movement as a whole.

Comment by Mauricio on Is "gears-level" just a synonym for "mechanistic"? · 2021-12-13T09:14:53.656Z · LW · GW

Yeah, my impression is that "mechanistic" is often used in the social sciences to refer to an idea very similar to "gears-level." E.g. as discussed in this highly-cited overview (with emphasis added):

The idea that science aims at providing mechanistic explanations of phenomena has a long history (Bechtel 2006) [...]. In philosophy of science, mechanistic explanation has been mainly discussed in the context of biological sciences [...] whereas in the social sciences the idea has been mostly discussed by social scientists themselves (Abbott 2007, Elster 1989, Elster 2007, Gross 2009, Hedström 2005, Hedström & Swedberg 1998, Manicas 2006, Mayntz 2004, Morgan & Winship 2007, Schmidt 2006, Tilly 2001, Wikström 2006). [...] In both contexts, the development of the idea of mechanistic explanation has been partly motivated by the shortcomings of the once hegemonic covering-law account of explanation (Hempel 1965). The basic idea of mechanistic explanation is quite simple: At its core, it implies that proper explanations should detail the ‘cogs and wheels’ of the causal process through which the outcome to be explained was brought about.

Comment by Mauricio on Self-Integrity and the Drowning Child · 2021-10-26T11:46:30.843Z · LW · GW

I agree with and appreciate the broad point. I'll pick on one detail because I think it matters.

this whole parable of the drowning child, was set to crush down the selfish part of you, to make it look like you would be invalid and shameful and harmful-to-others if the selfish part of you won [...]

It is a parable calculated to set at odds two pieces of yourself... arranging for one of them to hammer down the other in a way that would leave it feeling small and injured and unable to speak in its own defense.

This seems uncharitable? Singer's thought experiment may have had the above effects, but my impression's always been that it was calculated largely to help people recognize our impartially altruistic parts—parts of us that in practice seem to get hammered down, obliterated, and forgotten far more often than our self-focused parts (consider e.g. how many people do approximately nothing for strangers vs. how many people do approximately nothing for themselves).

So part of me worries that "the drowning child thought experiment is a calculated assault on your personal integrity!" is not just mistaken but yet another hammer with which people will beat down their own altruistic parts—the parts of us that protect those who are small and injured and unable to speak in their own defense.