Posts

AI Governance Needs Technical Work 2022-09-05T22:28:06.423Z
AI Governance Fundamentals - Curriculum and Application 2021-11-30T02:19:59.104Z

Comments

Comment by Mau (Mauricio) on Are short timelines actually bad? · 2023-02-06T02:32:59.997Z · LW · GW

I agree with parts of that. I'd also add the following (or I'd be curious why they're not important effects):

  • Slower takeoff -> warning shots -> improved governance (e.g. through most/all major actors getting clear[er] evidence of risks) -> less pressure to rush
  • (As OP argued) Shorter timelines -> China has less of a chance to have leading AI companies -> less pressure to rush

More broadly though, maybe we should be using more fine-grained concepts than "shorter timelines" and "slower takeoffs":

  • The salient effects of "shorter timelines" seem pretty dependent on what the baseline is.
    • The point about China seems important if the baseline is 30 years, and not so much if the baseline is 10 years.
  • The salient effects of "slowing takeoff" seem pretty dependent on what part of the curve is being slowed. Slowing it down right before there's large risk seems much more valuable than (just) slowing it down earlier in the curve, as the last few years' investments in LLMs did.
Comment by Mau (Mauricio) on Gradient hacking is extremely difficult · 2023-01-25T04:48:16.910Z · LW · GW

Thanks for writing! I agree the factors this post describes make some types of gradient hacking extremely difficult, but I don't see how they make the following approach to gradient hacking extremely difficult.

Suppose that an agent has some trait which gradient descent is trying to push in direction x because the x-ness of that trait contributes to the agent’s high score; and that the agent wants to use gradient hacking to prevent this. Consider three possible strategies that the agent might try to implement, upon noticing that the x-component of the trait has increased [...] [One potential strategy is] Deterministically increasing the extent to which it fails as the x-component increases.

(from here)

This approach to gradient hacking seems plausibly resistant to the factors this post describes, by the following reasoning: With the above approach, the gradient hacker only worsens performance by a small amount. At the same time, the gradient hacker plausibly improves performance in other ways, since the planning abilities that lead to gradient hacking may also lead to good performance on tasks that demand planning abilities. So, overall, modifying or reducing the influence of the gradient hacker plausibly worsens performance. In other words, gradient descent might not modify away a gradient hacker because gradient hacking is convergently incentivized behavior that worsens performance only slightly on its own (and not at all on net).
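To make the quoted strategy concrete, here's a toy numerical sketch (my own construction, not from the post or the comment thread): a scalar "trait" x improves the base loss as it grows, but a hacker term deterministically adds failures as x rises past a value the hacker wants to defend, so the local gradient on x flattens or reverses.

```python
import numpy as np

def base_loss(x):
    # Task loss alone: gradient descent "wants" to push x up toward 5.
    return (x - 5.0) ** 2

def hacked_loss(x, hack_strength):
    # Hypothetical gradient hacker: it deterministically fails more as x moves
    # past the value it defends (x = 1.0), adding only a small loss near that point.
    sabotage = hack_strength * np.maximum(x - 1.0, 0.0) ** 2
    return base_loss(x) + sabotage

def num_grad(f, x, eps=1e-5):
    # Central-difference estimate of df/dx.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 1.5
print(num_grad(lambda v: hacked_loss(v, 0.0), x))   # ~ -7: descent pushes x up
print(num_grad(lambda v: hacked_loss(v, 10.0), x))  # ~ +3: the push on x is reversed
```

Near the defended value the sabotage term costs very little loss, which is the sense in which the hacker "only worsens performance by a small amount"; whether real models could implement anything like this deterministic failure term is exactly what's in dispute.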

(Maybe gradient descent would then train the model to have a heuristic of not doing gradient hacking, while keeping the other benefits of improved planning abilities? But I feel pretty clueless about whether gradient hacking would be encoded in a way that allows such a heuristic to be inserted.)

(I read kind of quickly so may have missed something.)

Comment by Mau (Mauricio) on The case against AI alignment · 2022-12-25T18:42:36.911Z · LW · GW

Ah sorry, I meant the ideas introduced in this post and this one (though I haven't yet read either closely).

Comment by Mau (Mauricio) on The case against AI alignment · 2022-12-25T07:56:24.834Z · LW · GW

Thanks for posting, but I think these arguments have major oversights. This leaves me more optimistic about the extent to which people will avoid and prevent the horrible misuse you describe.

First, this post seems to overstate the extent to which people tend to value and carry out extreme torture. Maximally cruel torture fortunately seems very rare.

  • The post asks "How many people have you personally seen who insist on justifying some form of suffering for those they consider undesirable[?]" But "justifying some form of suffering" isn't actually an example of justifying extreme torture.
  • The post asks, "What society hasn’t had some underclass it wanted to put down in the dirt just to lord power over them?" But that isn't actually an example of people endorsing extreme torture.
  • The post asks, "How many ordinary, regular people throughout history have become the worst kind of sadist under the slightest excuse or social pressure to do so to their hated outgroup?" But has it really been as many as the post suggests? The historical and ongoing atrocities that come to mind were cases of serious suffering in the context of moderately strong social pressure/conditioning--not maximally cruel torture in the context of slight social pressure.
  • So history doesn't actually give us strong reasons to expect maximally suffering-inducing torture at scale (edit: or at least, the arguments this post makes for that aren't strong).

Second, this post seems to overlook a major force that often prevents torture (and which, I argue, will be increasingly able to succeed at doing so): many people disvalue torture and work collectively to prevent it.

  • Torture tends to be illegal and prosecuted. The trend here seems to be positive, with cruelty against children, animals, prisoners, and the mentally ill being increasingly stigmatized, criminalized, and prosecuted over the past few centuries.
  • We're already seeing AI development become highly centralized, with these leading AI developers working to make their AI systems strike some balance between helpful and harmless, i.e. not just letting users carry out whatever misuse they want.
  • Today, the cruelest acts of torture seem to be small-scale acts pursued by not-very-powerful individuals, while (as mentioned above) powerful actors tend to disvalue and work to prevent torture. Most people will probably continue to support the prevention and prosecution of very cruel torture, since that's the usual trend, and also because people would want to ensure that they do not themselves end up as victims of horrible torture. In the future, people will be better equipped to enforce these prohibitions, through improved monitoring technologies (e.g. monitoring and enforcement mechanisms built onto all AI chips).

Third, this post seems to overlook arguments for why AI alignment may be worthwhile (or opposing it may be a bad idea), even if a world with aligned AI wouldn't be worthwhile on its own. My understanding is that most people focused on preventing extreme suffering find such arguments compelling enough to avoid working against alignment, and sometimes even to work towards it.

  • Concern over s-risks will lose support and goodwill if adherents try to kill everyone, as the poster suggests they intend to do ("I will oppose any measure which makes the singularity more likely to be aligned with somebody’s values"). Then, if we do end up with aligned AI, it'll be significantly less likely that powerful actors will work to stamp out extreme suffering.
  • The highest-leverage intervention for preventing suffering is arguably coordinating/trading with worlds where there is a lot of it, and humanity won't be able to do that if we lose control of this world.

These oversights strike me as pretty reckless, when arguing for letting (or making) everyone die.

Comment by Mau (Mauricio) on Let’s think about slowing down AI · 2022-12-22T23:28:46.561Z · LW · GW

Thanks for writing!

I want to push back a bit on the framing used here. Instead of the framing "slowing down AI," another framing we could use is, "lay the groundwork for slowing down in the future, when extra time is most needed." I prefer this latter framing/emphasis because:

  • An extra year in which the AI safety field has access to pretty advanced AI capabilities seems much more valuable for the field's progress (say, maybe 10x) than an extra year with current AI capabilities, since the former type of year would give the field much better opportunities to test safety ideas and more clarity about what types of AI systems are relevant.
    • One counterargument is that AI safety will likely be bottlenecked by serial time, because discarding bad theories and formulating better ones takes serial time, making extra years early on very useful. But my very spotty understanding of the history of science suggests that it doesn't just take time for bad theories to get replaced by better ones--it takes time along with the accumulation of lots of empirical evidence. This supports the view that late-stage time is much more valuable than early-stage time.
  • Slowing down in the future seems much more tractable than slowing down now, since many critical actors seem much more likely to support slowing down if and when there are clear, salient demonstrations of its importance (i.e. warning shots).
  • Given that slowing down later is much more valuable and much more tractable than just slowing down now, it seems much better to focus on slowing down later. But the broader framing of "slow down" doesn't really suggest that focus, and maybe it even discourages it.
Comment by Mau (Mauricio) on Let’s think about slowing down AI · 2022-12-22T21:41:59.659Z · LW · GW

Work to spread good knowledge regarding AGI risk / doom stuff among politicians, the general public, etc. [...] Emphasizing “there is a big problem, and more safety research is desperately needed” seems good and is I think uncontroversial.

Nitpick: My impression is that at least some versions of this outreach are very controversial in the community, as suggested by e.g. the lack of mass advocacy efforts. [Edit: "lack of" was an overstatement. But these are still much smaller than they could be.]

Comment by Mau (Mauricio) on Predicting GPU performance · 2022-12-16T02:07:03.807Z · LW · GW

It does, thanks! (I had interpreted the claim in the paper as comparing e.g. TPUs to CPUs, since the quote mentions CPUs as the baseline.)

Comment by Mau (Mauricio) on Predicting GPU performance · 2022-12-15T03:00:21.956Z · LW · GW

Thanks! To make sure I'm following, does optimization help just by improving utilization?

Comment by Mau (Mauricio) on Predicting GPU performance · 2022-12-14T23:46:33.458Z · LW · GW

Sorry, I'm a bit confused. I'm interpreting the 1st and 3rd paragraphs of your response as expressing opposite opinions about the claimed efficiency gains (uncertainty and confidence, respectively), so I think I'm probably misinterpreting part of your response?

Comment by Mau (Mauricio) on Predicting GPU performance · 2022-12-14T21:54:25.818Z · LW · GW

This is helpful for something I've been working on - thanks!

I was initially confused about how these results could fit with claims from this paper on AI chips, which emphasizes the importance of factors other than transistor density for AI-specialized chips' performance. But on second thought, the claims seem compatible:

  • The paper argues that increases in transistor density have (recently) been slow enough for investment in specialized chip design to be practical. But that's compatible with increases in transistor density still being the main driver of performance improvements (since a proportionally small boost that lasts several years could still make specialization profitable).
  • The paper claims that "AI[-specialized] chips are tens or even thousands of times faster and more efficient than CPUs for training and inference of AI algorithms." But the graph in this post shows less than thousands of times improvements since 2006. These are compatible if the remaining efficiency gains of AI-specialized chips came before 2006, which is plausible since GPUs were first released in 1999 (or maybe the "thousands of times" suggestion was just too high). A rough arithmetic sketch of this reconciliation is below.
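For concreteness, here's a minimal arithmetic sketch of that reconciliation, using hypothetical round numbers (neither figure is pinned down by the post or the paper):

```python
# Hypothetical numbers for illustration only.
claimed_total_gain_vs_cpu = 1000  # one reading of "tens or even thousands of times"
observed_gain_since_2006 = 300    # a stand-in for what the post's graph might show
required_pre_2006_gain = claimed_total_gain_vs_cpu / observed_gain_since_2006
print(f"Pre-2006 GPU-vs-CPU advantage needed: ~{required_pre_2006_gain:.1f}x")
# => ~3.3x, which seems plausible for 1999-2006 GPUs vs. CPUs on parallel workloads.
```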
Comment by Mau (Mauricio) on The Alignment Community Is Culturally Broken · 2022-11-14T20:35:27.215Z · LW · GW

One specific concern people could have with this thoughtspace is that it's hard to square with the knowledge that an AI PhD [edit: or rather, AI/ML expertise more broadly] provides. I took this point to be strongly suggested by the author's claims that "experts knowledgeable in the relevant subject matters that would actually lead to doom find this laughable" and that someone who spent their early years "reading/studying deep learning, systems neuroscience, etc." would not find risk arguments compelling. That's directly refuted by the surveys (though I agree that some other concerns about this thoughtspace aren't).

(However, it looks like the author was making a different point to what I first understood.)

Comment by Mau (Mauricio) on The Alignment Community Is Culturally Broken · 2022-11-14T08:16:07.910Z · LW · GW

experts knowledgeable in the relevant subject matters that would actually lead to doom find this laughable

This seems overstated; plenty of AI/ML experts are concerned. [1] [2] [3] [4] [5] [6] [7] [8] [9]

Quoting from [1], a survey of researchers who published at top ML conferences:

The median respondent’s probability of x-risk from humans failing to control AI was 10%

Admittedly, that's a far cry from "the light cone is about to get ripped to shreds," but it's also pretty far from finding those concerns laughable. [Edited to add: another recent survey puts the median estimate of extremely bad, extinction-level outcomes at 2%, lower but arguably still not laughable.]

Comment by Mau (Mauricio) on Instead of technical research, more people should focus on buying time · 2022-11-07T05:45:22.021Z · LW · GW

Yep! Here's a compilation.

If someone's been following along with popular LW posts on alignment and is new to governance, I'd expect them to find the "core readings" in "weeks" 4-6 most relevant.

Comment by Mau (Mauricio) on Instead of technical research, more people should focus on buying time · 2022-11-07T05:11:07.044Z · LW · GW

I'm sympathetic under some interpretations of "a ton of time," but I think it's still worth people's time to spend at least ~10 hours of reading and ~10 hours of conversation getting caught up with AI governance/strategy thinking, if they want to contribute.

Arguments for this:

  • Some basic ideas/knowledge that the field is familiar with (e.g. on the semiconductor supply chain, antitrust law, immigration, US-China relations, how relevant governments and AI labs work, the history of international cooperation in the 20th century) seem really helpful for thinking about this stuff productively.
  • First-hand knowledge of how relevant governments and labs work is hard/costly to get on one's own.
  • Lack of shared context makes collaboration with other researchers and funders more costly.
  • Even if the field doesn't know that much and lots of papers are more advocacy pieces, people can learn from what the field does know and read the better content.
Comment by Mau (Mauricio) on Instead of technical research, more people should focus on buying time · 2022-11-06T00:22:21.410Z · LW · GW

more researchers should backchain from “how do I make AGI timelines longer

Like you mention, "end time" seems (much) more valuable than earlier time. But the framing here, as well as the broader framing of "buying time," collapses that distinction (by just using "time" as the metric). So I'd suggest more heavily emphasizing buying end time.

One potential response is: it doesn't matter; both framings suggest the same interventions. But that seems wrong. For example, slowing down AI progress now seems like it'd mostly buy "pre-end time" (potentially by burning "end time," if the way we're slowing down is by safety-conscious labs burning their leads), while setting up standards/regulations/coordination for mitigating racing/unilateralist dynamics at end time buys us end time.

Comment by Mau (Mauricio) on Instead of technical research, more people should focus on buying time · 2022-11-06T00:00:39.917Z · LW · GW

Thanks for posting this!

There's a lot here I agree with (which might not be a surprise). Since the example interventions are all/mostly technical research or outreach to technical researchers, I'd add that a bunch of more "governance-flavored" interventions would also potentially contribute.

  • One of the main things that might keep AI companies from coordinating on safety is that some forms of coordination--especially more ambitious coordination--could violate antitrust law.
    • One thing that could help would be updating antitrust law, or how it's enforced, so that it doesn't do a terrible job of balancing anticompetition concerns against safety concerns.
    • Another thing that could help would be a standard-setting organization, since coordination on standards is often more accepted when it's done in such a context.
      • [Added] Can standards be helpful for safety before we have reliable safety methods? I think so; until then, we could imagine standards on things like what types of training runs not to run, or when to stop a training run.
  • If some leading AI lab (say, Google Brain) shows itself to be unreceptive to safety outreach and coordination efforts, and if the lead that more safety-conscious labs have over this lab is insufficient, then government action might be necessary to make sure that safety-conscious efforts have the time they need.

To be more direct, I'm nervous that people will (continue to) overlook a promising class of time-buying interventions (government-related ones), largely because they've learned about government from information sources (e.g. popular news) that are too coarse and unrepresentative to make promising government interventions salient. Some people respond that getting governments to do useful things is clearly too intractable. But I don't see how they can justifiably be so confident if they haven't taken the time to form good models of government. At minimum, the US government seems clearly powerful enough (~$1.5 trillion discretionary annual budget, allied with nearly all developed countries, thousands of nukes, biggest military, experienced global spy network, hosts ODA+, etc.) for its interventions to be worth serious consideration.

Comment by Mau (Mauricio) on Warning Shots Probably Wouldn't Change The Picture Much · 2022-10-08T22:47:03.813Z · LW · GW

I agree with a lot of that. Still, if

nuclear non proliferation [to the extent that it has been achieved] is probably harder than a ban on gain-of-function

that's sufficient to prove Daniel's original criticism of the OP--that governments can [probably] fail at something yet succeed at some harder thing.

(And on a tangent, I'd guess a salient warning shot--which the OP was conditioning on--would give the US + China strong incentives to discourage risky AI stuff.)

Comment by Mau (Mauricio) on Warning Shots Probably Wouldn't Change The Picture Much · 2022-10-07T00:01:24.951Z · LW · GW

I agree it's some evidence, but that's a much weaker claim than "probably policy can't deliver the wins we need."

Comment by Mau (Mauricio) on Warning Shots Probably Wouldn't Change The Picture Much · 2022-10-06T23:23:11.874Z · LW · GW

An earlier comment seems to make a good case that there's already more community investment in AI policy, and another earlier thread points out that the content in brackets doesn't seem to involve a good model of policy tractability.

Comment by Mau (Mauricio) on Warning Shots Probably Wouldn't Change The Picture Much · 2022-10-06T23:16:07.352Z · LW · GW
  1. Perhaps the sorts of government interventions needed to make AI go well are not all that large, and not that precise.

I confess I don't really understand this view.

Specifically for the sub-claim that "literal global cooperation" is unnecessary, I think a common element of people's views is that: the semiconductor supply chain has chokepoints in a few countries, so action from just these few governments can shape what is done with AI everywhere (in a certain range of time).

Comment by Mau (Mauricio) on Warning Shots Probably Wouldn't Change The Picture Much · 2022-10-06T23:02:43.666Z · LW · GW

I'd guess the very slow rate of nuclear proliferation has been much harder to achieve than banning gain-of-function research would be, since, in the absence of intervention, incentives to get nukes would have been much bigger than incentives to do gain-of-function research.

Also, on top of the taboo against chemical weapons, there was the verified destruction of most chemical weapons globally.

Comment by Mau (Mauricio) on Slowing down AI progress is an underexplored alignment strategy · 2022-07-13T05:37:37.586Z · LW · GW

Thanks for the post - I think there are some ways heavy regulation of AI could be very counterproductive or ineffective for safety:

  • If AI progress slows down enough in countries where safety-concerned people are especially influential, then these countries (and their companies) will fall behind in international AI development. This would eliminate much/most of safety-concerned people's opportunities for impacting how AI goes.
  • If China "catches up" to the US in AI (due to US over-regulation) when AI is looking increasingly economically and militarily important, that could easily motivate US lawmakers to hit the gas on AI (which would at least undo some of the earlier slowing down of AI, and would at worst spark an international race to the bottom on AI).

Also, you mention,

The community strategy (insofar as there even is one) is to bet everything on getting a couple of technical alignment folks onto the team at top research labs in the hopes that they will miraculously solve alignment before the mad scientists in the office next door turn on the doomsday machine.

From conversation, my understanding is that some governance/policy folks fortunately have (somewhat) more promising ideas than that. (This doesn't show up much on this site, partly because: people on here tend to not be as interested in governance, these professionals tend to be busy, the ideas are fairly rough, and getting the optics right can be more important for governance ideas.) I hear there's some work aimed at posting about some of these ideas - until then, chatting with people (e.g., by reaching out to people at conferences) might be the best way to learn about them.

Comment by Mau (Mauricio) on Convince me that humanity *isn’t* doomed by AGI · 2022-04-18T05:29:53.449Z · LW · GW

My problem is that most of the scenarios I see being discussed are dependent on a long chain of assumptions being true and they often seem to ignore that many things could go wrong, invalidating the full thing: you don't need to be wrong in all those steps, one of them is just enough.

This feels a bit like it might be shifting the goalposts; it seemed like your previous comment was criticizing a specific argumentative step ("reasons not to believe in doom: [...] Orthogonality of intelligence and agency"), rather than just pointing out that there were many argumentative steps.

Anyway, addressing the point about there being many argumentative steps: I partially agree, although I'm not very convinced since there seems to be significant redundancy in arguments for AI risk (e.g., multiple fuzzy heuristics suggesting there's risk, multiple reasons to expect misalignment, multiple actors who could be careless, multiple ways misaligned AI could gain influence under multiple scenarios).

The different AGIs might find it hard/impossible to coordinate. The different AGIs might even be in conflict with one another

Maybe, although here are six reasons to think otherwise:

  • There are reasons to think they will have an easy time coordinating:
    • (1) As mentioned, a very plausible scenario is that many of these AI systems will be copies of some specific model. To the extent that the model has goals, all these copies of any single model would have the same goal. This seems like it would make coordination much easier.
    • (2) Computer programs may be able to give credible signals through open-source code, facilitating cooperation.
    • (3) Focal points of coordination may come up and facilitate coordination, as they often do with humans.
    • (4) If they are initially in conflict, this will create competitive selection pressures for well-coordinated groups (much like how coordinated human states arise from anarchy).
    • (5) They may coordinate due to decision theoretic considerations.
    • (Humans may be able to interfere with AI coordination earlier on, but this gets harder as AI systems' numbers and/or capabilities grow.)
  • (6) Regardless, they might not need to (widely) coordinate; overwhelming numbers of uncoordinated actors may be risky enough (especially if there is some local coordination, which seems likely for the above reasons).
Comment by Mau (Mauricio) on Convince me that humanity *isn’t* doomed by AGI · 2022-04-16T22:47:43.316Z · LW · GW
  1. Orthogonality of intelligence and agency. I can envision a machine with high intelligence and zero agency, I haven't seen any convincing argument yet of why both things must necessarily go together

Hm, what do you make of the following argument? Even assuming (contestably) that intelligence and agency don't in principle need to go together, in practice they'll go together because there will appear to be strong economic or geopolitical incentives to build systems that are both highly intelligent and highly agentic (e.g., AI systems that can run teams). (And even if some AI developers are cautious enough to not build such systems, less cautious AI developers will, in the absence of strong coordination.)

Also, (2) and (3) seem like reasons why a single AI system may be unable to disempower humanity. Even if we accept that, how relevant will these points be when there is a huge number of highly capable AI systems (which may happen because of the ease and economic benefits of replicating highly capable AI systems)? Their numbers might make up for their limited knowledge and limited plans.

(Admittedly, in these scenarios, people might have significantly more time to figure things out.)

Or as Paul Christiano puts it (potentially in making a different point):

At the same time, it becomes increasingly difficult for humans to directly control what happens in a world where nearly all productive work, including management, investment, and the design of new machines, is being done by machines. We can imagine a scenario in which humans continue to make all goal-oriented decisions about the management of PepsiCo but are assisted by an increasingly elaborate network of prosthetics and assistants. But I think human management becomes increasingly implausible as the size of the world grows (imagine a minority of 7 billion humans trying to manage the equivalent of 7 trillion knowledge workers; then imagine 70 trillion), and as machines’ abilities to plan and decide outstrip humans’ by a widening margin. In this world, the AIs that are left to do their own thing outnumber and outperform those which remain under close management of humans.

Comment by Mauricio on [deleted post] 2022-02-18T05:29:15.569Z

So we need a way to have alignment deployed throughout the algorithmic world before anyone develops AGI. To do this, we'll start by offering alignment as a service for more limited AIs.

I'm tentatively fairly excited about some version of this, so I'll suggest some tweaks that can hopefully be helpful for your success (or for the brainstorming of anyone else who's thinking about doing something similar in the future).

We will refine and develop this deployment plan, depending on research results, commercial opportunities, feedback, and suggestions.

I suspect there'd be much better commercial/scaling opportunities for a somewhat similar org that offered a more comprehensive, high-quality package of "trustworthy AI services"--e.g., addressing bias, privacy issues, and other more mainstream concerns along with safety/alignment concerns. Then there'd be less of a need to convince companies about paying for some new service--you would mostly just need to convince them that you're the best provider of services that they're already interested in. (Cf. ethical AI consulting companies that already exist.)

(One could ask: But wouldn't the extra price be the same, whether you're offering alignment in a package or separately? Not necessarily--IP concerns and transaction costs incentivize AI companies to reduce the number of third parties they share their algorithms with.)

As an additional benefit, a more comprehensive package of "trustworthy AI services" would be directly competing for consumers with companies like the AI consulting company mentioned above. This might pressure those companies to start offering safety/alignment services--a mechanism for broadening adoption that isn't available to an org that only provides alignment services.

[From the website] We are hiring AI safety researchers, ML engineers and other staff.

Related to the earlier point, given that commercial opportunities are a big potential bottleneck (in other words, given that selling limited alignment services might be as much of a communications and persuasion challenge as it is a technical challenge), my intuition would be to also put significant emphasis on hiring people who will kill it at persuasion: people who are closely familiar with the market and regulatory incentives faced by relevant companies, people with sales and marketing experience, people with otherwise strong communications skills, etc. (in addition to the researchers and engineers).

Comment by Mau (Mauricio) on What failure looks like · 2021-12-26T08:06:46.242Z · LW · GW

A more recent clarification from Paul Christiano, on how Part 1 might get locked in / how it relates to concerns about misaligned, power-seeking AI:

I also consider catastrophic versions of "you get what you measure" to be a subset/framing/whatever of "misaligned power-seeking." I think misaligned power-seeking is the main way the problem is locked in.

Comment by Mau (Mauricio) on My Overview of the AI Alignment Landscape: Threat Models · 2021-12-26T06:52:27.753Z · LW · GW

I'm still pretty confused by "You get what you measure" being framed as a distinct threat model from power-seeking AI (rather than as another sub-threat model). I'll try to address two defenses of that framing which I interpret this post as suggesting (in the context of this earlier comment on the overview post). Broadly, I'll be arguing that power-seeking AI is necessary for "you get what you measure" issues to pose existential threats, so "you get what you measure" concerns are best thought of as a sub-threat model of power-seeking AI.

(Edit: An aspect of "you get what you measure" concerns--the emphasis on something like "sufficiently strong optimization for some goal is very bad for different goals"--is a tweaked framing of power-seeking AI risk in general, rather than a subset.)

Lock-in: Once we’ve noticed problems, how difficult will they be to fix, and how much resistance will there be? For example, despite the clear harms of CO2 emissions, fossil fuels are such an indispensable part of the economy that it’s incredibly hard to get rid of them. A similar thing could happen if AI systems become an indispensable part of the economy, which seems pretty plausible given how incredibly useful human-level AI would be. As another example, imagine how hard it would be to ban social media, if we as a society decided that this was net bad for the world.

Unless I'm missing something, this is just an argument for why AI might get locked in--not an argument for why misaligned AI might get locked in. AI becoming an indispensable part of the economy isn't a long-term problem if people remain capable of identifying and fixing problems with the AI. So we still need an additional lock-in mechanism (e.g. the initially deployed, misaligned AI being power-seeking) to have trouble. (If we're wondering how hard it will be to fix/improve non-power-seeking AI after it's been deployed, the difficulty of banning social media doesn't seem like a great analogy; a more relevant analogy would be the difficulty of fixing/improving social media after it's been deployed. Empirically, this doesn't seem that hard. For example, YouTube's recommendation algorithm started as a click-maximizer, and YouTube has already modified it to learn from human feedback.)

See Sam Clarke’s excellent post for more discussion of examples of lock-in.

I don't think Sam Clarke's post (which I'm also a fan of) proposes any lock-in mechanisms that (a) would plausibly cause existential catastrophe from misaligned AI and (b) do not depend on AI being power-seeking. Clarke proposes five mechanisms by which Part 1 of "What Failure Looks Like" could get locked in -- addressing each of these in turn (in the context of his original post):

  • (1) short-term incentives and collective action -- arguably fails condition (a) or fails condition (b); if we don't assume AI will be power-seeking, then I see no reason why these difficulties would get much worse in hundreds of years than they are now, i.e. no reason why this on its own is a lock-in mechanism.
  • (2) regulatory capture -- the worry here is that the companies controlling AI might have and permanently act on bad values; this arguably fails condition (a), because if we're mainly worried about AI developers being bad, then focusing on intent alignment doesn't make that much sense.
  • (3) genuine ambiguity -- arguably fails condition (a) or fails condition (b); if we don't assume AI will be power-seeking, then I see no reason why these difficulties would get much worse in hundreds of years than they are now, i.e. no reason why this on its own is a lock-in mechanism.
  • (4) dependency and deskilling -- addressed above
  • (5) [AI] opposition to [humanity] taking back influence -- clearly fails condition (b)

So I think there remains no plausible alignment-relevant threat model for "You get what you measure" that doesn't fall under "power-seeking AI."

Comment by Mau (Mauricio) on Zvi’s Thoughts on the Survival and Flourishing Fund (SFF) · 2021-12-15T07:20:22.183Z · LW · GW

In my model, one should be deeply skeptical whenever the answer to ‘what would do the most good?’ is ‘get people like me more money and/or access to power.’ One should be only somewhat less skeptical when the answer is ‘make there be more people like me’ or ‘build and fund a community of people like me.’ [...] I wish I had a better way to communicate what I find so deeply wrong here

I'd be very curious to hear more fleshed-out arguments here, if you or others think of them. My best guess about what you have in mind is that it's a combination of the following (lumping all the interventions mentioned in the quoted excerpt into "power-seeking"):

  1. People have personal incentives and tribalistic motivations to pursue power for their in-group, so we're heavily biased toward overestimating its altruistic value.
  2. Seeking power occupies resources and attention that could be spent figuring out how to solve problems, and figuring out how to solve problems is very valuable.
  3. Figuring out how to solve problems isn't just very valuable. It's necessary for things to go well, so mainly doing power-seeking makes it way too easy for us to get the mistaken impression that we're making progress and things are going well, while a crucial input into things going well (knowing what to do with power) remains absent.
  4. Power-seeking attracts leeches (which wastes resources and dilutes relevant fields).
  5. Power-seeking pushes people's attention away from object-level discussion and learning. (This is different from (3) in that (3) is about how power-seeking impacts a specific belief, while this point is about attention.)
  6. Power-seeking makes a culture increasingly value power for its own sake, which is bad for the usual reasons that value drift is bad.

If that's it (is it?), then I'm more sympathetic than I was before writing out the above, but I'm still skeptical:

  • Re: 1: Speaking of object-level arguments, object-level arguments for the usefulness of power and field growth seem very compelling (and simple enough to significantly reduce room for bias).
  • 4 mainly seems like a problem with poorly executed power-seeking (although maybe that's hard to avoid?).
  • 2-5 and 6 seem to be horrific problems mostly just if power-seeking is the main activity of a community, rather than one of several activities.

(One view from which power-seeking seems much less valuable is if we assume that, on the margin, this kind of power isn't all that useful for solving key problems. But if that were the crux, I'd have expected the original criticism to emphasize the (limited) benefits of power-seeking, rather than its costs.)

Comment by Mau (Mauricio) on Self-Integrity and the Drowning Child · 2021-10-26T11:46:30.843Z · LW · GW

I agree with and appreciate the broad point. I'll pick on one detail because I think it matters.

this whole parable of the drowning child, was set to crush down the selfish part of you, to make it look like you would be invalid and shameful and harmful-to-others if the selfish part of you won [...]

It is a parable calculated to set at odds two pieces of yourself... arranging for one of them to hammer down the other in a way that would leave it feeling small and injured and unable to speak in its own defense.

This seems uncharitable? Singer's thought experiment may have had the above effects, but my impression's been that it was calculated largely to help people recognize our impartially altruistic parts—parts of us that in practice seem to get hammered down, obliterated, and forgotten far more often than our self-focused parts (consider e.g. how many people do approximately nothing for strangers vs. how many people do approximately nothing for themselves).

So part of me worries that "the drowning child thought experiment is a calculated assault on your personal integrity!" is not just mistaken but yet another hammer by which people will kick down their own altruistic parts—the parts of us that protect those who are small and injured and unable to speak in their own defense.