Distinguishing AI takeover scenarios 2021-09-08T16:19:40.602Z
Survey on AI existential risk scenarios 2021-06-08T17:12:42.026Z
What are the biggest current impacts of AI? 2021-03-07T21:44:10.633Z
Clarifying “What failure looks like” (part 1) 2020-09-20T20:40:48.295Z


Comment by Sam Clarke on An Increasingly Manipulative Newsfeed · 2021-09-13T10:39:56.662Z · LW · GW

(Note: this post is an extended version of this post about stories of continuous deception. If you are already familiar with treacherous turn vs. sordid stumble you can skip the first part.)

FYI, broken link in this sentence.

Comment by Sam Clarke on Persuasion Tools: AI takeover without AGI or agency? · 2021-08-12T16:46:45.021Z · LW · GW

I found this post helpful and interesting, and refer to it often! FWIW I think that powerful persuasion tools could have bad effects on the memetic ecosystem even if they don't shift the balance of power to a world with fewer, more powerful ideologies. In particular, the number of ideologies could remain roughly constant, but each could get more 'sticky'. This would make reasonable debate and truth-seeking harder, as well as reducing the number of trusted and credible multipartisan sources. This seems like an existential risk factor, e.g. because it will make coordination harder. (Analogy: vaccine and mask hesitancy during Covid was partly due to insufficient trust in public health advice.) Or, more speculatively, I could also imagine an extreme version of sticky, splintered epistemic bubbles leading to moral stagnation/value lock-in.

Minor question on framing: I'm wondering why you chose to call this post "AI takeover without AGI or agency?" given that the effects of powerful persuasion tools you talk about aren't what (I normally think of as) "AI takeover"? (Rather, if I've understood correctly, they are "persuasion tools as existential risk factor", or "persuasion tools as mechanism for power concentration among humans".)

Somewhat related: I think there could be a case made for takeover by goal-directed but narrow AI, though I haven't really seen it made. But I can't see a case for takeover by non-goal-directed AI, since why would AI systems without goals want to take over? I'd be interested if you have any thoughts on those two things.

Comment by Sam Clarke on How to Sleep Better · 2021-07-19T11:04:51.656Z · LW · GW

only sleep when I'm tired

Sounds cool, I'm tempted to try this out, but I'm wondering how this jibes with the common wisdom that going to bed at the same time every night is important? And "No screens an hour before bed" - how do you know what "an hour before bed" is if you just go to bed when tired?

Comment by Sam Clarke on How to Sleep Better · 2021-07-19T11:01:50.994Z · LW · GW

I feel similarly, and still struggle with turning off my brain. Has anything worked particularly well for you?

Comment by Sam Clarke on How to Sleep Better · 2021-07-19T10:59:20.405Z · LW · GW

I'm curious how you actually use the information from your Oura ring? To help measure the effectiveness of sleep interventions? As one input for deciding how to spend your day? As a motivator to sleep better? Something else?

Comment by Sam Clarke on Some thoughts on risks from narrow, non-agentic AI · 2021-07-01T07:52:11.445Z · LW · GW

Makes sense, thanks!

Comment by Sam Clarke on Some thoughts on risks from narrow, non-agentic AI · 2021-06-30T10:11:35.230Z · LW · GW

being trained on "follow instructions"

What does this actually mean, in terms of the details of how you'd train a model to do this?

Comment by Sam Clarke on Survey on AI existential risk scenarios · 2021-06-14T08:53:25.944Z · LW · GW

Thanks for the reply - a couple of responses:

it doesn't seem useful to get a feeling for "how far off of ideal are we likely to be" when that is composed of: 1. What is the possible range of AI functionality (as constrained by physics)? - ie what can we do?

No, these cases aren't included. The definition is: "an existential catastrophe that could have been avoided had humanity's development, deployment or governance of AI been otherwise". Physics cannot be changed by humanity's development/deployment/governance decisions. (I agree that cases 2 and 3 are included).

Knowing that experts think we have a (say) 10% chance of hitting the ideal window says nothing about what an interested party should do to improve those chances.

That's correct. The survey wasn't intended to understand respondents' views on interventions. It was only intended to understand: if something goes wrong, what do respondents think that was? Someone could run another survey that asks about interventions (in fact, this other recent survey does that). For the reasons given in the Motivation section of this post, we chose to limit our scope to threat models, rather than interventions.

Comment by Sam Clarke on Survey on AI existential risk scenarios · 2021-06-11T13:03:48.713Z · LW · GW

Thanks for pointing this out. We did intend for cases like this to be included, but I agree that it's unclear if respondents interpreted it that way. We should have clarified this in the survey instructions.

Comment by Sam Clarke on Survey on AI existential risk scenarios · 2021-06-11T13:00:36.743Z · LW · GW

Is one question combining the risk of "too much" AI use and "too little" AI use?

Yes, it is. Combining these cases seems reasonable to me, though we definitely should have clarified this in the survey instructions. They're both cases where humanity could have avoided an existential catastrophe by making different decisions with respect to AI.

Comment by Sam Clarke on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-06-07T15:31:54.348Z · LW · GW

Thanks a lot for this post, I found it extremely helpful and expect I will refer to it a lot in thinking through different threat models.

I'd be curious to hear how you think the Production Web stories differ from part 1 of Paul's "What failure looks like".

To me, the underlying threat model seems to be basically the same: we deploy AI systems with objectives that look good in the short-run, but when those systems become as capable as or more capable than humans, their objectives don't generalise "well" (i.e. in ways desirable by human standards), because they're optimising for proxies (namely, a cluster of objectives that could loosely be described as "maximise production" within their industry sector) that eventually come apart from what we actually want ("maximising production" eventually means using up resources critical to human survival but non-critical to machines).

From reading some of the comment threads between you and Paul, it seems like you disagree about where, on the margin, resources should be spent (improving the cooperative capabilities of AI systems and humans vs improving single-single intent alignment) - but you agree on this particular underlying threat model?

It also seems like you emphasise different aspects of these threat models: you emphasise the role of competitive pressures more (but they're also implicit in Paul's story), and Paul emphasises failures of intent alignment more (but they're also present in your story) - though this is consistent with having the same underlying threat model?

(Of course, both you and Paul also have other threat models, e.g. you have Flash War, Paul has part 2 of "What failure looks like", and also Another (outer) alignment failure story, which seems to be basically a more nuanced version of part 1 of "What failure looks like". Here, I'm curious specifically about the two threat models I've picked out.)

(I could have lots of this totally wrong, and would appreciate being corrected if so)

Comment by Sam Clarke on What are some real life Inadequate Equilibria? · 2021-05-25T14:00:04.225Z · LW · GW

I'm a bit confused about the edges of the inadequate equilibrium concept you're interested in.

In particular, do simple cases of negative externalities count? E.g. the econ 101 example of "factory pollutes river" seems like an instance of (1) and (2) in Eliezer's taxonomy, depending on whether you're thinking of the "decision-maker" as (1) the factory owner (who would lose out personally) or (2) the government (who can't learn the information they need because the pollution is intentionally hidden). But this isn't what I'd typically think of as a bad Nash equilibrium, because (let's suppose) the factory owners wouldn't actually be better off by "cooperating".

Comment by Sam Clarke on What will 2040 probably look like assuming no singularity? · 2021-05-21T11:32:04.608Z · LW · GW

Just an outside view that over the last decades, a number of groups who previously had to suppress their identities/were vilified are now more accepted (e.g., LGBTQ+, feminists, vegans), and I expect this trend to continue.

I'm curious if you expect this trend to change, or maybe we're talking about slightly different things here?

Comment by Sam Clarke on What will 2040 probably look like assuming no singularity? · 2021-05-19T16:32:36.401Z · LW · GW

I had something like "everybody who has to strongly hide part of their identity when living in cities" in mind

Comment by Sam Clarke on Less Realistic Tales of Doom · 2021-05-19T16:27:17.699Z · LW · GW

Thanks for writing this! Here's another, that I'm posting specifically because it's confusing to me.

Value erosion

Takeoff was slow and lots of actors developed AGI around the same time. Intent alignment turned out relatively easy and so lots of actors with different values had access to AGIs that were trying to help them. Our ability to solve coordination problems remained at ~its current level. Nation states, or something like them, still exist, and there is still lots of economic competition between and within them. Sometimes there is military conflict, which destroys some nation states, but it never destroys the world.

The need to compete in these ways limits the extent to which each actor is able to spend their resources on things they actually want (because they have to spend a cut on competing, economically or militarily). Moreover, this cut is ever-increasing, since the actors who don't increase their competitiveness get wiped out. Different groups start spreading to the stars. Human descendants eventually colonise the galaxy, but have to spend ever closer to 100% of their energy on their militaries and producing economically valuable stuff. Those who don't do this get outcompeted (i.e. destroyed in conflict or dominated in the market) and so lose most of their ability to get what they want.

Moral: even if we solve intent alignment, avoid catastrophic war or misuse of AI by bad actors, and other acute x-risks, the future could (would probably?) still be much worse than it could be, if we don't also coordinate to stop the value race to the bottom.
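The "ever-increasing cut" dynamic in this story can be sketched as a toy model. This is only an illustration under made-up assumptions: the function name, the starting cut of 10%, and the geometric growth factor are all hypothetical and not part of the story itself. It just shows how the share left over for "things actors actually want" shrinks towards zero if the competitive cut keeps ratcheting up.

```python
def simulate_value_erosion(initial_cut=0.1, growth=1.2, rounds=30):
    """Return the fraction of resources left for what actors actually want, per round.

    Hypothetical toy model: competitive pressure shrinks the non-competing share
    by a constant factor (`growth`) each round, so the cut spent on competing
    approaches (but never reaches) 100%.
    """
    cut = initial_cut
    remaining = []
    for _ in range(rounds):
        remaining.append(1.0 - cut)
        # Actors who don't raise their cut are assumed to be wiped out,
        # so the surviving actors' free share shrinks by a constant factor.
        cut = 1.0 - (1.0 - cut) / growth
    return remaining


if __name__ == "__main__":
    for round_index, free_share in enumerate(simulate_value_erosion()):
        print(f"round {round_index:2d}: {free_share:.4f} of resources spent on what actors want")
```

The interesting question the story raises is not captured by this sketch: whether anything (e.g. coordination) stops `growth` from staying above 1 indefinitely.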

Comment by Sam Clarke on What will 2040 probably look like assuming no singularity? · 2021-05-19T15:33:25.311Z · LW · GW

Epistemic effort: I thought about this for 20 minutes and dumped my ideas, before reading others' answers

  • The latest language models are assisting or doing a number of tasks across society in rich countries, e.g.
    • Helping lawyers search and summarise cases, suggest inferences, etc. but human lawyers still make calls at the end of the day
    • Similar for policymaking, consultancy, business strategising etc.
    • Lots of non-truth seeking journalism. All good investigative journalism is still done by humans.
    • Telemarketing and some customer service jobs
  • The latest deep RL models are assisting or doing a number of tasks across society in rich countries, e.g.
    • Lots of manufacturing
    • Almost all warehouse management
    • Most content filtering on social media
    • Financing decisions made by banks
  • Other predictions
    • it's much easier to communicate with anyone, anywhere, at higher bandwidth (probably thanks to really good VR and internet)
    • the way we consume information has changed a lot (probably also related to VR, and content selection algorithms getting really good)
    • the way we shop has changed a lot (probably again due to content selection algorithms. I'm imagining there being very little effort between having a material desire and spending money to have it fulfilled)
    • education hasn't really changed
    • international travel hasn't really changed
    • discrimination against groups that are marginalised in 2021 has reduced somewhat
    • nuclear energy is even more widespread and much safer
    • getting some psychotherapy or similar is really common (>80% of people)
Comment by Sam Clarke on What will 2040 probably look like assuming no singularity? · 2021-05-19T10:04:25.334Z · LW · GW

Thanks for this, really interesting!

Meta question: when you wrote this list, what did your thought process/strategies look like, and what do you think are the best ways of getting better at this kind of futurism?

More context:

  • One obvious answer to my second question is to get feedback - but the main bottleneck there is that these things won't happen for many years. Getting feedback from others (hence this post, I presume) is a partial remedy, but isn't clearly that helpful (e.g. if everyone's futurism capabilities are limited in the same ways). Maybe you've practised futurism over shorter time horizons a lot? Or you expect that people giving you feedback have?
  • After reading the first few entries, I spent 20 mins writing my own list before reading yours. Some questions/confusions that occurred:
    • All of my ideas ended up with epistemic status "OK, that might happen, but I'd need to spend at least a day researching this to be able to say anything like 'probably that'll happen by 2040'"
      • So I'm wondering if you did this/already had the background knowledge, or if I'm wrong that this is necessary
    • My strategies were (1) consider important domains (e.g. military, financial markets, policymaking), and what better LMs/deep RL/DL in general/other emerging tech will do to those domains; (2) consider obvious AI/emerging tech applications (e.g. customer service); (3) look back to 2000 and 1980 and extrapolate apparent trends.
      • How good are these strategies? What other strategies are there? How should they be weighed?
    • How much is my bottleneck to being better at this (a) better models for extrapolating trends in AI capabilities/other emerging tech vs (b) better models of particular domains vs (c) better models of the-world-in-general vs (d) something else?
Comment by Sam Clarke on Less Realistic Tales of Doom · 2021-05-12T14:52:54.638Z · LW · GW

Will MacAskill calls this the "actual alignment problem"

Wei Dai has written a lot about related concerns in posts like The Argument from Philosophical Difficulty

Comment by Sam Clarke on What Failure Looks Like: Distilling the Discussion · 2021-05-10T10:02:02.270Z · LW · GW

The AI systems in part I of the story are NOT "narrow" or "non-agentic"

  • There's no difference between the level of "narrowness" or "agency" of the AI systems between parts I and II of the story.
    • Many people (including Richard Ngo and myself) seem to have interpreted part I as arguing that there could be an AI takeover by AI systems that are non-agentic and/or narrow (i.e. are not agentic AGI). But this is not at all what Paul intended to argue.
    • Put another way, both parts I and II are instances of the "second species" concern/gorilla problem: that AI systems will gain control of humanity's future. (I think this is also identical to what people mean when they say "AI takeover".)
    • As far as I can tell, this isn't really a different kind of concern from the classic Bostrom-Yudkowsky case for AI x-risk. It's just a more nuanced picture of what goes wrong, that also makes failure look plausible in slow takeoff worlds.
  • Instead, the key difference between parts I and II of the story is the way that the models' objectives generalise.
    • In part II, it's the kind of generalisation typically called a "treacherous turn". The models learn the objective of "seeking influence". Early in training, the best way to do that is by "playing nice". The failure mode is that, once they become sufficiently capable, they no longer need to play nice and instead take control of humanity's future.
    • In part I, it's a different kind of generalisation, which has been much less discussed. The models learn some easily-measurable objective which isn't what humans actually want. In other words, the failure mode is that these models are trying to "produce high scores" instead of "help humans get what they want". You might think that using human feedback to specify the base objective will alleviate this problem (e.g. learn a reward model from human demonstrations or preferences about a hard-to-measure objective). But this doesn't obviously help: now, the failure mode is that the model learns the objective "do things that look to humans like you are achieving X" or "do things that the humans giving feedback about X will rate highly" (instead of "actually achieving X").
    • Notice that in both of these scenarios, the models are mesa-optimizers (i.e. the learned models are themselves optimizers), and failure ensues because the models' learned objectives generalise in the wrong way.

This was discussed in comments (on a separate post) by Richard Ngo and Paul Christiano. There's a lot more important discussion in that comment thread, which is summarised in this doc.
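A toy sketch of the part I failure mode described above, where optimising "what raters score highly" comes apart from "actually achieving X". Everything here is a made-up assumption for illustration (the function names, the fixed effort budget, and the 2x weight on appearance); it is not taken from Paul's post or the linked discussion.

```python
def rated_score(real_effort, appearance_effort):
    # Hypothetical rater: cannot tell real progress from good-looking output,
    # and effort spent on appearance buys twice as much rating per unit.
    return 1.0 * real_effort + 2.0 * appearance_effort


def true_value(real_effort, appearance_effort):
    # What humans actually want: only real progress on X counts.
    return 1.0 * real_effort


def optimise(objective, budget=10):
    """Pick the (real, appearance) split of a fixed effort budget that maximises `objective`."""
    splits = [(real, budget - real) for real in range(budget + 1)]
    return max(splits, key=lambda split: objective(*split))


proxy_choice = optimise(rated_score)    # policy that maximises the human-feedback proxy
intended_choice = optimise(true_value)  # policy that maximises what we actually want

if __name__ == "__main__":
    print("proxy-optimal split:", proxy_choice, "true value:", true_value(*proxy_choice))
    print("intended split:     ", intended_choice, "true value:", true_value(*intended_choice))
```

Under these assumptions the proxy-optimal policy puts all effort into appearance and achieves none of the true objective. Real cases are obviously less clean: the point is only that a harder optimiser exploits whatever gap exists between the rating and the goal.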

Comment by Sam Clarke on AMA: Paul Christiano, alignment researcher · 2021-04-30T12:00:39.204Z · LW · GW

Relatedly: if we manage to solve intent alignment (including making it competitive) but still have an existential catastrophe, what went wrong?

Comment by Sam Clarke on AMA: Paul Christiano, alignment researcher · 2021-04-30T07:59:48.417Z · LW · GW

Are there any research questions you're excited about people working on, for making AI go (existentially) well, that are not related to technical AI alignment or safety? If so, what? (I'm especially interested in AI strategy/governance questions)

Comment by Sam Clarke on What are the biggest current impacts of AI? · 2021-03-30T02:40:38.685Z · LW · GW

Thanks for your reply! This is interesting, though I'm a little confused by some parts of it.

Is the following a good summary of your main point? A main feature of your model of AI development/deployment is that there will be many shared components of AI systems, perhaps owned by 1-3 companies, that get licensed out to people who want to use them. This is because many problems you want to solve with AI systems can be decomposed into the same kinds of subproblems, so you can reuse components that solve those subproblems many times, and there's extra incentive to do this because designing those components is really hard. One implication of this is that progress will be faster than in a world where components are separately designed by different companies, because there is more training data per component, so components will be able to generalise more quickly.

I guess I'm confused whether there is so much overlap in subproblems that this is how things will go.

For example, you are going to want to classify and segment the images from a video feed into a state space of [identity, locations]. So does everyone else.

Hmm, it seems this is a subproblem that only a smallish proportion of companies will want to solve (e.g. companies providing police surveillance software, contact tracing software, etc.) - but really, not that many economically relevant tasks involve facial recognition. But maybe I'm missing something?

Similarly at a broader level, even if some of your algorithms have a different state space, the form of your algorithm is the same as everyone else.

Hmm, just because the abstract form of your algorithm is the same as everyone else's, this doesn't mean you can reuse the same algorithm... In some sense, it's trivial that the abstract form of all algorithms is the same: [inputs] -> [outputs]. But this doesn't mean the same algorithm can be reused to solve all the problems.

The fact that companies exist seems like good evidence that economically relevant problems can be decomposed into subproblems that individual agents with human-level intelligence can solve. But I'm pretty uncertain whether economically relevant problems can be decomposed into subproblems that more narrow systems can solve? Maybe there's an argument in your answer that I'm missing?

Comment by Sam Clarke on What are the biggest current impacts of AI? · 2021-03-30T02:28:38.898Z · LW · GW

The State of AI report is by far the best resource I've come across so far. Reading it led me to significantly update my models about how much ML systems are already being deployed. I was particularly surprised by military applications, e.g.

  • Lockheed have developed a drone that uses ML algorithms to "analyze enemy signals ... and compute effective countermeasures on the fly" for disrupting enemy "communications and radar networks without their realizing they’re being deceived"
  • Heron Systems developed a deep RL agent (called AlphaDogfight) that beat a human pilot 5-0 in a virtual dogfight. The US Defense Secretary announced a "real-world competition [between a human pilot and AI] involving full-scale tactical aircraft in 2024" (could just be hype)
  • the US Army Research Lab published a paper exploring how natural language commands could be used to improve performance of RL agents where there are sparse reward functions, using StarCraft II for their experiments
  • (and this is just the R&D that's public...)

Also, AI-based facial recognition is in active use by governments in 50% of the world.

Comment by Sam Clarke on Misalignment and misuse: whose values are manifest? · 2021-03-29T22:43:26.911Z · LW · GW

If we solve the problem normally thought of as "misalignment", it seems like this scenario would now go well.

This might be totally obvious, but I think it's worth pointing out that even if we "solve misalignment" - which I take to mean solving the technical problem of intent alignment - Bob could still choose to deploy a business strategising AI, in which case this failure mode would still occur. In fact, with a solution to intent alignment, it seems Bob would be more likely to do this, because his business strategising assistant will actually be trying to do what Bob wants (help his business succeed).

Comment by Sam Clarke on Clarifying “What failure looks like” (part 1) · 2021-01-11T01:49:50.748Z · LW · GW


Comment by Sam Clarke on Clarifying “What failure looks like” (part 1) · 2020-09-22T21:48:03.761Z · LW · GW

This was helpful to me, thanks. I agree this seems almost certainly to be the end state if AI systems are optimizing hard for simple, measurable objectives.

I'm still confused about what happens if AI systems are optimizing moderately for more complicated, measurable objectives (which better capture what humans actually want). Do you think the argument you made implies that we still eventually end up with a universe tiled with molecular smiley faces in this scenario?

Comment by Sam Clarke on Clarifying “What failure looks like” (part 1) · 2020-09-21T21:50:31.366Z · LW · GW

Thanks for your comment!

Are we sure that given the choice between "lower crime, lower costs and algorithmic bias" and "higher crime, higher costs and only human bias", and we have dictatorial power and can consider long-term effects, we would choose the latter on reflection?

Good point, thanks, I hadn't thought that sometimes it actually would make sense, on reflection, to choose an algorithm pursuing an easy-to-measure goal over humans pursuing incorrect goals. One thing I'd add is that if one did delve into the research to work this out for a particular case, it seems that an important (but hard to quantify) consideration would be the extent to which choosing the algorithm in this case makes it more likely that the use of that algorithm becomes entrenched, or it sets a precedent for the use of such algorithms. This feels important since these effects could plausibly make WFLL1-like things more likely in the longer run (when the harm of using misaligned systems is higher, due to the higher capabilities of those systems).

Note ML systems are way more interpretable than humans, so if they are replacing humans then this shouldn't make that much of a difference.

Good catch. I had the "AI systems replace entire institutions" scenario in mind, but agree that WFLL1 actually feels closer to "AI systems replace humans". I'm pretty confused about what this would look like though, and in particular, whether institutions would retain their interpretability if this happened. It seems plausible that the best way to "carve up" an institution into individual agents/services differs for humans and AI systems. E.g. education/learning is a big part of human institution design - you start at the bottom and work your way up as you learn skills and become trusted to act more autonomously - but this probably wouldn't be the case for institutions composed of AI systems, since the "CEO" could just copy their model parameters to the "intern" :). And if institutions composed of AI systems are quite different to institutions composed of humans, then they might not be very interpretable. Sure, you could assert that AI systems replace humans one-for-one, but if this is not the best design, then there may be competitive pressure to move away from this towards something less interpretable.