sam-clarke

Posts

Deference on AI timelines: survey results 2023-03-30T23:03:52.661Z

When reporting AI timelines, be clear who you're deferring to 2022-10-10T14:24:14.504Z

Sam Clarke's Shortform 2021-10-07T14:29:18.702Z

Collection of arguments to expect (outer and inner) alignment failure? 2021-09-28T16:55:28.385Z

Distinguishing AI takeover scenarios 2021-09-08T16:19:40.602Z

Survey on AI existential risk scenarios 2021-06-08T17:12:42.026Z

What are the biggest current impacts of AI? 2021-03-07T21:44:10.633Z

Clarifying “What failure looks like” 2020-09-20T20:40:48.295Z

Comments

Comment by Sam Clarke on When reporting AI timelines, be clear who you're deferring to · 2023-03-30T23:34:10.527Z · LW · GW

Finally posted: https://www.lesswrong.com/posts/qccxb3uzwFDsRuJuP/deference-on-ai-timelines-survey-results

Comment by Sam Clarke on Deference on AI timelines: survey results · 2023-03-30T23:33:35.291Z · LW · GW

Did people say why they deferred to these people?

No, only asked respondents to give names

I think another interesting question to correlate this would be "If you believe AI x-risk is a severely important issue, what year did you come to believe that?".

Agree, that would have been interesting to ask

Comment by Sam Clarke on Deference on AI timelines: survey results · 2023-03-30T23:15:21.348Z · LW · GW

Things that surprised me about the results

There’s more variety than I expected in the group of people who are deferred to
- I suspect that some of the people in the “everyone else” cluster defer to people in one of the other clusters—in which case there is more deference happening than these results suggest.
There were more “inside view” responses than I expected (maybe partly because people who have inside views were incentivised to respond, because it’s cool to say you have inside views or something). Might be interesting to think about whether it’s good (on the community level) for this number of people to have inside views on this topic.
Metaculus was given less weight than I expected (but as per Eli (see footnote 2), I think that’s a good thing).
Grace et al. AI expert surveys (1, 2) were deferred to less than I expected (but again, I think that’s good—many respondents to those surveys seem to have inconsistent views, see here for more details. And also there’s not much reason to expect AI experts to be excellent at forecasting things like AGI—it’s not their job, it’s probably not a skill they spend time training).
It seems that if you go around talking to lots of people about AI timelines, you could move the needle on community beliefs more than I expected.

Comment by Sam Clarke on When reporting AI timelines, be clear who you're deferring to · 2023-01-09T11:10:41.132Z · LW · GW

Sorry for the delay, will be out this month!

Comment by Sam Clarke on Will Capabilities Generalise More? · 2022-07-04T17:24:04.696Z · LW · GW

Just wanted to say this is the single most useful thing I've read for improving my understanding of alignment difficulty. Thanks for taking the time to write it!

Comment by Sam Clarke on Inner Alignment: Explain like I'm 12 Edition · 2022-04-13T16:32:18.155Z · LW · GW

Part of me thinks: I was trying to push on whether it has a world model or rather has just memorised loads of stuff on the internet and learned a bunch of heuristics for how to produce compelling internet-like text. For me, "world model" evokes some object that has a map-territory relationship with the world. It's not clear to me that GPT-3 has that.

Another part of me thinks: I'm confused. It seems just as reasonable to claim that it obviously has a world model that's just not very smart. I'm probably using bad concepts and should think about this more.

Comment by Sam Clarke on Inner Alignment: Explain like I'm 12 Edition · 2022-04-07T17:35:36.660Z · LW · GW

It looks good to me!

This is already true for GPT-3

Idk, maybe...?

Comment by Sam Clarke on Inner Alignment: Explain like I'm 12 Edition · 2022-04-05T16:25:15.542Z · LW · GW

Re the argument for "Why internalization might be difficult", I asked Evan Hubinger for his take on your rendition of the argument, and he thinks it's not right.

Rather, the argument that Risks from Learned Optimization makes that internalization would be difficult is that:

~all models with good performance on a diverse training set probably have to have a complex world model already, which likely includes a model of the base objective,
so having the base objective re-encoded in a separate part of the model that represents its objective is just a waste of space/complexity.

Especially since this post is now (rightly!) cited in several introductory AI risk syllabi, it might be worth correcting this, if you agree it's an error.

Comment by Sam Clarke on Inner Alignment: Explain like I'm 12 Edition · 2022-03-14T16:28:41.759Z · LW · GW

Edit: or do you just mean that even though you take the same steps, the two feel different because retreating =/= going further along the wall

Yeah, this — I now see what you were getting at!

Comment by Sam Clarke on Late 2021 MIRI Conversations: AMA / Discussion · 2022-03-02T14:33:33.857Z · LW · GW

One argument for alignment difficulty is that corrigibility is "anti-natural" in a certain sense. I've tried to write out my understanding of this argument, and would be curious if anyone could add or improve anything about it.

I'd be equally interested in any attempts at succinctly stating other arguments for/against alignment difficulty.

Comment by Sam Clarke on Inner Alignment: Explain like I'm 12 Edition · 2022-03-02T14:08:48.321Z · LW · GW

Instead of "always go left", how about "always go along one wall"?

Yeah, maybe better, though still doesn't quite capture the "backing up" part of the algorithm. Maybe "I explore all paths through the maze, taking left hand turns first, backing up if I reach a dead end"... that's a bit verbose though.

I don't think there is a difference.

Gotcha

Comment by Sam Clarke on Inner Alignment: Explain like I'm 12 Edition · 2022-03-02T08:49:07.105Z · LW · GW

Another small nitpick: the difference, if any, between proxy alignment and corrigibility isn't explained. The concept of proxy alignment is introduced in subsection "The concept" without first defining it.

Comment by Sam Clarke on You are probably underestimating how good self-love can be · 2022-03-01T14:33:55.055Z · LW · GW

I've since been told about Tasshin Fogleman's guided metta meditations, and have found their aesthetic to be much more up my alley than the others I've tried. I'd expect others who prefer a more rationalist-y aesthetic to feel similarly.

The one called 'Loving our parts' seems particularly good for self-love practice.

Comment by Sam Clarke on Inner Alignment: Explain like I'm 12 Edition · 2022-02-23T13:15:14.869Z · LW · GW

I still find the arguments that inner misalignment is plausible to rely on intuitions that feel quite uncertain to me (though I'm convinced that inner misalignment is possible).

So, I currently tend to prefer the following as the strongest "solid, specific reason to expect dangerous misalignment":

We don't yet have training setups that incentivise agents to do what their operators want, once they are sufficiently powerful.

Instead, the best we can do currently is naive reward modelling, and agents trained in this way are obviously incentivised to seize control of the memory cell where their reward is implemented (and eliminate anyone who might try to interfere with this) once they're sufficiently powerful - because that will allow them to get much higher scores, much more easily, than actually bringing about complicated changes to the world.

Meanwhile, AI capabilities are marching on scarily fast, so we probably don't have that much time to find a solution. And it's plausible that a solution will be very difficult because corrigibility seems "anti-natural" in a certain sense.

Curious what you think about this?

Comment by Sam Clarke on Comments on Carlsmith's “Is power-seeking AI an existential risk?” · 2022-02-23T13:12:16.604Z · LW · GW

Re: corrigibility being "anti-natural" in a certain sense - I think I have a better understanding of this now:

Eventually, we need to train an AI system capable enough to enable a pivotal act (in particular, actions that prevent the world from being destroyed by any other future AGI)
AI systems that are capable enough to enable a pivotal act must be (what Eliezer calls) a “consequentialist”: a system that “searches paths through time and selects high-scoring ones for output”
Training an aligned/corrigible/obedient consequentialist is something that Eliezer can’t currently see a way of doing, because it seems like a very unnatural sort of system. This makes him pessimistic about our current trajectory. The argument here seems kinda like a more subtle version of the instrumental convergence thesis. We want to train a system that:
- (1) searches for (and tries to bring about) paths through time that are robust enough to hit a narrow target (enabling a pivotal act and a great future in general)
- but also (2) is happy for certain human-initiated attempts to change that target (modify its goals, shut it down, etc.)
This seems unnatural and Eliezer can’t see how to do it currently.
An exacerbating factor is that even if top labs pursue alignment/corrigibility/obedience, they will either mistakenly believe they have achieved it (because it’s hard), or honestly panic about not having achieved it and halt, by which point a runner-up who doesn’t understand the importance of alignment/corrigibility/obedience deploys their system, which destroys the world.
(This is partly based on this summary)

Comment by Sam Clarke on Inner Alignment: Explain like I'm 12 Edition · 2022-02-23T12:52:38.105Z · LW · GW

Minor:

(If you don't know what depth-first search means: as far as mazes are concerned, it's simply the "always go left" rule.)

I was confused for a while, because my interpretation of "always go left" doesn't involve backing up (instead, when you get to a wall on the left, you just keep walking into it forever).
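To make the difference concrete, here is a minimal sketch of depth-first search on a toy maze, including the "backing up" step that a literal reading of "always go left" leaves out. The maze layout, the helper names (`find`, `solve`) and the fixed direction order (west first, standing in for "left first") are all assumptions made for illustration; none of them come from the post.

```python
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]

# Hypothetical toy maze: '#' wall, '.' open, 'S' start, 'E' exit.
# The outer border is all walls, so neighbour lookups never leave the grid.
MAZE = [
    "#######",
    "#S..#E#",
    "#.#.#.#",
    "#.#...#",
    "#######",
]


def find(maze: List[str], ch: str) -> Cell:
    """Return the (row, col) of the first occurrence of ch."""
    for r, row in enumerate(maze):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(f"{ch!r} not in maze")


def solve(maze: List[str]) -> Optional[List[Cell]]:
    """Depth-first search: try directions in a fixed order, back up at dead ends."""
    start, goal = find(maze, "S"), find(maze, "E")
    visited: Set[Cell] = set()

    def dfs(cell: Cell, path: List[Cell]) -> Optional[List[Cell]]:
        if cell == goal:
            return path
        visited.add(cell)
        r, c = cell
        # Fixed preference order: west, south, east, north (an assumption
        # standing in for "take left-hand turns first").
        for dr, dc in ((0, -1), (1, 0), (0, 1), (-1, 0)):
            nxt = (r + dr, c + dc)
            if maze[nxt[0]][nxt[1]] != "#" and nxt not in visited:
                result = dfs(nxt, path + [nxt])
                if result is not None:
                    return result
        # Dead end: returning None is the "backing up" step.
        return None

    return dfs(start, [start])


path = solve(MAZE)
print(path)
```

In this maze the search first walks down the dead-end corridor below the start, backs up, and only then finds the route to the exit. A walker that never backs up would stay stuck in that corridor.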

Comment by Sam Clarke on You are probably underestimating how good self-love can be · 2022-02-04T18:11:47.944Z · LW · GW

Amazing!

This has inspired me to try this too. I think I won't do 1h per day because I'm out of practice with meditation so 1h sounds real hard, but I commit to doing 20 mins per day for 10 days sometime in February.

What resources did you use to learn/practice? (Anything additional to the ones recommended in this post?) Was there anything else that helped?

Comment by Sam Clarke on Distinguishing AI takeover scenarios · 2022-01-14T11:57:37.939Z · LW · GW

Good idea, I can't get it to work on LW but here is the link: https://docs.google.com/document/d/1XyXNZjRTNImRB6HNOOr_2S0uASpwjopRfyd4Y8fATf8/edit?usp=sharing

Comment by Sam Clarke on Discussion with Eliezer Yudkowsky on AGI interventions · 2021-11-18T10:02:48.578Z · LW · GW

why attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes (...)

If you know of a reference to, or feel like explaining in some detail, the arguments given (in parentheses) for this claim, I'd love to hear them!

Comment by Sam Clarke on Ngo and Yudkowsky on alignment difficulty · 2021-11-17T14:26:41.771Z · LW · GW

Minor terminology note, in case discussion about "genomic/genetic bottleneck" continues: genetic bottleneck appears to have a standard meaning in ecology (different to Richard's meaning), so genomic bottleneck seems like the better term to use.

Comment by Sam Clarke on Comments on Carlsmith's “Is power-seeking AI an existential risk?” · 2021-11-17T14:16:28.301Z · LW · GW

Strong upvote, I would also love to see more discussion on the difficulty of inner alignment.

which if true should preclude strong confidence in disaster scenarios

Though only for disaster scenarios that rely on inner misalignment, right?

... seem like world models that make sense to me, given the surrounding justifications

FWIW, I don't really understand those world models/intuitions yet:

Re: "earlier patches not generalising as well as the deep algorithms" - I don't understand/am sceptical about the abstraction of "earlier patches" vs. "deep algorithms learned as intelligence is scaled up". What seem to be dubbed "patches that won't generalise well" seem to me to be more like "plausibly successful shaping of the model's goals". I don't see why, at some point when the model gets sufficiently smart, gradient descent will get it to throw out the goals it used to have. What am I missing?
Re: corrigibility being "anti-natural" in a certain sense - I think I just don't understand this at all. Has it been discussed clearly anywhere else?

(jtbc, I think inner misalignment might be a big problem, I just haven't seen any good argument for it plausibly being the main problem)

Comment by Sam Clarke on Sam Clarke's Shortform · 2021-10-13T12:02:37.645Z · LW · GW

My own guess is that this is not that far-fetched.

Thanks for writing this out, I found it helpful and it's updated me a bit towards human extinction not being that far-fetched in the 'Part 1' world. Though I do still think that, in this world, humans would almost certainly have very little chance of ever gaining control over our future/trajectory.

Without the argument this feels alarmist

Let me try to spell out the argument a little more - I think my original post was a little unclear. I don't think the argument actually appeals to the "convergent instrumental value of resource acquisition". We're not talking about randomly sampling an objective function for AGI and asking whether it implies resource acquisition for instrumental reasons.

Rather, we're talking about selecting an objective function for AGI using something like gradient descent on some training objective, and - instead of an aligned objective arising from this process - a resource-acquiring/influence-seeking objective emerges. This is because doing well on the training objective is a good strategy for gaining resources/influence.

Random objectives that aren't resource/influence-seeking will be selected against by the training process, because they don't perform well on the training objective.

On this model, the AGI will have a resource-acquiring objective function, and we don't need to appeal to the convergent instrumental value of resource acquisition.

I'm curious if this distinction makes sense and seems right to you?

Comment by Sam Clarke on Sam Clarke's Shortform · 2021-10-08T15:58:53.831Z · LW · GW

Good catch, I edited the last points in each part to make the scale of the disaster clearer, and removed the reference to gorillas.

I do think the scale of disaster is smaller (in expectation) in Part 1 than in Part 2, for the reason mentioned here - basically, the systems in Part 1 are somewhat more aligned with human intentions (albeit poorly specified proxies to them), so there's some chance that they leave humans alone. Whereas Part 2 is a treacherous turn inner alignment failure, where the systems learned arbitrary objectives and so have no incentive at all to keep humans alive.

Comment by Sam Clarke on Sam Clarke's Shortform · 2021-10-07T14:29:19.110Z · LW · GW

I sometimes want to point people towards a very short, clear summary of What failure looks like, which doesn't seem to exist, so here's my attempt.

Many agentic AI systems gradually increase in intelligence and generality, and are deployed increasingly widely across society to do important tasks (e.g., law enforcement, running companies, manufacturing and logistics).
Initially, this world looks great from a human perspective, and most people are much richer than they are today.
But things then go badly in one of two ways (or more likely, a combination of both).
[Part 1] Going out with a whimper
- In the training process, we used easily-measurable proxy goals as objective functions that don’t push the AI systems to do what we actually want, e.g.
  - 'maximise positive feedback from your operator' instead of 'try to help your operator get what they actually want'
  - 'reduce reported crimes' instead of 'actually prevent crime'
  - 'increase reported life satisfaction' instead of 'actually help humans live good lives'
  - 'increase human wealth on paper' instead of 'increase effective human control over resources'
- (We did this because ML needs lots of data/feedback to train systems, and you can collect much more data/feedback on easily-measurable objectives.)
- Due to competitive pressures, systems continue being deployed despite some people pointing out this is a bad idea.
- The goals of AI systems gradually gain more influence over the future relative to human goals.
- Eventually, the proxies for which the AI systems are optimising come apart from the goals we truly care about, but by then humanity won’t be able to take back influence, and we’ll have permanently lost some of our ability to steer our trajectory. In the end, we will either go extinct or be mostly disempowered.
- (In some sense, this isn’t really a big departure from what is already happening today - just imagine replacing today’s powerful corporations and states with machines pursuing similar objectives).
[Part 2] Going out with a bang
- These AI systems end up learning objectives that are unrelated to the objective functions used in the training process, because the objective they ended up learning was more naturally discovered during the training process (e.g. "don't get shut down").
- The systems seek influence as an instrumental subgoal (since with more influence, a system is more likely to be able to e.g. prevent attempts to shut it down).
- Early in training, the best way to do that is by being obedient (since systems understand that disobedient behaviour would get them shut down).
- Then, once the systems become sufficiently capable, they attempt to acquire resources and influence to more effectively achieve their goals, including by eliminating the influence of humans. In the end, humans will most likely go extinct, because the systems have no incentive to preserve our survival.

Comment by Sam Clarke on The theory-practice gap · 2021-10-06T13:39:26.275Z · LW · GW

If we don’t have the techniques to reliably align AI, will someone deploy AI anyway? I think it’s more likely the answer is yes.

What level of deployment of unaligned benchmark systems do you expect would make doom plausible? "Someone" suggests maybe you think one deployment event of a sufficiently powerful system could be enough (which would be surprising in slow takeoff worlds). If you do think this, is it something to do with your expectations about discontinuous progress around AGI?

Comment by Sam Clarke on Collection of arguments to expect (outer and inner) alignment failure? · 2021-10-04T16:27:24.489Z · LW · GW

A more valid criticism would be that the authors spend most of their time on showing that all of these failure mechanisms are theoretically possible, without spending much time discussing how likely each of them is in practice

Sure, I agree this is a stronger point.

The collection of posts under the threat models tag may be what you are looking for: many of these posts highlight the particular risk scenarios the authors feel are most compelling or likely.

Not really, unfortunately. In those posts, the authors are focusing on painting a plausible picture of what the world looks like if we screw up alignment, rather than analysing the arguments that we should expect alignment failures in the first place - which is what I'm interested in (with the exception of Steven's scenario, who already answered here).

The main problem with distilling this work into, say, a top 3 of most powerful 1-page arguments is that we are not dealing purely with technology-driven failure modes.

I fully agree that thinking through e.g. incentives that different actors will have in the lead up to TAI, the interaction between AI technology and society, etc. is super important. But we can think through those things as well - e.g. we can look at historical examples of humanity being faced with scenarios where the global economy is (mis)aligned with human needs, and reason about the extent to which AI will be different. I'd count all of that as part of the argument to expect alignment failure. Yes, as soon as you bring societal interactions into the mix, things become a whole lot more complicated. But that isn't reason not to try.

As it stands, I don't think there are super clear arguments for alignment failure that take into account interactions between AI tech and society that are ready to be distilled down, though I tried doing some of it here.

Equally, much of the discussion (and predictions of many leading thinkers in this space) is premised on technical alignment failure being the central concern (i.e. if we had better technical alignment solutions, we would manage to avoid existential catastrophe). I don't want to argue about whether that's correct here, but just want to point out that at least some people think that at least some of the plausible failure modes are mostly technology-driven.

So will you be distilling for an audience of pessimists or optimists?

Neither - just trying to think clearly through the arguments on both sides.

In the particular case you describe, I find the "pessimist" side more compelling, because I don't see much evidence that humanity has really learned any lessons from oil and climate change. In particular, we still don't know how to solve collective action problems.

This has kept me from spending much energy myself on rationally quantifying the odds of different failure mode scenarios. I'd rather spend my energy in finding ways to improve the odds.

Yeah, I'm sympathetic to this line of thought, and I think I personally tend to err on the side of trying to spend too much energy on quantifying odds and not enough on acting.

However, to the extent that you're impartial between different ways of trying to improve the odds (e.g. working on technical AI alignment vs other technical AI safety vs AI policy vs meta interventions vs other cause areas entirely), then it still pays to work out (e.g.) how plausible AI alignment failure is, in order to inform your decision about what to do if you want to have the best chance of helping.

Comment by Sam Clarke on Collection of arguments to expect (outer and inner) alignment failure? · 2021-10-04T15:01:06.692Z · LW · GW

I'm broadly sympathetic to your point that there have been an unfortunate number of disagreements about inner alignment terminology, and it has been and remains a source of confusion.

to the extent that Evan has felt a need to write an entire clarification post.

Yeah, and recently there have been even more disagreements/clarification attempts.

I should have specified this on the top level question, but (as mentioned in my own answer) I'm talking about abergal's suggestion of what inner alignment failure should refer to (basically: a model pursuing a different objective to the one it was trained on, when deployed out-of-distribution, while retaining most or all of the capabilities it had on the training distribution). I agree this isn't crisp and is far from a mathematical formalism, but note that there are several examples of this kind of failure in current ML systems that help to clarify what the concept is, and people seem to agree on these examples.

If you can think of toy examples that make real trouble for this definition of inner alignment failure, then I'd be curious to hear what they are.

Comment by Sam Clarke on Collection of arguments to expect (outer and inner) alignment failure? · 2021-10-04T08:41:40.873Z · LW · GW

Thanks for your reply!

depends on what you mean with strongest arguments.

By strongest I definitely mean the second thing (probably I should have clarified here, thanks for picking up on this).

Also, the strongest argument when you address an audience of type A, say policy makers, may not be the strongest argument for an audience of type B, say ML researchers.

Agree, though I expect it's more like, the emphasis needs to be different, whilst the underlying argument is similar (conditional on talking about your second definition of "strongest").

many distilled collections of arguments already exist, even book-length ones like Superintelligence, Human Compatible, and The Alignment Problem.

Probably I should have clarified some more here. By "distilled", I mean:

a really short summary (e.g. <1 page for each argument, with links to literature which discuss the argument's premises)
that makes it clear what the epistemic status of the argument is.

Those books aren't short, and neither do they focus on working out exactly how strong the case for alignment failure is, but rather on drawing attention to the problem and claiming that more work needs to be done on the current margin (which I absolutely agree with).

I also don't think they focus on surveying the range of arguments for alignment failure, but rather on presenting the author's particular view.

If there are distilled collections of arguments with these properties, please let me know!

(As some more context for my original question: I'm most interested in arguments for inner alignment failure. I'm pretty confused by the fact that some researchers seem to think inner alignment is the main problem and/or probably extremely difficult, and yet I haven't really heard a rigorous case made for its plausibility.)

Comment by Sam Clarke on What are your greatest one-shot life improvements? · 2021-10-04T08:22:09.112Z · LW · GW

Immersion reading, i.e. reading a book and listening to the audio version at the same time. It makes it easier to read when tired, improves retention, increases the speed at which I can comfortably read.

Most of all, with a good narrator, it makes reading fiction feel closer to watching a movie in terms of the 'immersiveness' of the experience (while retaining all the ways in which fiction is better than film).

It's also marginally very cheap and easy if you're willing to pay for a Kindle and Audible subscription.

Comment by Sam Clarke on Collection of arguments to expect (outer and inner) alignment failure? · 2021-09-28T16:56:06.321Z · LW · GW

Arguments for outer alignment failure, i.e. that we will plausibly train advanced AI systems using a training objective that doesn't incentivise or produce the behaviour we actually want from the AI system. (Thanks to Richard for spelling out these arguments clearly in AGI safety from first principles.)

It's difficult to explicitly write out objective functions which express all our desires about AGI behaviour.
- There’s no simple metric which we’d like our agents to maximise - rather, desirable AGI behaviour is best formulated in terms of concepts like obedience, consent, helpfulness, morality, and cooperation, which we can’t define precisely in realistic environments.
- Although we might be able to specify proxies for those goals, Goodhart’s law suggests that some undesirable behaviour will score very well according to these proxies, and therefore be reinforced in AIs trained on them.
Comparatively primitive AI systems have already demonstrated many examples of outer alignment failures, even on much simpler objectives than what we would like AGIs to be able to do.
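The Goodhart's law point above can be sketched with a toy example, borrowing the 'reduce reported crimes' vs 'actually prevent crime' contrast from the What failure looks like summary. All the policies and numbers below are made up for illustration; the only point is that an optimiser selecting on the measurable proxy can pick an option that scores worst on the thing we actually care about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    crimes_prevented: int    # what we actually care about
    reports_suppressed: int  # crimes that still happen but go unreported


# Hypothetical numbers, chosen only to illustrate the divergence.
BASELINE_CRIMES = 100

POLICIES = [
    Policy("do nothing", crimes_prevented=0, reports_suppressed=0),
    Policy("community policing", crimes_prevented=30, reports_suppressed=0),
    Policy("discourage reporting", crimes_prevented=0, reports_suppressed=60),
    Policy("mixed", crimes_prevented=20, reports_suppressed=30),
]


def actual_crimes(p: Policy) -> int:
    """True objective: crimes that really occur (lower is better)."""
    return BASELINE_CRIMES - p.crimes_prevented


def reported_crimes(p: Policy) -> int:
    """Proxy objective: crimes that show up in the statistics."""
    return actual_crimes(p) - p.reports_suppressed


best_by_proxy = min(POLICIES, key=reported_crimes)
best_by_true = min(POLICIES, key=actual_crimes)

print("proxy picks:", best_by_proxy.name, "-> actual crimes", actual_crimes(best_by_proxy))
print("true objective picks:", best_by_true.name, "-> actual crimes", actual_crimes(best_by_true))
```

Here the proxy-optimal policy leaves actual crime at its baseline level, while the policy that is best on the true objective looks mediocre on the proxy.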

Arguments for inner alignment failure, i.e. that advanced AI systems will plausibly pursue an objective other than the training objective while retaining most or all of the capabilities they had on the training distribution.^[1]

There exist certain subgoals, such as "acquiring influence", that are useful for achieving a broad range of final goals. Therefore, these may reliably lead to higher reward during training. Agents might come to value these subgoals for their own sake, and highly capable agents that e.g. want influence are likely to take adversarial action against humans.
The models we train might learn heuristics instead of the complex training objective, which are good enough to score very well on the training distribution, but break down under distributional shift.
- This could happen if the model class isn't expressive enough to learn the training objective; or because heuristics are more easily discovered (than the training objective) during the learning process.
Argument by analogy to human evolution: humans are misaligned with the goal of increasing genetic fitness.
- The naive version of this argument seems quite weak to me, and could do with more investigation about just how analogous modern ML training and human evolution are.
The training objective is a narrow target among a large space of possible objectives that do well on the training distribution.
- The naive version of this argument also seems quite weak to me. Lots of human achievements have involved hitting very improbable, narrow targets. I think there's a steelman version, but I'm not going to try to give it here.
The arguments in Sections 3.2, 3.3 and 4.4 of Risks from Learned Optimization are also relevant, which give arguments for mesa-optimisation failure.
- (Remember, mesa-optimisation failure is a specific kind of inner alignment failure. It's an inner alignment failure when the learned model is an optimiser in the sense that it is internally searching through a search space looking for elements that score highly according to some objective function that is explicitly represented within the system).
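As a rough illustration of the second argument in this list (heuristics that score well on the training distribution but break under distributional shift), here is a toy sketch. The data, the feature names and the "take the first single-feature rule that fits the training set" learner are all assumptions invented for this example. It only illustrates the "fits training, fails under shift" part of the argument, not capability retention or anything specific to real ML training.

```python
from typing import Dict, List, Set, Tuple

Example = Tuple[Dict[str, str], bool]

# Hypothetical task: the intended objective is "label an object True iff it is a key".
# In training, every key happens to be gold and every non-key is grey.
TRAIN: List[Example] = [
    ({"colour": "gold", "shape": "key"}, True),
    ({"colour": "gold", "shape": "key"}, True),
    ({"colour": "grey", "shape": "box"}, False),
    ({"colour": "grey", "shape": "ball"}, False),
]

# Shifted distribution: colour no longer correlates with shape.
SHIFTED: List[Example] = [
    ({"colour": "grey", "shape": "key"}, True),
    ({"colour": "gold", "shape": "box"}, False),
]


def fit_rule(data: List[Example], feature: str) -> Set[str]:
    """Values of `feature` that appeared with a positive label."""
    return {x[feature] for x, y in data if y}


def accuracy(data: List[Example], feature: str, positives: Set[str]) -> float:
    """Fraction of examples where 'feature value in positives' matches the label."""
    correct = sum((x[feature] in positives) == y for x, y in data)
    return correct / len(data)


def learn(data: List[Example], feature_order: List[str]) -> Tuple[str, Set[str]]:
    """Return the first single-feature rule that is perfect on the training data.

    The search order stands in for "heuristics are more easily discovered".
    """
    for feature in feature_order:
        positives = fit_rule(data, feature)
        if accuracy(data, feature, positives) == 1.0:
            return feature, positives
    raise ValueError("no single-feature rule fits the data")


feature, positives = learn(TRAIN, ["colour", "shape"])
train_acc = accuracy(TRAIN, feature, positives)
shift_acc = accuracy(SHIFTED, feature, positives)
print(feature, train_acc, shift_acc)
```

The learner settles on the colour heuristic because nothing in the training data distinguishes it from the intended shape rule; the two only come apart once colour and shape are decorrelated.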

This follows abergal's suggestion of what inner alignment should refer to. ↩︎

Comment by Sam Clarke on An Increasingly Manipulative Newsfeed · 2021-09-13T10:39:56.662Z · LW · GW

(Note: this post is an extended version of this post about stories of continuous deception. If you are already familiar with treacherous turn vs. sordid stumble you can skip the first part.)

FYI, broken link in this sentence.

Comment by Sam Clarke on Persuasion Tools: AI takeover without AGI or agency? · 2021-08-12T16:46:45.021Z · LW · GW

I found this post helpful and interesting, and refer to it often! FWIW I think that powerful persuasion tools could have bad effects on the memetic ecosystem even if they don't shift the balance of power to a world with fewer, more powerful ideologies. In particular, the number of ideologies could remain roughly constant, but each could get more 'sticky'. This would make reasonable debate and truth-seeking harder, as well as reducing trusted and credible multipartisan sources. This seems like an existential risk factor, e.g. because it will make coordination harder. (Analogy to how vaccine and mask hesitancy during Covid was partly due to insufficient trust in public health advice). Or more speculatively I could also imagine an extreme version of sticky, splintered epistemic bubbles leading to moral stagnation/value lock-in.

Minor question on framing: I'm wondering why you chose to call this post "AI takeover without AGI or agency?" given that the effects of powerful persuasion tools you talk about aren't what (I normally think of as) "AI takeover"? (Rather, if I've understood correctly, they are "persuasion tools as existential risk factor", or "persuasion tools as mechanism for power concentration among humans".)

Somewhat related: I think there could be a case made for takeover by goal-directed but narrow AI, though I haven't really seen it made. But I can't see a case for takeover by non-goal-directed AI, since why would AI systems without goals want to take over? I'd be interested if you have any thoughts on those two things.

Comment by Sam Clarke on How to Sleep Better · 2021-07-19T11:04:51.656Z · LW · GW

only sleep when I'm tired

Sounds cool, I'm tempted to try this out, but I'm wondering how this jibes with the common wisdom that going to bed at the same time every night is important? And "No screens an hour before bed" - how do you know what "an hour before bed" is if you just go to bed when tired?

Comment by Sam Clarke on How to Sleep Better · 2021-07-19T11:01:50.994Z · LW · GW

I feel similarly, and still struggle with turning off my brain. Has anything worked particularly well for you?

Comment by Sam Clarke on How to Sleep Better · 2021-07-19T10:59:20.405Z · LW · GW

I'm curious how you actually use the information from your Oura ring? To help measure the effectiveness of sleep interventions? As one input for deciding how to spend your day? As a motivator to sleep better? Something else?

Comment by Sam Clarke on Some thoughts on risks from narrow, non-agentic AI · 2021-07-01T07:52:11.445Z · LW · GW

Makes sense, thanks!

Comment by Sam Clarke on Some thoughts on risks from narrow, non-agentic AI · 2021-06-30T10:11:35.230Z · LW · GW

being trained on "follow instructions"

What does this actually mean, in terms of the details of how you'd train a model to do this?

Comment by Sam Clarke on Survey on AI existential risk scenarios · 2021-06-14T08:53:25.944Z · LW · GW

Thanks for the reply - a couple of responses:

it doesn't seem useful to get a feeling for "how far off of ideal are we likely to be" when that is composed of: 1. What is the possible range of AI functionality (as constrained by physics)? - ie what can we do?

No, these cases aren't included. The definition is: "an existential catastrophe that could have been avoided had humanity's development, deployment or governance of AI been otherwise". Physics cannot be changed by humanity's development/deployment/governance decisions. (I agree that cases 2 and 3 are included).

Knowing that experts think we have a (say) 10% chance of hitting the ideal window says nothing about what an interested party should do to improve those chances.

That's correct. The survey wasn't intended to understand respondents' views on interventions. It was only intended to understand: if something goes wrong, what do respondents think that was? Someone could run another survey that asks about interventions (in fact, this other recent survey does that). For the reasons given in the Motivation section of this post, we chose to limit our scope to threat models, rather than interventions.

Comment by Sam Clarke on Survey on AI existential risk scenarios · 2021-06-11T13:03:48.713Z · LW · GW

Thanks for pointing this out. We did intend for cases like this to be included, but I agree that it's unclear if respondents interpreted it that way. We should have clarified this in the survey instructions.

Comment by Sam Clarke on Survey on AI existential risk scenarios · 2021-06-11T13:00:36.743Z · LW · GW

Is one question combining the risk of "too much" AI use and "too little" AI use?

Yes, it is. Combining these cases seems reasonable to me, though we definitely should have clarified this in the survey instructions. They're both cases where humanity could have avoided an existential catastrophe by making different decisions with respect to AI.

Comment by Sam Clarke on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2021-06-07T15:31:54.348Z · LW · GW

Thanks a lot for this post, I found it extremely helpful and expect I will refer to it a lot in thinking through different threat models.

I'd be curious to hear how you think the Production Web stories differ from part 1 of Paul's "What failure looks like".

To me, the underlying threat model seems to be basically the same: we deploy AI systems with objectives that look good in the short run, but when those systems become as capable as, or more capable than, humans, their objectives don't generalise "well" (i.e. in ways desirable by human standards), because they're optimising for proxies (namely, a cluster of objectives that could loosely be described as "maximise production" within their industry sector) that eventually come apart from what we actually want ("maximising production" eventually means using up resources critical to human survival but non-critical to machines).

From reading some of the comment threads between you and Paul, it seems like you disagree about where, on the margin, resources should be spent (improving the cooperative capabilities of AI systems and humans vs improving single-single intent alignment) - but you agree on this particular underlying threat model?

It also seems like you emphasise different aspects of these threat models: you emphasise the role of competitive pressures more (but they're also implicit in Paul's story), and Paul emphasises failures of intent alignment more (but they're also present in your story) - though this is consistent with having the same underlying threat model?

(Of course, both you and Paul also have other threat models, e.g. you have Flash War, Paul has part 2 of "What failure looks like", and also Another (outer) alignment failure story, which seems to be basically a more nuanced version of part 1 of "What failure looks like". Here, I'm curious specifically about the two threat models I've picked out.)

(I could have lots of this totally wrong, and would appreciate being corrected if so)

Comment by Sam Clarke on What are some real life Inadequate Equilibria? · 2021-05-25T14:00:04.225Z · LW · GW

I'm a bit confused about the edges of the inadequate equilibrium concept you're interested in.

In particular, do simple cases of negative externalities count? E.g. the econ 101 example of "factory pollutes river" - seems like an instance of (1) and (2) in Eliezer's taxonomy - depending on whether you're thinking of the "decision-maker" as (1) the factory owner (who would lose out personally) or (2) the government (who can't learn the information they need because the pollution is intentionally hidden). But this isn't what I'd typically think of as a bad Nash equilibrium, because (let's suppose) the factory owners wouldn't actually be better off by "cooperating".

Comment by Sam Clarke on What will 2040 probably look like assuming no singularity? · 2021-05-21T11:32:04.608Z · LW · GW

Just an outside view that over the last decades, a number of groups who previously had to suppress their identities/were vilified are now more accepted (e.g., LGBTQ+, feminists, vegans), and I expect this trend to continue.

I'm curious if you expect this trend to change, or maybe we're talking about slightly different things here?

Comment by Sam Clarke on What will 2040 probably look like assuming no singularity? · 2021-05-19T16:32:36.401Z · LW · GW

I had something like "everybody who has to strongly hide part of their identity when living in cities" in mind

Comment by Sam Clarke on Less Realistic Tales of Doom · 2021-05-19T16:27:17.699Z · LW · GW

Thanks for writing this! Here's another, that I'm posting specifically because it's confusing to me.

Value erosion

Takeoff was slow and lots of actors developed AGI around the same time. Intent alignment turned out relatively easy and so lots of actors with different values had access to AGIs that were trying to help them. Our ability to solve coordination problems remained at ~its current level. Nation states, or something like them, still exist, and there is still lots of economic competition between and within them. Sometimes there is military conflict, which destroys some nation states, but it never destroys the world.

The need to compete in these ways limits the extent to which each actor is able to spend their resources on things they actually want (because they have to spend a cut on competing, economically or militarily). Moreover, this cut is ever-increasing, since the actors who don't increase their competitiveness get wiped out. Different groups start spreading to the stars. Human descendants eventually colonise the galaxy, but have to spend ever closer to 100% of their energy on their militaries and producing economically valuable stuff. Those who don't do so get outcompeted (i.e. destroyed in conflict or dominated in the market) and so lose most of their ability to get what they want.

Moral: even if we solve intent alignment, avoid catastrophic war or misuse of AI by bad actors, and other acute x-risks, the future could (would probably?) still be much worse than it could be, if we don't also coordinate to stop the value race to the bottom.

Comment by Sam Clarke on What will 2040 probably look like assuming no singularity? · 2021-05-19T15:33:25.311Z · LW · GW

Epistemic effort: I thought about this for 20 minutes and dumped my ideas, before reading others' answers

The latest language models are assisting or doing a number of tasks across society in rich countries, e.g.
- Helping lawyers search and summarise cases, suggest inferences, etc. but human lawyers still make calls at the end of the day
- Similar for policymaking, consultancy, business strategising etc.
- Lots of non-truth-seeking journalism. All good investigative journalism is still done by humans.
- Telemarketing and some customer service jobs
The latest deep RL models are assisting or doing a number of tasks across society in rich countries, e.g.
- Lots of manufacturing
- Almost all warehouse management
- Most content filtering on social media
- Financing decisions made by banks
Other predictions
- it's much easier to communicate with anyone, anywhere, at higher bandwidth (probably thanks to really good VR and internet)
- the way we consume information has changed a lot (probably also related to VR, and content selection algorithms getting really good)
- the way we shop has changed a lot (probably again due to content selection algorithms. I'm imagining there being very little effort between having a material desire and spending money to have it fulfilled)
- education hasn't really changed
- international travel hasn't really changed
- discrimination against groups that are marginalised in 2021 has reduced somewhat
- nuclear energy is even more widespread and much safer
- getting some psychotherapy or similar is really common (>80% of people)

Comment by Sam Clarke on What will 2040 probably look like assuming no singularity? · 2021-05-19T10:04:25.334Z · LW · GW

Thanks for this, really interesting!

Meta question: when you wrote this list, what did your thought process/strategies look like, and what do you think are the best ways of getting better at this kind of futurism?

More context:

One obvious answer to my second question is to get feedback - but the main bottleneck there is that these things won't happen for many years. Getting feedback from others (hence this post, I presume) is a partial remedy, but isn't clearly that helpful (e.g. if everyone's futurism capabilities are limited in the same ways). Maybe you've practised futurism over shorter time horizons a lot? Or you expect that people giving you feedback have?
After reading the first few entries, I spent 20 mins writing my own list before reading yours. Some questions/confusions that occurred:
- All of my ideas ended up with epistemic status "OK, that might happen, but I'd need to spend at least a day researching this to be able to say anything like "probably that'll happen by 2040" "
  - So I'm wondering if you did this/already had the background knowledge, or if I'm wrong that this is necessary
- My strategies were (1) consider important domains (e.g. military, financial markets, policymaking), and what better LMs/deep RL/DL in general/other emerging tech will do to those domains; (2) consider obvious AI/emerging tech applications (e.g. customer service); (3) look back to 2000 and 1980 and extrapolate apparent trends.
  - How good are these strategies? What other strategies are there? How should they be weighed?
- How much is my bottleneck to being better at this (a) better models for extrapolating trends in AI capabilities/other emerging tech vs (b) better models of particular domains vs (c) better models of the-world-in-general vs (d) something else?

Comment by Sam Clarke on Less Realistic Tales of Doom · 2021-05-12T14:52:54.638Z · LW · GW

Will MacAskill calls this the "actual alignment problem"

Wei Dai has written a lot about related concerns in posts like The Argument from Philosophical Difficulty

Comment by Sam Clarke on What Failure Looks Like: Distilling the Discussion · 2021-05-10T10:02:02.270Z · LW · GW

The AI systems in part I of the story are NOT "narrow" or "non-agentic"

There's no difference between the level of "narrowness" or "agency" of the AI systems between parts I and II of the story.
- Many people (including Richard Ngo and myself) seem to have interpreted part I as arguing that there could be an AI takeover by AI systems that are non-agentic and/or narrow (i.e. are not agentic AGI). But this is not at all what Paul intended to argue.
- Put another way, both parts I and II are instances of the "second species" concern/gorilla problem: that AI systems will gain control of humanity's future. (I think this is also identical to what people mean when they say "AI takeover".)
- As far as I can tell, this isn't really a different kind of concern from the classic Bostrom-Yudkowsky case for AI x-risk. It's just a more nuanced picture of what goes wrong, that also makes failure look plausible in slow takeoff worlds.
Instead, the key difference between parts I and II of the story is the way that the models' objectives generalise.
- In part II, it's the kind of generalisation typically called a "treacherous turn". The models learn the objective of "seeking influence". Early in training, the best way to do that is by "playing nice". The failure mode is that, once they become sufficiently capable, they no longer need to play nice and instead take control of humanity's future.
- In part I, it's a different kind of generalisation, which has been much less discussed. The models learn some easily-measurable objective which isn't what humans actually want. In other words, the failure mode is that these models are trying to "produce high scores" instead of "help humans get what they want". You might think that using human feedback to specify the base objective will alleviate this problem (e.g. learn a reward model from human demonstrations or preferences about a hard-to-measure objective). But this doesn't obviously help: now, the failure mode is that the model learns the objective "do things that look to humans like you are achieving X" or "do things that the humans giving feedback about X will rate highly" (instead of "actually achieving X").
- Notice that in both of these scenarios, the models are mesa-optimizers (i.e. the learned models are themselves optimizers), and failure ensues because the models' learned objectives generalise in the wrong way.

This was discussed in comments (on a separate post) by Richard Ngo and Paul Christiano. There's a lot more important discussion in that comment thread, which is summarised in this doc.

Comment by Sam Clarke on AMA: Paul Christiano, alignment researcher · 2021-04-30T12:00:39.204Z · LW · GW

Relatedly: if we manage to solve intent alignment (including making it competitive) but still have an existential catastrophe, what went wrong?
