Posts

On attunement 2024-03-25T12:47:34.856Z
Video and transcript of presentation on Scheming AIs 2024-03-22T15:52:03.311Z
On green 2024-03-21T17:38:56.295Z
On the abolition of man 2024-01-18T18:17:06.201Z
Being nicer than Clippy 2024-01-16T19:44:23.893Z
An even deeper atheism 2024-01-11T17:28:31.843Z
Does AI risk “other” the AIs? 2024-01-09T17:51:47.020Z
When "yang" goes wrong 2024-01-08T16:35:50.607Z
Deep atheism and AI risk 2024-01-04T18:58:47.745Z
Gentleness and the artificial Other 2024-01-02T18:21:34.746Z
Otherness and control in the age of AGI 2024-01-02T18:15:54.168Z
Empirical work that might shed light on scheming (Section 6 of "Scheming AIs") 2023-12-11T16:30:57.989Z
Summing up "Scheming AIs" (Section 5) 2023-12-09T15:48:49.109Z
Speed arguments against scheming (Section 4.4-4.7 of “Scheming AIs") 2023-12-08T21:09:48.672Z
Simplicity arguments for scheming (Section 4.3 of "Scheming AIs") 2023-12-07T15:05:54.267Z
The counting argument for scheming (Sections 4.1 and 4.2 of "Scheming AIs") 2023-12-06T19:28:19.393Z
Arguments for/against scheming that focus on the path SGD takes (Section 3 of "Scheming AIs") 2023-12-05T18:48:12.917Z
Non-classic stories about scheming (Section 2.3.2 of "Scheming AIs") 2023-12-04T18:44:32.825Z
Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of "Scheming AIs") 2023-12-03T18:32:42.748Z
The goal-guarding hypothesis (Section 2.3.1.1 of "Scheming AIs") 2023-12-02T15:20:28.152Z
How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of "Scheming AIs") 2023-12-01T14:51:04.624Z
Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of “Scheming AIs”) 2023-11-30T16:43:07.557Z
“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”) 2023-11-29T16:32:30.068Z
Two sources of beyond-episode goals (Section 2.2.2 of “Scheming AIs”) 2023-11-28T13:49:49.175Z
Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”) 2023-11-27T18:01:29.153Z
Situational awareness (Section 2.1 of “Scheming AIs”) 2023-11-26T23:00:47.588Z
On “slack” in training (Section 1.5 of “Scheming AIs”) 2023-11-25T17:51:42.814Z
Why focus on schemers in particular (Sections 1.3 and 1.4 of “Scheming AIs”) 2023-11-24T19:18:34.229Z
A taxonomy of non-schemer models (Section 1.2 of “Scheming AIs”) 2023-11-22T15:24:37.126Z
Varieties of fake alignment (Section 1.1 of “Scheming AIs”) 2023-11-21T15:00:31.906Z
New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?" 2023-11-15T17:16:42.088Z
Superforecasting the premises in “Is power-seeking AI an existential risk?” 2023-10-18T20:23:51.723Z
In memory of Louise Glück 2023-10-15T02:59:42.687Z
The “no sandbagging on checkable tasks” hypothesis 2023-07-31T23:06:02.909Z
Predictable updating about AI risk 2023-05-08T21:53:34.730Z
[Linkpost] Shorter version of report on existential risk from power-seeking AI 2023-03-22T18:09:02.938Z
A Stranger Priority? Topics at the Outer Reaches of Effective Altruism (my dissertation) 2023-02-21T17:26:12.981Z
Seeing more whole 2023-02-17T05:12:58.583Z
Why should ethical anti-realists do ethics? 2023-02-16T16:27:30.795Z
[Linkpost] Human-narrated audio version of "Is Power-Seeking AI an Existential Risk?" 2023-01-31T19:21:48.907Z
On sincerity 2022-12-23T17:13:09.478Z
Against meta-ethical hedonism 2022-12-02T00:23:26.039Z
Against the normative realist's wager 2022-10-13T16:35:30.933Z
Video and Transcript of Presentation on Existential Risk from Power-Seeking AI 2022-05-08T03:50:12.758Z
On expected utility, part 4: Dutch books, Cox, and Complete Class 2022-03-24T07:51:18.221Z
On expected utility, part 3: VNM, separability, and more 2022-03-22T03:05:21.073Z
On expected utility, part 2: Why it can be OK to predictably lose 2022-03-18T08:38:52.045Z
On expected utility, part 1: Skyscrapers and madmen 2022-03-16T21:58:39.257Z
Simulation arguments 2022-02-18T10:45:55.541Z
On infinite ethics 2022-01-31T07:04:44.244Z

Comments

Comment by Joe Carlsmith (joekc) on Open Thread Spring 2024 · 2024-03-22T19:49:16.401Z · LW · GW

That post ran into some cross-posting problems, so I had to re-do it.

Comment by Joe Carlsmith (joekc) on Counting arguments provide no evidence for AI doom · 2024-03-06T21:22:54.358Z · LW · GW

The point of that part of my comment was that insofar as part of Nora/Quintin's response to simplicity argument is to say that we have active evidence that SGD's inductive biases disfavor schemers, this seems worth just arguing for directly, since even if e.g. counting arguments were enough to get you worried about schemers from a position of ignorance about SGD's inductive biases, active counter-evidence absent such ignorance could easily make schemers seem quite unlikely overall.

There's a separate question of whether e.g. counting arguments like mine above (e.g., "A very wide variety of goals can prompt scheming; By contrast, non-scheming goals need to be much more specific to lead to high reward; I’m not sure exactly what sorts of goals SGD’s inductive biases favor, but I don’t have strong reason to think they actively favor non-schemer goals; So, absent further information, and given how many goals-that-get-high-reward are schemer-like, I should be pretty worried that this model is a schemer") do enough evidential labor to privilege schemers as a hypothesis at all. But that's the question at issue in the rest of my comment. And in e.g. the case of "there are 1000 chinese restaurants in this city, and only ~100 non-chinese restaurants," the number of chinese restaurants seems to me like it's enough to privilege "Bob went to a chinese restaurant" as a hypothesis (and this even without thinking that he made his choice by sampling randomly from a uniform distribution over restaurants). Do you disagree in that restaurant case?

Comment by Joe Carlsmith (joekc) on Counting arguments provide no evidence for AI doom · 2024-03-06T21:05:54.870Z · LW · GW

The probability I give for scheming in the report is specifically for (goal-directed) models that are trained on diverse, long-horizon tasks (see also Cotra on "human feedback on diverse tasks," which is the sort of training she's focused on). I agree that various of the arguments for scheming could in principle apply to pure pre-training as well, and that folks (like myself) who are more worried about scheming in other contexts (e.g., RL on diverse, long-horizon tasks) have to explain what makes those contexts different. But I think there are various plausible answers here related to e.g. the goal-directedness, situational-awareness, and horizon-of-optimization of the models in question (see e.g. here for some discussion, in the report, of why goal-directed models trained on longer episodes seem more likely to scheme; and see here for discussion of why situational awareness seems especially likely/useful in models performing real-world tasks for you).

Re: "goal optimization is a good way to minimize loss in general" -- this isn't a "step" in the arguments for scheming I discuss. Rather, as I explain in the intro to the report, the arguments I discuss condition on the models in question being goal-directed (not an innocuous assumption, I think -- but one I explain and argue for in section 3 of my power-seeking report, and which I think is important to separate from questions about whether to expect goal-directed models to be schemers), and then focus on whether the goals in question will be schemer-like.

Comment by Joe Carlsmith (joekc) on Counting arguments provide no evidence for AI doom · 2024-02-28T05:15:03.304Z · LW · GW

Thanks for writing this -- I’m very excited about people pushing back on/digging deeper re: counting arguments, simplicity arguments, and the other arguments re: scheming I discuss in the report. Indeed, despite the general emphasis I place on empirical work as the most promising source of evidence re: scheming, I also think that there’s a ton more to do to clarify and maybe debunk the more theoretical arguments people offer re: scheming – and I think playing out the dialectic further in this respect might well lead to comparatively fast progress (for all their centrality to the AI risk discourse, I think arguments re: scheming have received way too little direct attention). And if, indeed, the arguments for scheming are all bogus, this is super good news and would be an important update, at least for me, re: p(doom) overall. So overall I’m glad you’re doing this work and think this is a valuable post.

Another note up front: I don’t think this post “surveys the main arguments that have been put forward for thinking that future AIs will scheme.” In particular: both counting arguments and simplicity arguments (the two types of argument discussed in the post) assume we can ignore the path that SGD takes through model space. But the report also discusses two arguments that don’t make this assumption – namely, the “training-game independent proxy goals story” (I think this one is possibly the most common story, see e.g. Ajeya here, and all the talk about the evolution analogy) and the “nearest max-reward goal argument.” I think that the idea that “a wide variety of goals can lead to scheming” plays some role in these arguments as well, but not such that they are just the counting argument restated, and I think they’re worth treating on their own terms. 

On counting arguments and simplicity arguments

Focusing just on counting arguments and simplicity arguments, though: Suppose that I’m looking down at a superintelligent model newly trained on diverse, long-horizon tasks. I know that it has extremely ample situational awareness – e.g., it has highly detailed models of the world, the training process it’s undergoing, the future consequences of various types of power-seeking, etc – and that it’s getting high reward because it’s pursuing some goal (the report conditions on this). Ok, what sort of goal? 

We can think of arguments about scheming in two categories here. 

  • (I) The first tries to be fairly uncertain/agnostic about what sorts of goals SGD’s inductive biases favor, and it argues that given this uncertainty, we should be pretty worried about scheming. 
    • I tend to think of my favored version of the counting argument (that is, the hazy counting argument) in these terms. 
  • (II) The second type focuses on a particular story about SGD’s inductive biases and then argues that this bias favors schemers.
    • I tend to think of simplicity arguments in these terms. E.g., the story is that SGD’s inductive biases favor simplicity, schemers can have simpler goals, so schemers are favored.

Let’s focus first on (I), the more-agnostic-about-SGD’s-inductive-biases type. Here’s a way of pumping the sort of intuition at stake in the hazy counting argument:

  1. A very wide variety of goals can prompt scheming.
  2. By contrast, non-scheming goals need to be much more specific to lead to high reward.
  3. I’m not sure exactly what sorts of goals SGD’s inductive biases favor, but I don’t have strong reason to think they actively favor non-schemer goals.
  4. So, absent further information, and given how many goals-that-get-high-reward are schemer-like, I should be pretty worried that this model is a schemer. 

Now, as I mention in the report, I'm happy to grant that this isn't a super rigorous argument. But how, exactly, is your post supposed to comfort me with respect to it? We can consider two objections, both of which are present in/suggested by your post in various ways.

  • (A) This sort of reasoning would lead you to give significant weight to SGD overfitting. But SGD doesn’t overfit, so this sort of reasoning must be going wrong, and in fact you should have low probability on SGD having selected a schemer, even given this ignorance about SGD's inductive biases.
  • (B): (3) is false: we know enough about SGD’s inductive biases to know that it actively favors non-scheming goals over scheming goals. 

Let’s start with (A). I agree that this sort of reasoning would lead you to give significant weight to SGD overfitting, absent any further evidence. But it’s not clear to me that giving this sort of weight to overfitting was unreasonable ex ante, or that having learned that SGD doesn't overfit, you should now end up with low p(scheming) even given your ongoing ignorance about SGD's inductive biases.

Thus, consider the sort of analogy I discuss in the counting arguments section. Suppose that all we know is that Bob lives in city X, that he went to a restaurant on Saturday, and that city X has a thousand chinese restaurants, a hundred mexican restaurants, and one indian restaurant. What should our probability be that he went to a chinese restaurant?

In this case, my intuitive answer is: “hefty.”[1] In particular, absent further knowledge about Bob’s food preferences, and given the large number of chinese restaurants in the city, “he went to a chinese restaurant” seems like a pretty salient hypothesis. And it seems quite strange to be confident that he went to a non-chinese restaurant instead.
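For concreteness, the naive count-based credence in this example is easy to compute directly. (A sketch only: the restaurant counts are the hypothetical ones from the paragraph above, and the uniform weighting is exactly the naive indifference-style move under discussion, not a claim about the right prior over Bob's choices.)

```python
# Naive indifference-style count for the restaurant example above.
# Counts are the hypothetical numbers from the example; weighting
# restaurants uniformly is the naive move under discussion, not a
# claim about the correct prior over Bob's preferences.
counts = {"chinese": 1000, "mexican": 100, "indian": 1}
total = sum(counts.values())  # 1101 restaurants in all

p_chinese = counts["chinese"] / total
print(f"P(chinese) ~= {p_chinese:.3f}")  # prints P(chinese) ~= 0.908
```

Even if the exact number shouldn't be taken too seriously (per the footnote, carving up the hypothesis space differently changes it), the sheer count is what makes "he went to a chinese restaurant" salient here.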

Ok but now suppose you learn that last week, Bob also engaged in some non-restaurant leisure activity. For such leisure activities, the city offers: a thousand movie theaters, a hundred golf courses, and one escape room. So it would’ve been possible to make a similar argument for putting hefty credence on Bob having gone to a movie. But lo, it turns out that actually, Bob went golfing instead, because he likes golf more than movies or escape rooms.

How should you update about the restaurant Bob went to? Well… it’s not clear to me you should update much. Applied to both leisure and to restaurants, the hazy counting argument is trying to be fairly agnostic about Bob’s preferences, while giving some weight to some type of “count.” Trying to be uncertain and agnostic does indeed often mean putting hefty probabilities on things that end up false. But: do you have a better proposed alternative, such that you shouldn’t put hefty probability on “Bob went to a chinese restaurant”, here, because e.g. you learned that hazy counting arguments don’t work when applied to Bob? If so, what is it? And doesn’t it seem like it’s giving the wrong answer?

Or put another way: suppose you didn’t yet know whether SGD overfits or not, but you knew e.g. about the various theoretical problems with unrestricted uses of the indifference principle. What should your probability have been, ex ante, on SGD overfitting? I’m pretty happy to say “hefty,” here. E.g., it’s not clear to me that the problem, re: hefty-probability-on-overfitting, was some a priori problem with hazy-counting-argument-style reasoning. For example: given your philosophical knowledge about the indifference principle, but without empirical knowledge about ML, should you have been super surprised if it turned out that SGD did overfit? I don’t think so. 

Now, you could be making a different, more B-ish sort of argument here: namely, that the fact that SGD doesn’t overfit actively gives us evidence that SGD’s inductive biases also disfavor schemers. This would be akin to having seen Bob, in a different city, actively seek out mexican restaurants despite there being many more chinese restaurants available, such that you now have active evidence that he prefers mexican and is willing to work for it. This wouldn’t be a case of having learned that Bob’s preferences are such that hazy counting arguments “don’t work on Bob” in general. But it would be evidence that Bob prefers non-chinese.

I’m pretty interested in arguments of this form. But I think that pretty quickly, they move into the territory of type (II) arguments above: that is, they start to say something like “we learn, from SGD not overfitting, that it prefers models of type X. Non-scheming models are of type X, schemers are not, so we now know that SGD won’t prefer schemers.”

But what is X? I’m not sure what your answer is (though maybe it will come in a later post). You could say something like “SGD prefers models that are ‘natural’” – but then, are schemers natural in that sense? Or, you could say “SGD prefers models that behave similarly on the training and test distributions” – but in what sense is a schemer violating this standard? On both distributions, a schemer seeks after its schemer-like goal. I’m not saying you can’t make an argument for a good X, here – but I haven’t yet heard it. And I’d want to hear its predictions about non-scheming forms of goal-misgeneralization as well.

Indeed, my understanding is that a quite salient candidate for “X” here is “simplicity” – e.g., that SGD’s not overfitting is explained by its bias towards simpler functions. And this puts us in the territory of the “simplicity argument” above. I.e., we’re now being less agnostic about SGD’s preferences, and instead positing some more particular bias. But there’s still the question of whether this bias favors schemers or not, and the worry is that it does. 

This brings me to your take on simplicity arguments. I agree with you that simplicity arguments are often quite ambiguous about the notion of simplicity at stake (see e.g. my discussion here). And I think they’re weak for other reasons too (in particular, the extra cognitive faff scheming involves seems to me more important than its enabling simpler goals). 

But beyond “what is simplicity anyway,” you also offer some other considerations, other than SGD-not-overfitting, meant to suggest that we have active evidence that SGD’s inductive biases disfavor schemers. I’m not going to dig deep on those considerations here, and I’m looking forward to your future post on the topic. For now, my main reaction is: “we have active evidence that SGD’s inductive biases disfavor schemers” seems like a much more interesting claim/avenue of inquiry than trying to nail down the a priori philosophical merits of counting arguments/indifference principles, and if you believe we have that sort of evidence, I think it’s probably most productive to just focus on fleshing it out and examining it directly. That is, whatever their a priori merits, counting arguments are attempting to proceed from a position of lots of uncertainty and agnosticism, which only makes sense if you’ve got no other good evidence to go on. But if we do have such evidence (e.g., if (3) above is false), then I think it can quickly overcome whatever “prior” counting arguments set (e.g., if you learn that Bob has a special passion for mexican food and hates chinese, you can update far towards him heading to a mexican restaurant). In general, I’m very excited for people to take our best current understanding of SGD’s inductive biases (it’s not my area of expertise), and apply it to p(scheming), and am interested to hear your own views in this respect. But if we have active evidence that SGD’s inductive biases point away from schemers, I think that whether counting arguments are good absent such evidence matters way less, and I, for one, am happy to pay them less attention.

(One other comment re: your take on simplicity arguments: it seems intuitively pretty non-simple to me to fit the training data on the training distribution, and then cut to some very different function on the test data, e.g. the identity function or the constant function. So not sure your parody argument that simplicity also predicts overfitting works. And insofar as simplicity is supposed to be the property had by non-overfitting functions, it seems somewhat strange if positing a simplicity bias predicts over-fitting after all.)

A few other comments

Re: goal realism, it seems like the main argument in the post is something like: 

  1. Michael Huemer says that it’s sometimes OK to use the principle of indifference if you’re applying it to explanatorily fundamental variables. 
  2. But goals won’t be explanatorily fundamental. So the principle of indifference is still bad here.

I haven’t yet heard much reason to buy Huemer’s view, so I’m not sure how much I care about debating whether we should expect goals to satisfy his criteria of fundamentality. But I'll flag that I do feel like there’s a pretty robust way in which explicitly-represented goals appropriately enter into our explanations of human behavior – e.g., I buy a flight to New York because I want to go to New York; I have a representation of that goal and of how my flight-buying achieves it, etc. And it feels to me like your goal reductionism is at risk of not capturing this. (To be clear: I do think that how we understand goal-directedness matters for scheming -- more here -- and that if models are only goal-directed in a pretty deflationary sense, this makes scheming a way weirder hypothesis. But I think that if models are as goal-directed as strategic and agentic humans reasoning about how to achieve explicitly represented goals, their goal-directedness has met a fairly non-deflationary standard.)

I’ll also flag some broader unclarity about the post’s underlying epistemic stance. You rightly note that the strict principle of indifference has many philosophical problems. But it doesn’t feel to me like you’ve given a compelling alternative account of how to reason “on priors” in the sorts of cases where we’re sufficiently uncertain that there’s a temptation to spread one’s credence over many possibilities in the broad manner that principles-of-indifference-ish reasoning attempts to do. 

Thus, for example, how does your epistemology think about a case like “There are 1000 people in this town, one of them is the murderer, what’s the probability that it’s Mortimer P. Snodgrass?” Or: “there are a thousand white rooms, you wake up in one of them, what’s the probability that it’s room number 734?” These aren’t cases like dice, where there’s a random process designed to function in principle-of-indifference-ish ways. But it’s pretty tempting to spread your credence out across the people/rooms (even if in not-fully-uniform ways), in a manner that feels closely akin to the sort of thing that principle-of-indifference-ish reasoning is trying to do. (We can say "just use all the evidence available to you" -- but why should this result in such principle-of-indifference-ish results?)

Your critique of counting arguments would be more compelling to me if you had a fleshed-out account of cases like these -- e.g., one which captures the full range of cases where we’re pulled towards something principle-of-indifference-ish, such that you can then take that account and explain why it shouldn’t point us towards hefty probabilities on schemers, a la the hazy counting argument, even given very little evidence about SGD’s inductive biases.

More to say on all this, and I haven't covered various ways in which I'm sympathetic to/moved by points in the vicinity of the ones you're making here.  But for now: thanks again for writing, looking forward to future installments. 

  1. ^

    Though I do think cases like this can get complicated, and depending on how you carve up the hypothesis space, in some versions "hefty" won't be the right answer. 

Comment by Joe Carlsmith (joekc) on When "yang" goes wrong · 2024-01-11T02:39:52.715Z · LW · GW

Ah nice, thanks!

Comment by Joe Carlsmith (joekc) on On infinite ethics · 2023-12-21T17:57:53.994Z · LW · GW

Hi David -- it's true that I don't engage with your paper (there's a large literature on infinite ethics, and the piece leaves out a lot of it -- and I'm also not sure I had seen your paper at the time I was writing), but re: your comments here on the ethical relevance of infinities: I discuss the fact that the affectable universe is probably finite -- "current science suggests that our causal influence is made finite by things like lightspeed and entropy" -- in section 1 of the essay (paragraph 5), and argue that infinities are still

  1. relevant in practice due to (i) the remaining probability that current physical theories are wrong about our causal influence (and note also related possibilities like having causal influence on whether you go to an infinite-heaven/hell etc, a la pascal) and (ii) due to the possibility of having infinite acausal influence conditional on various plausible-in-my-view decision theories, and
  2. relevant to ethical theory, even if not to day-to-day decision-making, due to (a) ethical theory generally aspiring to cover various physically impossible cases, and (b) the existence of intuitions about infinite cases (e.g., heaven > hell, pareto, etc) that seem prima facie amenable to standard attempts at systematization. 

Comment by Joe Carlsmith (joekc) on New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?" · 2023-11-28T01:27:21.667Z · LW · GW

(Partly re-hashing my response from twitter.)

I'm seeing your main argument here as a version of what I call, in section 4.4, a "speed argument against schemers" -- e.g., basically, that SGD will punish the extra reasoning that schemers need to perform. 

(I’m generally happy to talk about this reasoning as a complexity penalty, and/or about the params it requires, and/or about circuit-depth -- what matters is the overall "preference" that SGD ends up with. And thinking of this consideration as a different kind of counting argument *against* schemers seems like it might well be a productive frame. I do also think that questions about whether models will be bottlenecked on serial computation, and/or whether "shallower" computations will be selected for, are pretty relevant here, and the report includes a rough calculation in this respect in section 4.4.2 (see also summary here).)

Indeed, I think that maybe the strongest single argument against scheming is a combination of 

  1. "Because of the extra reasoning schemers perform, SGD would prefer non-schemers over schemers in a comparison re: final properties of the models" and 
  2. "The type of path-dependence/slack at stake in training is such that SGD will get the model that it prefers overall." 

My sense is that I'm less confident than you in both (1) and (2), but I think they're both plausible (the report, in particular, argues in favor of (1)), and that the combination is a key source of hope. I'm excited to see further work fleshing out the case for both (including e.g. the sorts of arguments for (2) that I took you and Nora to be gesturing at on twitter -- the report doesn't spend a ton of time on assessing how much path-dependence to expect, and of what kind).

Re: your discussion of the "ghost of instrumental reasoning," "deducing lots of world knowledge 'in-context,'" and "the perspective that NNs will 'accidentally' acquire such capabilities internally as a convergent result of their inductive biases" -- especially given that you only skimmed the report's section headings and a small amount of the content, I have some sense, here, that you're responding to other arguments you've seen about deceptive alignment, rather than to specific claims made in the report (I don't, for example, make any claims about world knowledge being derived "in-context," or about models "accidentally" acquiring flexible instrumental reasoning). Is your basic thought something like: sure, the models will develop flexible instrumental reasoning that could in principle be used in service of arbitrary goals, but they will only in fact use it in service of the specified goal, because that's the thing training pressures them to do? If so, my feeling is something like: ok, but a lot of the question here is whether using the instrumental reasoning in service of some other goal (one that backchains into getting-reward) will be suitably compatible with/incentivized by training pressures as well. And I don't see e.g. the reversal curse as strong evidence on this front.

Re: "mechanistically ungrounded intuitions about 'goals' and 'tryingness'" -- as I discuss in section 0.1, the report is explicitly setting aside disputes about whether the relevant models will be well-understood as goal-directed (my own take on that is in section 2.2.1 of my report on power-seeking AI here). The question in this report is whether, conditional on goal-directedness, we should expect scheming. That said, I do think that what I call the "messyness" of the relevant goal-directedness might be relevant to our overall assessment of the arguments for scheming in various ways, and that scheming might require an unusually high standard of goal-directedness in some sense. I discuss this in section 2.2.3, on "'Clean' vs. 'messy' goal-directedness," and in various other places in the report.

Re: "long term goals are sufficiently hard to form deliberately that I don't think they'll form accidentally" -- the report explicitly discusses cases where we intentionally train models to have long-term goals (both via long episodes, and via short episodes aimed at inducing long-horizon optimization). I think scheming is more likely in those cases. See section 2.2.4, "What if you intentionally train the model to have long-term goals?" That said, I'd be interested to see arguments that credit assignment difficulties actively count against the development of beyond-episode goals (whether in models trained on short episodes or long episodes) for models that are otherwise goal-directed. And I do think that, if we could be confident that models trained on short episodes won't learn beyond-episode goals accidentally (even irrespective of mundane adversarial training -- e.g., that models rewarded for getting gold coins on the episode would not learn a goal that generalizes to caring about gold coins in general, even prior to efforts to punish it for sacrificing gold-coins-on-the-episode for gold-coins-later), that would be a significant source of comfort (I discuss some possible experimental directions in this respect in section 6.2).

Comment by Joe Carlsmith (joekc) on Situational awareness (Section 2.1 of “Scheming AIs”) · 2023-11-28T00:04:01.062Z · LW · GW

I agree that AIs only optimizing for good human ratings on the episode (what I call "reward-on-the-episode seekers") have incentives to seize control of the reward process, that this is indeed dangerous, and that in some cases it will incentivize AIs to fake alignment in an effort to seize control of the reward process on the episode (I discuss this in the section on "non-schemers with schemer-like traits"). However, I also think that reward-on-the-episode seekers are substantially less scary than schemers in my sense, for reasons I discuss here (i.e., reasons to do with what I call "responsiveness to honest tests," the ambition and temporal scope of their goals, and their propensity to engage in various forms of sandbagging and what I call "early undermining"). And this is especially so for reward-on-the-episode seekers with fairly short episodes, where grabbing control over the reward process may not be feasible on the relevant timescales.

Comment by Joe Carlsmith (joekc) on Situational awareness (Section 2.1 of “Scheming AIs”) · 2023-11-27T23:54:34.916Z · LW · GW

Agree that it would need to have some conception of the type of training signal to optimize for, that it will do better in training the more accurate its picture of the training signal is, and that this provides an incentive to self-locate more accurately (though not necessarily to the degree at stake in, e.g., knowing what server you're running on).

Comment by Joe Carlsmith (joekc) on New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?" · 2023-11-27T23:08:23.169Z · LW · GW

The question of how strongly training pressures models to minimize loss is one that I isolate and discuss explicitly in the report, in section 1.5, "On 'slack' in training" -- and at various points the report references how differing levels of "slack" might affect the arguments it considers. Here I was influenced in part by discussions with various people, yourself included, who seemed to disagree about how much weight to put on arguments in the vein of: "policy A would get lower loss than policy B, so we should think it more likely that SGD selects policy A than policy B." 

(And for clarity, I don't think that arguments of this form always support expecting models to do tons of reasoning about the training set-up. For example, as the report discusses in e.g. Section 4.4, on "speed arguments," the amount of world-modeling/instrumental-reasoning that the model does can affect the loss it gets via e.g. using up cognitive resources. So schemers -- and also, reward-on-the-episode seekers -- can be at some disadvantage, in this respect, relative to models that don't think about the training process at all.)

Comment by Joe Carlsmith (joekc) on New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?" · 2023-11-27T22:44:02.254Z · LW · GW

Agents that end up intrinsically motivated to get reward on the episode would be "terminal training-gamers/reward-on-the-episode seekers," and not schemers, on my taxonomy. I agree that terminal training-gamers can also be motivated to seek power in problematic ways (I discuss this in the section on "non-schemers with schemer-like traits"), but I think that schemers proper are quite a bit scarier than reward-on-the-episode seekers, for reasons I describe here.

Comment by Joe Carlsmith (joekc) on Understanding and controlling auto-induced distributional shift · 2023-08-21T07:05:55.992Z · LW · GW

I found this post a very clear and useful summary -- thanks for writing. 

Comment by Joe Carlsmith (joekc) on Predictable updating about AI risk · 2023-05-12T19:42:54.197Z · LW · GW

Re: "0.000002 would be one in five hundred thousand, but with the percent sign it's one in fifty million." -- thanks, edited. 
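For readers skimming past the quoted correction: the slip is the standard factor-of-100 percent-sign trap. A minimal sketch of my own (using the "one in five hundred thousand" / "one in fifty million" figures from the correction):

```python
# Factor-of-100 trap: the same digits read with and without a percent
# sign differ by two orders of magnitude in the implied odds.

def one_in_n(p: float) -> float:
    """Express a probability p as '1 in N' odds."""
    return 1.0 / p

p_bare = 0.000002        # read as a bare probability
p_pct = 0.000002 / 100   # the same digits read as a percentage

print(f"1 in {one_in_n(p_bare):,.0f}")   # about 1 in 500,000
print(f"1 in {one_in_n(p_pct):,.0f}")    # about 1 in 50,000,000
```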

Re: volatility -- thanks, that sounds right to me, and like a potentially useful dynamic to have in mind. 

Comment by Joe Carlsmith (joekc) on Why should ethical anti-realists do ethics? · 2023-02-17T17:54:05.897Z · LW · GW

Oops! You're right, this isn't the right formulation of the relevant principle. Will edit to reflect. 

Comment by Joe Carlsmith (joekc) on Utilitarianism Meets Egalitarianism · 2023-01-14T01:36:07.597Z · LW · GW

Really appreciated this sequence overall, thanks for writing.

Comment by Joe Carlsmith (joekc) on Strong Evidence is Common · 2023-01-12T02:48:45.873Z · LW · GW

I really like this post. It's a crisp, useful insight, made via a memorable concrete example (plus a few others), in a very efficient way. And it has stayed with me. 

Comment by Joe Carlsmith (joekc) on On sincerity · 2022-12-28T20:03:24.319Z · LW · GW

Thanks for these thoughtful comments, Paul. 

  • I think the account you offer here is a plausible tack re: unification — I’ve added a link to it in the “empirical approaches” section. 
  • “Facilitates a certain flavor of important engagement in the vicinity of persuasion, negotiation and trade” is a helpful handle, and another strong sincerity association for me (cf "a space that feels ready to collaborate, negotiate, figure stuff out, make stuff happen"). 
  • I agree that it’s not necessarily desirable for sincerity (especially in your account’s sense) to permeate your whole life (though on my intuitive notion, it’s possible for some underlying sincerity to co-exist with things like play, joking around, etc), and that you can’t necessarily get to sincerity by some simple move like “just letting go of pretense.” 
  • This stuff about encouraging more effective delusion by probing for sincerity via introspection is interesting, as are these questions about whether I’m underestimating the costs of sincerity. In this latter respect, maybe worth distinguishing the stakes of “thoughts in the privacy of your own head" (e.g., questions about the value of self-deception, non-attention to certain things, etc) from more mundane costs re: e.g., sincerity takes effort, it’s not always the most fun thing, and so on. Sounds like you've got the former especially in mind, and they seem like the most salient source of possible disagreement. I agree it's a substantive question how the trade-offs here shake out, and at some point would be curious to hear more about your take.
Comment by Joe Carlsmith (joekc) on On expected utility, part 2: Why it can be OK to predictably lose · 2022-12-24T17:25:58.648Z · LW · GW

Glad to hear you liked it :)

Comment by Joe Carlsmith (joekc) on Can you control the past? · 2022-05-08T22:26:55.681Z · LW · GW

:) -- nice glasses

Comment by Joe Carlsmith (joekc) on On expected utility, part 2: Why it can be OK to predictably lose · 2022-03-23T19:50:23.732Z · LW · GW

Oops! Yep, thanks for catching. 

Comment by Joe Carlsmith (joekc) on On expected utility, part 3: VNM, separability, and more · 2022-03-22T20:22:53.123Z · LW · GW

Thanks! Fixed.

Comment by Joe Carlsmith (joekc) on The innocent gene · 2022-02-14T22:33:11.692Z · LW · GW

Yeah, as I say, I think "neither innocent nor guilty" is least misleading -- but I find "innocent" an evocative frame. Do you have suggestions for an alternative to "selfish"?

Comment by Joe Carlsmith (joekc) on The ignorance of normative realism bot · 2022-01-20T05:21:13.999Z · LW · GW

Is the argument here supposed to be particular to meta-normativity, or is it something more like "I generally think that there are philosophy facts, those seem kind of a priori-ish and not obviously natural/normal, so maybe a priori normative facts are OK too, even if we understand neither of them"? 

Re: meta-philosophy, I tend to see philosophy as fairly continuous with just "good, clear thinking" and "figuring out how stuff hangs together," but applied in a very general way that includes otherwise confusing stuff. I agree various philosophical domains feel pretty a priori-ish, and I don't have a worked out view of a priori knowledge, especially synthetic a priori knowledge (I tend to expect us to be able to give an account of how we get epistemic access to analytic truths). But I think I basically want to make the same demands of other a priori-ish domains that I do of normativity. That is, I want the right kind of explanatory link between our belief formation and the contents of the domain -- which, for "realist" construals of the domain, I expect to require that the contents of the domain play some role in explaining our beliefs. 

Re: the relationship between meta-normativity and normativity in particular, I wonder if a comparison to the relationship between "meta-theology" and "theology" might be instructive here. I feel like I want to be fairly realist about certain "meta-theological facts" like "the God of Christianity doesn't exist" (maybe this is just a straightforward theological fact?). But this doesn't tempt me towards realism about God. Maybe talking about normative "properties" instead of normative facts would be easier here, since one can imagine e.g. a nihilist denying the existence of normative properties, but accepting some 'normative' (meta-normative?) facts like "there is no such thing as goodness" or "pleasure is not good."

Comment by Joe Carlsmith (joekc) on Reviews of “Is power-seeking AI an existential risk?” · 2022-01-03T05:58:04.909Z · LW · GW

Reviewers ended up on the list via different routes. A few we solicited specifically because we expected them to have relatively well-developed views that disagree with the report in one direction or another (e.g., more pessimistic, or more optimistic), and we wanted to understand the best objections in this respect. A few came from trying to get information about how generally thoughtful folks with different backgrounds react to the report. A few came from sending a note to GPI saying we were open to GPI folks providing reviews. And a few came via other miscellaneous routes. I’d definitely be interested to see more reviews from mainstream ML researchers, but understanding how ML researchers in particular react to the report wasn’t our priority here.

Comment by Joe Carlsmith (joekc) on Reviews of “Is power-seeking AI an existential risk?” · 2021-12-21T04:44:14.725Z · LW · GW

Cool, these comments helped me get more clarity about where Ben is coming from. 

Ben, I think the conception of planning I’m working with is closest to your “loose” sense. That is, roughly put, I think of planning as happening when (a) something like simulations are happening, and (b) the output is determined (in the right way) at least partly on the basis of those simulations (this definition isn’t ideal, but hopefully it’s close enough for now). Whereas it sounds like you think of (strict) planning as happening when (a) something like simulations are happening, and (c) the agent’s overall policy ends up different (and better) as a result. 

What’s the difference between (b) and (c)? One operationalization could be: if you gave an agent input 1, then let it do its simulations thing and produce an output, then gave it input 1 again, could the agent’s performance improve, on this round, in virtue of the simulation-running that it did on the first round? On my model, this isn’t necessary for planning; whereas on yours, it sounds like it is? 

Let’s say this is indeed a key distinction. If so, let’s call my version “Joe-planning” and your version “Ben-planning.” My main point re: feedforward neural networks was that they could do Joe-planning in principle, which it sounds like you think at least conceivable. I agree that it seems tough for shallow feedforward networks to do much Joe-planning in practice. I also grant that when humans plan, they are generally doing Ben-planning in addition to Joe-planning (e.g., they’re generally in a position to do better on a given problem in virtue of having planned about that same problem yesterday).
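One crude way to render the (b)-vs-(c) operationalization in code (entirely my own toy construction, not anything from the report or review): a Joe-planner's simulations determine each output but leave the agent unchanged, while a Ben-planner's simulations also persist as a changed policy.

```python
# Toy contrast: both planners "simulate" to choose an action, but only
# the Ben-planner's policy is different as a result of having planned.

def simulate(x: int) -> int:
    """Stand-in for an expensive simulation/rollout step."""
    return x * 2

class JoePlanner:
    # (a) simulations happen, and (b) they determine the output --
    # but nothing persists, so round two on the same input is no better.
    def act(self, x: int) -> int:
        return simulate(x)

class BenPlanner:
    # (a) simulations happen, and (c) the overall policy ends up
    # different: results of past planning are folded into stored state.
    def __init__(self) -> None:
        self.policy: dict[int, int] = {}

    def act(self, x: int) -> int:
        if x not in self.policy:
            self.policy[x] = simulate(x)  # planning changes the policy
        return self.policy[x]             # later rounds reuse it
```

On this caricature, giving the Ben-planner input 1 twice leaves it with a changed policy after the first round, whereas the Joe-planner is the same agent before and after.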

Seems like key questions re: the connection to AI X-risk include:

  1. Is there reason to think a given type of planning especially dangerous and/or relevant to the overall argument for AI X-risk?
  2. Should we expect that type of planning to be necessary for various types of task performance?

Re: (1), I do think Ben-planning poses dangers that Joe-planning doesn’t. Notably, Ben-planning does indeed allow a system to improve/change its policy "on its own" and without new data, whereas Joe-planning need not — and this seems more likely to yield unexpected behavior. This seems continuous, though, with the fact that a Ben-planning agent is learning/improving its capabilities in general, which I flag separately as an important risk factor.

Another answer to (1), suggested by some of your comments, could appeal to the possibility that agents are more dangerous when you can tweak a single simple parameter like “how much time they have to think” or “search depth” and thereby get better performance (this feels related to Eliezer’s worries about “turning up the intelligence dial” by “running it with larger bounds on the for-loops”). I agree that if you can just “turn up the intelligence dial,” that is quite a bit more worrying than if you can’t — but I think this is fairly orthogonal to the Joe-planning vs. Ben-planning distinction. For example, I think you can have Joe-planning agents where you can increase e.g. their search depth by tweaking a single parameter, and you can have Ben-planning agents where the parameters you’d need to tweak aren’t under your control (or the agent’s control), but rather are buried inside some tangled opaque neural network you don't understand.

The central reason I'm interested in Joe-planning, though, is that I think the instrumental convergence argument makes the most sense if Joe-planning is involved -- e.g., if the agent is running simulations that allow it to notice and respond to incentives to seek power (there are versions of the argument that don't appeal to Joe-planning, but I like these less -- see discussion in footnote 87 here). It's true that you can end up power-seeking-ish via non-Joe-planning paths (for example, if in training you developed sphex-ish heuristics that favor power-seeking-ish actions); but when I actually imagine AI systems that end up power-seeking, I imagine it happening because they noticed, in the course of modeling the world in order to achieve their goals, that power-seeking (even in ways humans wouldn't like) would help.

Can this happen without Ben-planning? I think it can. Suppose, for example, that none of your previous Joe-planning models were power-seeking. Then, you train a new Joe-planner, who can run more sophisticated simulations. On some inputs, this Joe-planner realizes that power-seeking is advantageous, and goes for it (or starts deceiving you, or whatever).

Re: (2), for the reasons discussed in section 3.1, I tend to see Joe-planning as pretty key to lots of task-performance — though I acknowledge that my intuitions are surprised by how much it looks like you can do via something more intuitively “sphexish.” And I acknowledge that some of those arguments may apply less to Ben-planning. I do think this is some comfort, since agents that learn via planning are indeed scarier. But I am separately worried that ongoing learning will be very useful/incentivized, too.

Comment by Joe Carlsmith (joekc) on Reviews of “Is power-seeking AI an existential risk?” · 2021-12-21T04:10:26.910Z · LW · GW

I’m glad you think it’s valuable, Ben — and thanks for taking the time to write such a thoughtful and detailed review. 

“I’m sympathetic to the possibility that the high level of conjunctiveness here created some amount of downward bias, even if the argument does actually have a highly conjunctive structure.” 

Yes, I am too. I’m thinking about the right way to address this going forward. 

I’ll respond re: planning in the thread with Daniel.

Comment by Joe Carlsmith (joekc) on Reviews of “Is power-seeking AI an existential risk?” · 2021-12-17T19:05:05.658Z · LW · GW

(Note that my numbers re: short-horizon systems + 12 OOMs being enough, and for +12 OOMs in general, changed since an earlier version you read, to 35% and 65% respectively.)

Comment by Joe Carlsmith (joekc) on Reviews of “Is power-seeking AI an existential risk?” · 2021-12-17T06:14:18.613Z · LW · GW

Thanks for these comments.

that suggests that CrystalNights would work, provided we start from something about as smart as a chimp. And arguably OmegaStar would be about as smart as a chimp - it would very likely appear much smarter to people talking with it, at least.

"starting with something as smart as a chimp" seems to me like where a huge amount of the work is being done, and if Omega-star --> Chimp-level intelligence, it seems a lot less likely we'd need to resort to re-running evolution-type stuff. I also don't think "likely to appear smarter than a chimp to people talking with it" is a good test, given that e.g. GPT-3 (2?) would plausibly pass, and chimps can't talk. 

"Do you not have upwards of 75% credence that the GPT scaling trends will continue for the next four OOMs at least? If you don't, that is indeed a big double crux." -- Would want to talk about the trends in question (and the OOMs -- I assume you mean training FLOP OOMs, rather than params?). I do think various benchmarks are looking good, but consider e.g. the recent Gopher paper:

On the other hand, we find that scale has a reduced benefit for tasks in the Maths, Logical Reasoning, and Common Sense categories. Smaller models often perform better across these categories than larger models. In the cases that they don’t, larger models often don’t result in a performance increase. Our results suggest that for certain flavours of mathematical or logical reasoning tasks, it is unlikely that scale alone will lead to performance breakthroughs. In some cases Gopher has a lower performance than smaller models – examples of which include Abstract Algebra and Temporal Sequences from BIG-bench, and High School Mathematics from MMLU.

(Though in this particular case, re: math and logical reasoning, there are also other relevant results to consider, e.g. this and this.) 

It seems like "how likely is it that continuation of GPT scaling trends on X-benchmarks would result in APS-systems" is probably a more important crux, though?

Re: your premise 2, I had (wrongly, and too quickly) read this as claiming "if you have X% on +12 OOMs, you should have at least 1/2*X% on +6 OOMs," and log-uniformity was what jumped to mind as what might justify that claim. I have a clearer sense of what you were getting at now, and I accept something in the vicinity if you say 80% on +12 OOMs (will edit accordingly). My +12 number is lower, though, which makes it easier to have a flatter distribution that puts more than half of the +12 OOM credence above +6. 
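As a concrete illustration of the log-uniformity point (my own toy numbers, not anyone's actual credences): if your credence over "OOMs of extra compute needed" were uniform between 0 and 12 on the OOM scale (i.e., log-uniform over compute), then exactly half of your "+12 OOMs suffices" credence would sit at or below +6; a distribution tilted toward the high end puts more than half of it above +6.

```python
import numpy as np

# Monte Carlo check of the log-uniform case: OOMs are already a log
# scale, so "log-uniform over compute" = uniform over OOMs in [0, 12].
rng = np.random.default_rng(0)
ooms_needed = rng.uniform(0.0, 12.0, size=1_000_000)

p_12_enough = np.mean(ooms_needed <= 12.0)  # 1.0 by construction
p_6_enough = np.mean(ooms_needed <= 6.0)    # ~0.5: half the +12 mass
```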

The difference between 20% and 50% on APS-AI by 2030 seems like it could well be decision-relevant to me (and important, too, if you think that risk is a lot higher in short-timelines worlds). 

Comment by Joe Carlsmith (joekc) on On the limits of idealized values · 2021-12-08T01:22:47.334Z · LW · GW

I haven't given a full account of my views of realism anywhere, but briefly, I think that the realism the realists-at-heart want is a robust non-naturalist realism, a la David Enoch, and that this view implies:

  1. an inflationary metaphysics that it just doesn't seem like we have enough evidence for,
  2. an epistemic challenge (why would we expect our normative beliefs to correlate with the non-natural normative facts?) that realists have basically no answer to except "yeah idk but maybe this is a problem for math and philosophy too?" (Enoch's chapter 7 covers this issue; I also briefly point at it in this section, in talking about why the realist bot would expect its desires and intuitions to correlate with the contents of the envelope buried in the mountain), and
  3. an appeal to a non-natural realm that a lot of realists take as necessary to capture the substance and heft of our normative lives, but which I don't think is necessary for this, at least when it comes to caring (I think moral "authority" and "bindingness regardless of what you care about" might be a different story, but one that "the non-natural realm says so" doesn't obviously help with, either). I wrote up my take on this issue here.

Also, most realists are externalists, and I think that externalist realism severs an intuitive connection between normativity and motivation that I would prefer to preserve (though this is more of an "I don't like that" than a "that's not true" objection). I wrote about this here.

There are various ways of being a "naturalist realist," too, but the disagreement between naturalist realism and anti-realism/subjectivism/nihilism is, in my opinion, centrally a semantic one. The important question is whether anything normativity-flavored is in a deep sense something over and above the standard naturalist world picture. Once we've denied that, we're basically just talking about how to use words to describe that standard naturalist world picture. I wrote a bit about how I think of this kind of dialectic here.

This is a familiar dialectic in philosophical debates about whether some domain X can be reduced to Y (meta-ethics is a salient comparison to me). The anti-reductionist (A) will argue that our core intuitions/concepts/practices related to X make clear that it cannot be reduced to Y, and that since X must exist (as we intuitively think it does), we should expand our metaphysics to include more than Y. The reductionist (R) will argue that X can in fact be reduced to Y, and that this is compatible with our intuitions/concepts/everyday practices with respect to X, and hence that X exists but it’s nothing over and above Y. The nihilist (N), by contrast, agrees with A that it follows from our intuitions/concepts/practices related to X that it cannot be reduced to Y, but agrees with R that there is in fact nothing over and above Y, and so concludes that there is no X, and that our intuitions/concepts/practices related to X are correspondingly misguided. Here, the disagreement between A vs. R/N is about whether more than Y exists; the disagreement between R vs. A/N is about whether a world of only Y “counts” as a world with X. This latter often begins to seem a matter of terminology; the substantive questions have already been settled.

There's a common strain of realism in utilitarian circles that tries to identify "goodness" with something like "valence," treats "valence" as a "phenomenal property," and then tries to appeal to our "special direct epistemic access" to phenomenal consciousness in order to solve the epistemic challenge above. I think this doesn't help at all (the basic questions about how the non-natural realm interacts with the natural one remain unanswered -- and this is a classic problem for non-physicalist theories of consciousness as well), but that it gets its appeal centrally via running through people's confusion/mystery relationship with phenomenal consciousness, which muddies the issue enough to make it seem like the move might help. I talk about issues in this vein a bit in the latter half of my podcast with Gus Docker.

Re: your list of 6 meta-ethical options, I'd be inclined to pull apart the questions of 

  • (a) do any normative facts exist, and if so, which ones, vs.
  • (b) what's the empirical situation with respect to deliberation within agents and disagreement across agents (e.g., do most agents agree and if so why; how sensitive is the deliberation of a given agent to initial conditions, etc).

With respect to (a), my take is closest to 6 ("there aren't any normative facts at all") if the normative facts are construed in a non-naturalist way, and closest to "whatever, it's mostly a terminology dispute at this point" if the normative facts are construed in a naturalist way (though if we're doing the terminology dispute, I'm generally more inclined towards naturalist realism over nihilism). Facts about what's "rational" or "what decision theory wins" fall under this response as well (I talk about this a bit here).

With respect to (b), my first pass take is "I dunno, it's an empirical question," but if I had to guess, I'd guess lots of disagreement between agents across the multiverse, and a fair amount of sensitivity to initial conditions on the part of individual deliberators. 

Re: my ghost, it starts out valuing status as much as I do, but it's in a bit of a funky situation insofar as it can't get normal forms of status for itself because it's beyond society. It can, if it wants, try for some weirder form of cosmic status amongst hypothetical peers ("what they would think if they could see me now!"), or it can try to get status for the Joe that it left behind in the world, but my general feeling is that the process of stepping away from Joe and looking at the world as a whole tends to reduce its investment in what happens to Joe in particular, e.g.

Perhaps, at the beginning, the ghost is particularly interested in Joe-related aspects of the world. Fairly soon, though, I imagine it paying more and more attention to everything else. For while the ghost retains a deep understanding of Joe, and a certain kind of care towards him, it is viscerally obvious, from the ghost’s perspective, unmoored from Joe’s body, that Joe is just one creature among so many others; Joe’s life, Joe’s concerns, once so central and engrossing, are just one tiny, tiny part of what’s going on.

That said, insofar as the ghost is giving recommendations to me about what to do, it can definitely take into account the fact that I want status to whatever degree, and am otherwise operating in the context of social constraints, coordination mechanisms, etc. 

Comment by Joe Carlsmith (joekc) on On the limits of idealized values · 2021-12-02T07:20:51.032Z · LW · GW

In the past, I've thought of idealizing subjectivism as something like an "interim meta-ethics," in the sense that it was a meta-ethic I expected to do OK conditional on each of the three meta-ethical views discussed here, e.g.:

  1. Internalist realism (value is independent of your attitudes, but your idealized attitudes always converge on it)
  2. Externalist realism (value is independent of your attitudes, but your idealized attitudes don't always converge on it)
  3. Idealizing subjectivism (value is determined by your idealized attitudes)

The thought was that on (1), idealizing subjectivism tracks the truth. On (2), maybe you're screwed even post-idealization, but whatever idealization process you were going to do was your best shot at the truth anyway. And on (3), idealizing subjectivism is just true. So, you don't go too far wrong as an idealizing subjectivist. (Though note that we can run similar lines of argument for using internalist or externalist forms of realism as the "interim meta-ethics." The basic dynamic here is just that, regardless of what you think about (1)-(3), doing your idealization procedures is the only thing you know how to do, so you should just do it.)

I still feel some sympathy towards this, but I've also since come to view attempts at meta-ethical agnosticism of this kind as much less innocent and straightforward than this picture hopes. In particular, I feel like I see meta-ethical questions interacting with object-level moral questions, together with other aspects of philosophy, at tons of different levels (see e.g. here, here, and here for a few discussions), so it has felt correspondingly important to just be clear about which view is most likely to be true. 

Beyond this, though, for the reasons discussed in this post, I've also become clearer in my skepticism that "just do your idealization procedure" is some well-defined thing that we can just take for granted. And I think that once we double click on it, we actually get something that looks less like any of 1-3, and more like the type of active, existentialist-flavored thing I tried to point at in Sections X and XI

Re: functional roles of morality, one thing I'll flag here is that in my view, the most fundamental meta-ethical questions aren't about morality per se, but rather are about practical normativity more generally (though in practice, many people seem most pushed towards realism by moral questions in particular, perhaps due to the types of "bindingness" intuitions I try to point at here -- intuitions that I don't actually think realism on its own helps with).

Should you think of your idealized self as existing in a context where morality still plays these (and other) functional roles? As with everything about your idealization procedure, on my picture it's ultimately up to you. Personally, I tend to start by thinking about individual ghost versions of myself who can see what things are like in lots of different counterfactual situations (including, e.g., situations where morality plays different functional roles, or in which I am raised differently), but who are in some sense "outside of society," and who therefore aren't doing much in the way of direct signaling, group coordination, etc. That said, these ghost version selves start with my current values, which have indeed resulted from my being raised in environments where morality is playing roles of the kind you mentioned.

Comment by Joe Carlsmith (joekc) on SIA > SSA, part 1: Learning from the fact that you exist · 2021-10-01T09:49:44.256Z · LW · GW

Glad you liked it :). I haven’t spent much time engaging with UDASSA — or with a lot other non-SIA/SSA anthropic theories — at this point, but UDASSA in particular is on my list to understand better. Here I wanted to start with the first-pass basics.

Comment by Joe Carlsmith (joekc) on Can you control the past? · 2021-09-21T17:27:11.787Z · LW · GW

Yes, edited :)

Comment by Joe Carlsmith (joekc) on The Adventure: a new Utopia story · 2021-09-17T23:20:11.352Z · LW · GW

I appreciated this, especially given how challenging this type of exercise can be. Thanks for writing.

Comment by Joe Carlsmith (joekc) on Distinguishing AI takeover scenarios · 2021-09-13T17:53:24.238Z · LW · GW

Rohin is correct. In general, I meant for the report's analysis to apply to basically all of these situations (e.g., both inner and outer-misaligned, both multi-polar and unipolar, both fast take-off and slow take-off), provided that the misaligned AI systems in question ultimately end up power-seeking, and that this power-seeking leads to existential catastrophe. 

It's true, though, that some of my discussion was specifically meant to address the idea that absent a brain-in-a-box-like scenario, we're fine. Hence the interest in e.g. deployment decisions, warning shots, and corrective mechanisms.

Comment by Joe Carlsmith (joekc) on Can you control the past? · 2021-08-28T08:29:41.521Z · LW · GW

Thanks!

Comment by Joe Carlsmith (joekc) on MIRI/OP exchange about decision theory · 2021-08-27T20:36:21.881Z · LW · GW

Mostly personal interest on my part (I was working on a blog post on the topic, now up), though I do think that the topic has broader relevance.

Comment by Joe Carlsmith (joekc) on Thoughts on being mortal · 2021-08-05T07:59:57.304Z · LW · GW

I think this could've been clearer: it's been a bit since I wrote this/read the book, but I don't think I meant to imply that "some forms of hospice do prolong life at extreme costs to its quality" (though the sentence does read that way); more that some forms of medical treatment prolong life at extreme cost to its quality, and Gawande discusses hospice as an alternative.

Comment by Joe Carlsmith (joekc) on Actually possible: thoughts on Utopia · 2021-07-31T01:55:31.884Z · LW · GW

Glad to hear it :)

Comment by Joe Carlsmith (joekc) on On the limits of idealized values · 2021-06-24T07:09:20.732Z · LW · GW

I agree that there are other meta-ethical options, including ones that focus more on groups, cultures, agents in general, and so on, rather than individual agents (an earlier draft had a brief reference to this). And I think it's possible that some of these are in a better position to make sense of certain morality-related things, especially obligation-flavored ones, than the individually-focused subjectivism considered here (I gesture a little at something in this vicinity at the end of this post). I wanted a narrower focus in this post, though.

Comment by Joe Carlsmith (joekc) on On the limits of idealized values · 2021-06-24T07:00:25.706Z · LW · GW

Thanks :). I didn't mean for the ghost section to imply that the ghost civilization solves the problems discussed in the rest of the post re: e.g. divergence, meta-divergence, and so forth. Rather, the point was that taking responsibility for making the decision yourself (this feels closely related to "making peace with your own agency"), in consultation with/deference towards whatever ghost civilizations etc you want, changes the picture relative to e.g. requiring that there be some particular set of ghosts that already defines the right answer.

Comment by Joe Carlsmith (joekc) on On the limits of idealized values · 2021-06-24T06:51:32.295Z · LW · GW

Glad you liked it, and thanks for sharing the Bakker piece -- I found it evocative.

Comment by Joe Carlsmith (joekc) on On the limits of idealized values · 2021-06-24T06:49:24.850Z · LW · GW

I agree that it's a useful heuristic, and the "baby steps" idealization you describe seems to me like a reasonable version to have in mind and to defer to over ourselves (including re: how to continue idealizing). I also appreciate that your 2012 post actually went through and sketched a process in that amount of depth/specificity.

Comment by Joe Carlsmith (joekc) on Draft report on existential risk from power-seeking AI · 2021-05-07T17:59:03.472Z · LW · GW

Hi Koen, 

Glad to hear you liked section 4.3.3. And thanks for pointing to these posts -- I certainly haven't reviewed all the literature, here, so there may well be reasons for optimism that aren't sufficiently salient to me.

Re: black boxes, I do think that black-box systems that emerge from some kind of evolution/search process are more dangerous; but as I discuss in 4.4.1, I also think that the bare fact that the systems are much more cognitively sophisticated than humans creates significant and safety-relevant barriers to understanding, even if the system has been designed/mechanistically understood at a different level.

Re: “there is a whole body of work which shows that evolved systems are often power-seeking” -- anything in particular you have in mind here?

Comment by Joe Carlsmith (joekc) on Draft report on existential risk from power-seeking AI · 2021-05-07T17:52:24.058Z · LW · GW

Hi Daniel, 

Thanks for taking the time to clarify. 

One other factor for me, beyond those you quote, is the “absolute” difficulty of ensuring practical PS-alignment, e.g. (from my discussion of premise 3):

Part of this uncertainty has to do with the “absolute” difficulty of achieving practical PS-alignment, granted that you can build APS systems at all. A system’s practical PS-alignment depends on the specific interaction between a number of variables -- notably, its capabilities (which could themselves be controlled/limited in various ways), its objectives (including the time horizon of the objectives in question), and the circumstances it will in fact be exposed to (circumstances that could involve various physical constraints, monitoring mechanisms, and incentives, bolstered in power by difficult-to-anticipate future technology, including AI technology). I expect problems with proxies and search to make controlling objectives harder; and I expect barriers to understanding (along with adversarial dynamics, if they arise pre-deployment) to exacerbate difficulties more generally; but even so, it also seems possible to me that it won’t be “that hard” (by the time we can build APS systems at all) to eliminate many tendencies towards misaligned power-seeking (for example, it seems plausible to me that selecting very strongly against (observable) misaligned power-seeking during training goes a long way), conditional on retaining realistic levels of control over a system’s post-deployment capabilities and circumstances (though how often one can retain this control is a further question).

My sense is that relative to you, I am (a) less convinced that ensuring practical PS-alignment will be “hard” in this absolute sense, once you can build APS systems at all (my sense is that our conceptions of what it takes to “solve the alignment problem” might be different), (b) less convinced that practically PS-misaligned systems will be attractive to deploy despite their PS-misalignment (whether because of deception, or for other reasons), (c) less convinced that APS systems becoming possible/incentivized by 2035 implies “fast take-off” (it sounds like you’re partly thinking: those are worlds where something like the scaling hypothesis holds, and so you can just keep scaling up; but I don’t think the scaling hypothesis holding to an extent that makes some APS systems possible/financially feasible implies that you can just scale up quickly to systems that can perform at strongly superhuman levels on e.g. ~any task, whatever the time horizons, data requirements, etc), and (d) more optimistic about something-like-present-day-humanity’s ability to avoid/prevent failures at a scale that disempowers ~all of humanity (though I do think Covid, and its politicization, is an instructive example in this respect), especially given warning shots (and my guess is that we do get warning shots, whether before or after 2035, even if APS systems become possible/financially feasible before then).

Re: nuclear winter, as I understand it, you’re reading me as saying: “in general, if a possible and incentivized technology is dangerous, there will be warning shots of the dangers; humans (perhaps reacting to those warning shots) won’t deploy at a level that risks the permanent extinction/disempowerment of ~all humans; and if they start to move towards such disempowerment/extinction, they’ll take active steps to pull back.” And your argument is: “if you get to less than 10% doom on this basis, you’re going to give too low probabilities on scenarios like nuclear winter in the 20th century.” 

I don’t think of myself as leaning heavily on an argument at that level of generality (though maybe there’s a bit of that). For example, that statement feels like it’s missing the “maybe ensuring practical PS-alignment just isn’t that hard, especially relative to building practically PS-misaligned systems that are at least superficially attractive to deploy” element of my own picture. And more generally, I expect to say different things about e.g. biorisk, climate change, nanotech, etc, depending on the specifics, even if generic considerations like “humans will try not to all die” apply to each.

Re: nuclear winter in particular, I’d want to think a bit more about what sort of probability I’d put on nuclear winter in the 20th century (one thing your own analysis skips is the probability that a large nuclear conflict injects enough smoke into the stratosphere to actually cause nuclear winter, which I don’t see as guaranteed -- and we’d need to specify what level of cooling counts). And nuclear winter on its own doesn’t include a “scaling to the permanent disempowerment/extinction of ~all of humanity” step -- a step that, FWIW, I see as highly questionable in the nuclear winter case, and which is important to my own probability on AI doom (see premise 5). And there are various other salient differences: for example, mutually assured destruction seems like a distinctly dangerous type of dynamic, which doesn’t apply to various AI deployment scenarios; nuclear weapons have widespread destruction as their explicit function, whereas most AI systems won’t; and so on. That said, I think comparisons in this vein could still be helpful; and I’m sympathetic to points in the vein of “looking at the history of e.g. climate, nuclear risk, BSL-4 accidents, etc., the probability that humans will deploy technology that risks global catastrophe, and not stop doing so even after getting significant evidence about the risks at stake, can’t be that low” (I talk about this a bit in 4.4.3 and 6.2).

Comment by Joe Carlsmith (joekc) on Draft report on existential risk from power-seeking AI · 2021-05-01T00:58:04.050Z · LW · GW

Thanks for reading, and for your comments on the doc. I replied to specific comments there, but at a high level: the formal work you’ve been doing on this does seem helpful and relevant (thanks for doing it!). And other convergent phenomena seem like helpful analogs to have in mind.

Comment by Joe Carlsmith (joekc) on Draft report on existential risk from power-seeking AI · 2021-05-01T00:35:51.756Z · LW · GW

Glad to hear it, Steven. Thanks for reading, and for taking the time to write up your own threat model.

Comment by Joe Carlsmith (joekc) on Draft report on existential risk from power-seeking AI · 2021-05-01T00:34:14.031Z · LW · GW

Thanks, this seems like a salient type of consideration, and one that isn’t captured very explicitly in the current list (though I think it may play a role in explaining the bullet point about humans with general skill-sets being in-demand).

Comment by Joe Carlsmith (joekc) on Draft report on existential risk from power-seeking AI · 2021-05-01T00:33:16.058Z · LW · GW

Hi Daniel, 

Thanks for reading. I think estimating p(doom) by different dates (and in different take-off scenarios) can be a helpful consistency check, but I disagree with your particular “sanity check” here -- and in particular, premise (2). That is, I don’t think that conditional on APS systems becoming possible/financially feasible by 2035, it’s clear that we should have at least 50% on doom (perhaps some of the disagreement here is about what it takes for the problem to be "real," and to get "solved"?). Nor do I see 10% on “Conditional on it being both possible and strongly incentivized to build APS systems, APS systems will end up disempowering approximately all of humanity” as obviously overconfident (though I do take some objections in this vein seriously). I’m not sure exactly what “10% on nuclear war” analog argument you have in mind: would you be able to sketch it out, even if hazily?

Comment by Joe Carlsmith (joekc) on Clarifying inner alignment terminology · 2021-02-19T21:33:00.566Z · LW · GW

Cool (though FWIW, if you're going to lean on the notion of policies being aligned with humans, I'd be inclined to define that as well, in addition to defining what it is for agents to be aligned with humans. But maybe the implied definition is clear enough: I'm assuming you have in mind something like "a policy is aligned with humans if an agent implementing that policy is aligned with humans."). 

Regardless, sounds like your definition is pretty similar to: "An agent is intent aligned if its behavioral objective is such that an arbitrarily powerful and competent agent pursuing this objective to arbitrary extremes wouldn't act in ways that humans judge bad"? If you see it as importantly different from this, I'd be curious.

Comment by Joe Carlsmith (joekc) on Clarifying inner alignment terminology · 2021-02-19T18:43:57.003Z · LW · GW

Aren't they now defined in terms of each other? 

"Intent alignment: An agent is intent aligned if its behavioral objective is outer aligned.

Outer alignment: An objective function is outer aligned if all models that perform optimally on it in the limit of perfect training and infinite data are intent aligned."