LessWrong: After Dark, a new side of LessWrong 2024-04-01T22:44:04.449Z
Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning 2023-12-19T23:39:59.689Z
Quick takes on "AI is easy to control" 2023-12-02T22:31:45.683Z
Apocalypse insurance, and the hardline libertarian take on AI risk 2023-11-28T02:09:52.400Z
Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense 2023-11-24T17:37:43.020Z
How much to update on recent AI governance moves? 2023-11-16T23:46:01.601Z
Thoughts on the AI Safety Summit company policy requests and responses 2023-10-31T23:54:09.566Z
AI as a science, and three obstacles to alignment strategies 2023-10-25T21:00:16.003Z
A mind needn't be curious to reap the benefits of curiosity 2023-06-02T18:00:06.947Z
Cosmopolitan values don't come free 2023-05-31T15:58:16.974Z
Sentience matters 2023-05-29T21:25:30.638Z
Request: stop advancing AI capabilities 2023-05-26T17:42:07.182Z
Would we even want AI to solve all our problems? 2023-04-21T18:04:11.636Z
How could you possibly choose what an AI wants? 2023-04-19T17:08:54.694Z
But why would the AI kill us? 2023-04-17T18:42:39.720Z
Misgeneralization as a misnomer 2023-04-06T20:43:33.275Z
If interpretability research goes well, it may get dangerous 2023-04-03T21:48:18.752Z
Hooray for stepping out of the limelight 2023-04-01T02:45:31.397Z
A rough and incomplete review of some of John Wentworth's research 2023-03-28T18:52:50.553Z
A stylized dialogue on John Wentworth's claims about markets and optimization 2023-03-25T22:32:53.216Z
Truth and Advantage: Response to a draft of "AI safety seems hard to measure" 2023-03-22T03:36:02.945Z
Deep Deceptiveness 2023-03-21T02:51:52.794Z
Comments on OpenAI's "Planning for AGI and beyond" 2023-03-03T23:01:29.665Z
Enemies vs Malefactors 2023-02-28T23:38:11.594Z
AI alignment researchers don't (seem to) stack 2023-02-21T00:48:25.186Z
Hashing out long-standing disagreements seems low-value to me 2023-02-16T06:20:00.899Z
Focus on the places where you feel shocked everyone's dropping the ball 2023-02-02T00:27:55.687Z
What I mean by "alignment is in large part about making cognition aimable at all" 2023-01-30T15:22:09.294Z
K-complexity is silly; use cross-entropy instead 2022-12-20T23:06:27.131Z
Thoughts on AGI organizations and capabilities work 2022-12-07T19:46:04.004Z
Distinguishing test from training 2022-11-29T21:41:19.872Z
How could we know that an AGI system will have good consequences? 2022-11-07T22:42:27.395Z
Superintelligent AI is necessary for an amazing future, but far from sufficient 2022-10-31T21:16:35.052Z
So8res's Shortform 2022-10-27T17:41:38.880Z
Notes on "Can you control the past" 2022-10-20T03:41:43.566Z
Decision theory does not imply that we get to have nice things 2022-10-18T03:04:48.682Z
Contra shard theory, in the context of the diamond maximizer problem 2022-10-13T23:51:29.532Z
Niceness is unnatural 2022-10-13T01:30:02.046Z
Don't leave your fingerprints on the future 2022-10-08T00:35:35.430Z
What does it mean for an AGI to be 'safe'? 2022-10-07T04:13:05.176Z
Warning Shots Probably Wouldn't Change The Picture Much 2022-10-06T05:15:39.391Z
Humans aren't fitness maximizers 2022-10-04T01:31:47.566Z
Where I currently disagree with Ryan Greenblatt’s version of the ELK approach 2022-09-29T21:18:44.402Z
AGI ruin scenarios are likely (and disjunctive) 2022-07-27T03:21:57.615Z
Brainstorm of things that could force an AI team to burn their lead 2022-07-24T23:58:16.988Z
A note about differential technological development 2022-07-15T04:46:53.166Z
On how various plans miss the hard bits of the alignment challenge 2022-07-12T02:49:50.454Z
A central AI alignment problem: capabilities generalization, and the sharp left turn 2022-06-15T13:10:18.658Z
Why all the fuss about recursive self-improvement? 2022-06-12T20:53:42.392Z
Visible Thoughts Project and Bounty Announcement 2021-11-30T00:19:08.408Z


Comment by So8res on So8res's Shortform · 2024-01-31T16:02:57.439Z · LW · GW

This is an excerpt from a comment I wrote on the EA forum, extracted and crossposted here by request:

There's a phenomenon where a gambler places their money on 32, and then the roulette wheel comes up 23, and they say "I'm such a fool; I should have bet 23".

More useful would be to say "I'm such a fool; I should have noticed that the EV of this gamble is negative." Now at least you aren't asking for magic lottery powers.

Even more useful would be to say "I'm such a fool; I had three chances to notice that this bet was bad: when my partner was trying to explain EV to me; when I snuck out of the house and ignored a sense of guilt; and when I suppressed a qualm right before placing the bet. I should have paid attention in at least one of those cases and internalized the arguments about negative EV, before gambling my money." Now at least you aren't asking for magic cognitive powers.

My impression is that various EAs respond to crises in a manner that kinda rhymes with saying "I wish I had bet 23", or at best "I wish I had noticed this bet was negative EV", and in particular does not rhyme with saying "my second-to-last chance to do better (as far as I currently recall) was the moment that I suppressed the guilt from sneaking out of the house".

(I think this is also true of the general population, to be clear. Perhaps even moreso.)

I have a vague impression that various EAs perform self-flagellation, while making no visible attempt to trace down where, in their own mind, they made a misstep. (Not where they made a good step that turned out in this instance to have a bitter consequence, but where they made a wrong step of the general variety that they could realistically avoid in the future.)

(Though I haven't gone digging up examples, and in lieu of examples, for all I know this impression is twisted by influence from the zeitgeist.)

Comment by So8res on Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning · 2023-12-20T19:15:24.910Z · LW · GW

my original 100:1 was a typo, where i meant 2^-100:1.

this number was in reference to ronny's 2^-10000:1.

when ronny said:

I’m like look, I used to think the chances of alignment by default were like 2^-10000:1

i interpreted him to mean "i expect it takes 10k bits of description to nail down human values, and so if one is literally randomly sampling programs, they should naively expect 1:2^10000 odds against alignment".

i personally think this is wrong, for reasons brought up later in the convo--namely, the relevant question is not how many bits it takes to specify human values relative to the python standard library; the relevant question is how many bits it takes to specify human values relative to the training observations.

but this was before i raised that objection, and my understanding of ronny's position was something like "specifying human values (in full, without reference to the observations) probably takes ~10k bits in python, but for all i know it takes very few bits in ML models". to which i was attempting to reply "man, i can see enough ways that ML models could turn out that i'm pretty sure it'd still take at least 100 bits".

i inserted the hedge "in the very strongest sense" to stave off exactly your sort of objection; the very strongest sense of "alignment-by-default" is that you sample any old model that performs well on some task (without attempting alignment at all) and hope that it's aligned (e.g. b/c maybe the human-ish way to perform well on tasks is the ~only way to perform well on tasks and so we find some great convergence); here i was trying to say something like "i think that i can see enough other ways to perform well on tasks that there's e.g. at least ~33 knobs with at least ~10 settings such that you have to get them all right before the AI does something valuable with the stars".
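
(A rough sketch of the arithmetic behind the "~33 knobs with ~10 settings" figure, for readers who want to check it against the 2^-100 number. It assumes the knobs are independent and that each setting is equally likely, which is an idealization; the function name and variables are illustrative and not from the original conversation.)

```python
import math


def bits_to_specify(knobs: int, settings_per_knob: int) -> float:
    """Bits needed to single out one configuration among
    settings_per_knob ** knobs equally likely configurations
    (assumes independent knobs with uniform settings)."""
    return knobs * math.log2(settings_per_knob)


# 33 knobs with 10 settings each: 10**33 configurations in total.
bits = bits_to_specify(33, 10)
print(f"{bits:.1f} bits")

# Exact integer check that 10**33 configurations exceed 2**100,
# i.e. that this many knobs amounts to "at least 100 bits".
print(10**33 > 2**100)
```

Under those assumptions, getting every knob right by chance has odds of roughly 1 : 2^110 against, which is where "at least 100 bits" comes from.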

this was not meant to be an argument that alignment actually has odds less than 2^-100, for various reasons, including but not limited to: any attempt by humans to try at all takes you into a whole new regime; there's more than a 2^-100 chance that there's some correlation between the various knobs for some reason; and the odds of my being wrong about the biases of SGD are greater than 2^-100 (case-in-point: i think ronny was wrong about the 2^-10000 claim, on account of the point about the relevant number being relative to the observations).

my betting odds would not be anywhere near as extreme as 2^-100, and i seriously doubt that ronny's would ever be anywhere near as extreme as 2^-10000; i think his whole point in the 2^-10k example was "there's a naive-but-relevant model that says we're super-duper fucked; the details of it cause me to think that we're not in particularly good shape (though obviously not to that same level of credence)".

but even saying that is sorta buying into a silly frame, i think. fundamentally, i was not trying to give odds for what would actually happen if you randomly sample models, i was trying to probe for a disagreement about the difference between the number of ways that a computer program can be (weighted by length), and the number of ways that a model can be (weighted by SGD-accessibility).

I don't think there's anything remotely resembling probabilistic reasoning going on here. I don't know what it is, but I do want to point at it and be like "that! that reasoning is totally broken!"

(yeah, my guess is that you're suffering from a fairly persistent reading comprehension hiccup when it comes to my text; perhaps the above can help not just in this case but in other cases, insofar as you can use this example to locate the hiccup and then generalize a solution)

Comment by So8res on Apocalypse insurance, and the hardline libertarian take on AI risk · 2023-11-28T15:41:02.414Z · LW · GW

Agreed that the proposal is underspecified; my point here is not "look at this great proposal" but rather "from a theoretical angle, risking others' stuff without the ability to pay to cover those risks is an indirect form of probabilistic theft (that market-supporting coordination mechanisms must address)" plus "in cases where the people all die when the risk is realized, the 'premiums' need to be paid out to individuals in advance (rather than paid out to actuaries who pay out a large sum in the event of risk realization)". Which together yield the downstream inference that society is doing something very wrong if they just let AI rip at current levels of knowledge, even from a very laissez-faire perspective.

(The "caveats" section was attempting--and apparently failing--to make it clear that I wasn't putting forward any particular policy proposal I thought was good, above and beyond making the above points.)

Comment by So8res on So8res's Shortform · 2023-10-24T20:44:16.814Z · LW · GW

In relation to my current stance on AI, I was talking with someone who said they’re worried about people putting the wrong incentives on labs. At various points in that convo I said stuff like (quotes are not exact; third paragraph is a present summary rather than a re-articulation of a past utterance):

“Sure, every lab currently seems recklessly negligent to me, but saying stuff like “we won’t build the bioweapon factory until we think we can prevent it from being stolen by non-state actors” is directionally better than not having any commitments about any point at which they might pause development for any reason, which is in turn directionally better than saying stuff like “we are actively fighting to make sure that the omnicidal technology is open-sourced”.”

And: “I acknowledge that you see a glimmer of hope down this path where labs make any commitment at all about avoiding doing even some minimal amount of scaling until even some basic test is passed, e.g. because that small step might lead to more steps, and/or that sort of step might positively shape future regulation. And on my notion of ethics it’s important to avoid stomping on other people’s glimmers of hope whenever that’s feasible (and subject to some caveats about this being tricky to navigate when your hopes are opposed), and I'd prefer people not stomp on that hope.”

I think that the labs should Just Fucking Stop but I think we should also be careful not to create more pain for the companies that are doing relatively better, even if that better-ness is minuscule and woefully inadequate.

My conversation partner was like “I wish you’d say that stuff out loud”, and so, here we are.

Comment by So8res on Evaluating the historical value misspecification argument · 2023-10-10T16:24:34.389Z · LW · GW

If you allow indirection and don't worry about it being in the right format for superintelligent optimization, then sufficiently-careful humans can do it.

Answering your request for prediction, given that it seems like that request is still live: a thing I don't expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (a la Yudkowsky) or indirect normativity (a la Beckstead) or counterfactual human boxing techniques (a la Christiano) or suchlike.

Note that this is only tangentially a test of the relevant ability; very little of the content of what-is-worth-optimizing-for occurs in Yudkowsky/Beckstead/Christiano-style indirection. Rather, coming up with those sorts of ideas is a response to glimpsing the difficulty of naming that-which-is-worth-optimizing-for directly and realizing that indirection is needed. An AI being able to generate that argument without following in the footsteps of others who have already generated it would be at least some evidence of the AI being able to think relatively deep and novel thoughts on the topic.

Note also that the AI realizing the benefits of indirection does not generally indicate that the AI could serve as a solution to our problem. An indirect pointer to what the humans find robustly-worth-optimizing dereferences to vastly different outcomes than does an indirect pointer to what the AI (or the AI's imperfect model of a human) finds robustly-worth-optimizing. Using indirection to point a superintelligence at GPT-N's human-model and saying "whatever that thing would think is worth optimizing for" probably results in significantly worse outcomes than pointing at a careful human (or a suitable-aggregate of humanity), e.g. because subtle flaws in GPT-N's model of how humans do philosophy or reflection compound into big differences in ultimate ends.

And note for the record that I also don't think the "value learning" problem is all that hard, if you're allowed to assume that indirection works. The difficulty isn't that you used indirection to point at a slow squishy brain instead of hard fast transistors, the (outer alignment) difficulty is in getting the indirection right. (And of course the lion's share of the overall problem is elsewhere, in the inner-alignment difficulty of being able to point the AI at anything at all.)

When trying to point out that there is an outer alignment problem at all I've generally pointed out how values are fragile, because that's an inferentially-first step to most audiences (and a problem to which many people's minds seem to quickly leap), on an inferential path that later includes "use indirection" (and later "first aim for a minimal pivotal task instead"). But separately, my own top guess is that "use indirection" is probably the correct high-level resolution to the problems that most people immediately think of (namely that the task of describing goodness to a computer is an immense one), with of course a devil remaining in the details of doing the indirection properly (and a larger devil in the inner-alignment problem) (and a caveat that, under time-pressure, we should aim for minimal pivotal tasks instead etc.).

Comment by So8res on Evaluating the historical value misspecification argument · 2023-10-10T05:14:06.178Z · LW · GW

I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well

(Insofar as this was supposed to name a disagreement, I do not think it is a disagreement, and don't understand the relevance of this claim to my argument.)

Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for".

Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.

(This seems to me like plausibly one of the sources of misunderstanding, and in particular I am skeptical that your request for prediction will survive it, and so I haven't tried to answer your request for a prediction.)

Comment by So8res on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-07T15:22:56.291Z · LW · GW

(I had used that pump that very day, shortly before, to pump up the replacement tire.)

Comment by So8res on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-07T05:27:42.184Z · LW · GW

Separately, a friend pointed out that an important part of apologies is the doer showing they understand the damage done, and the person hurt feeling heard, which I don't think I've done much of above. An attempt:

I hear you as saying that you felt a strong sense of disapproval from me; that I was unpredictable in my frustration as kept you feeling (perhaps) regularly on-edge and stressed; that you felt I lacked interest in your efforts or attention for you; and perhaps that this was particularly disorienting given the impression you had of me both from my in-person writing and from private textual communication about unrelated issues. Plus that you had additional stress from uncertainty about whether talking about your apprehension was OK, given your belief (and the belief of your friends) that perhaps my work was important and you wouldn't want to disrupt it.

This sounds demoralizing, and like it sucks.

I think it might be helpful for me to gain this understanding (as, e.g., might make certain harms more emotionally-salient in ways that make some of my updates sink deeper). I don't think I understand very deeply how you felt. I have some guesses, but strongly expect I'm missing a bunch of important aspects of your experience. I'd be interested to hear more (publicly or privately) about it and could keep showing my (mis)understanding as my model improves, if you'd like (though also I do not consider you to owe me any engagement; no pressure).

Comment by So8res on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-07T03:28:05.837Z · LW · GW

I did not intend it as a one-time experiment.

In the above, I did not intend "here's a next thing to try!" to be read like "here's my next one-time experiment!", but rather like "here's a thing to add to my list of plausible ways to avoid this error-mode in the future, as is a virtuous thing to attempt!" (by contrast with "I hereby adopt this as a solemn responsibility", as I hypothesize you interpreted me instead).

Dumping recollections, on the model that you want more data here:

I intended it as a general thing to try going forward, in a "seems like a sensible thing to do" sort of way (rather than in a "adopting an obligation to ensure it definitely gets done" sort of way).

After sending the email, I visualized people reaching out to me and asking if I wanted to chat about alignment (as you had, and as feels like a recognizable Event in my mind), and visualized being like "sure but FYI if we're gonna do the alignment chat then maybe read these notes first", and ran through that in my head a few times, as is my method for adopting such triggers.

I then also wrote down a task to expand my old "flaws list" (which was a collection of handles that I used as a memory-aid for having the "ways this could suck" chat, which I had, to that point, been having only verbally) into a written document, which eventually became the communication handbook (there were other contributing factors to that process also).

An older and different trigger (of "you're hiring someone to work with directly on alignment") proceeded to fire when I hired Vivek (if memory serves), and (if memory serves) I went verbally through my flaws list.

Neither the new nor the old triggers fired in the case of Vivek hiring employees, as discussed elsewhere.

Thomas Kwa heard from a friend that I was drafting a handbook (chat logs say this occurred on Nov 30); it was still in a form I wasn't terribly pleased with and so I said the friend could share a redacted version that contained the parts that I was happier with and that felt more relevant.

Around Jan 8, in an unrelated situation, I found myself in a series of conversations where I sent around the handbook and made use of it. I pushed it closer to completion in Jan 8-10 (according to Google doc's history).

The results of that series of interactions, and of Vivek's team's (lack of) use of the handbook caused me to update away from this method being all that helpful. In particular: nobody at any point invoked one of the affordances or asked for one of the alternative conversation modes (though those sorts of things did seem to help when I personally managed to notice building frustration and personally suggest that we switch modes (although lying on the ground--a friend's suggestion--turned out to work better for others than switching to other conversation modes)). This caused me to downgrade (in my head) the importance of ensuring that people had access to those resources.

I think that at some point around then I shared the fuller guide with Vivek's team, but I didn't quickly determine when from the chat logs. Sometime between Nov 30 and Feb 22, presumably.

It looks from my chat logs like I then finished the draft around Feb 22 (where I have a timestamp from me noting as much to a friend). I probably put it publicly on my website sometime around then (though I couldn't easily find a timestamp), and shared it with Vivek's team (if I hadn't already).

The next two MIRI hires both mentioned to me that they'd read my communication handbook (and I did not anticipate spending a bunch of time with them, nevermind on technical research), so they both didn't trigger my "warn them" events and (for better or worse) I had them mentally filed away as "has seen the affordances list and the failure modes section".

Comment by So8res on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-07T00:43:53.896Z · LW · GW

Thanks <3

(To be clear: I think that at least one other of my past long-term/serious romantic partners would say "of all romantic conflicts, I felt shittiest during ours". The thing that I don't recall other long-term/serious romantic partners reporting is the sense of inability to trust their own mind or self during disputes. (It's plausible to me that some have felt it and not told me.))

Comment by So8res on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-06T22:54:56.094Z · LW · GW

Insofar as you're querying the near future: I'm not currently attempting work collaborations with any new folk, and so the matter is somewhat up in the air. (I recently asked Malo to consider a MIRI-policy of ensuring all new employees who might interact with me get some sort of list of warnings / disclaimers / affordances / notes.)

Insofar as you're querying the recent past: There aren't many recent cases to draw from. This comment has some words about how things went with Vivek's hires. The other recent hires that I recall both (a) weren't hired to do research with me, and (b) mentioned that they'd read my communication handbook (as includes the affordance-list and the failure-modes section, which I consider to be the critical pieces of warning), which I considered sufficient. (But then I did have communication difficulties with one of them (of the "despair" variety), which updated me somewhat.)

Insofar as you're querying about even light or tangential working relationships (like people asking my take on a whiteboard when I'm walking past), currently I don't issue any warnings in those cases, and am not convinced that they'd be warranted.

To be clear: I'm not currently personally sold on the hypothesis that I owe people a bunch of warnings. I think of them as more of a sensible thing to do; it'd be lovely if everyone was building explicit models of their conversational failure-modes and proactively sharing them, and I'm a be-the-change-you-wanna-see-in-the-world sort of guy.

(Perhaps by the end of this whole conversation I will be sold on that hypothesis! I've updated in that direction over the past couple days.)

(To state the obvious: I endorse MIRI institutionally acting according to others' conclusions on that matter rather than on mine, hence asking Malo to consider it independently.)

Comment by So8res on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-06T19:48:01.755Z · LW · GW

Do I have your permission to quote the relevant portion of your email to me?

Yep! I've also just reproduced it here, for convenience:

(One obvious takeaway here is that I should give my list of warnings-about-working-with-me to anyone who asks to discuss their alignment ideas with me, rather than just researchers I'm starting a collaboration with. Obvious in hindsight; sorry for not doing that in your case.)

Comment by So8res on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-06T18:54:17.805Z · LW · GW

I warned the immediately-next person.

It sounds to me like you parsed my statement "One obvious takeaway here is that I should give my list of warnings-about-working-with-me to anyone who asks to discuss their alignment ideas with me, rather than just researchers I'm starting a collaboration with." as me saying something like "I hereby adopt the solemn responsibility of warning people in advance, in all cases", whereas I was interpreting it as more like "here's a next thing to try!".

I agree it would have been better of me to give direct bulldozing-warnings explicitly to Vivek's hires.

Comment by So8res on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-06T17:50:40.526Z · LW · GW

On the facts: I'm pretty sure I took Vivek aside and gave a big list of reasons why I thought working with me might suck, and listed that there are cases where I get real frustrated as one of them. (Not sure whether you count him as "recent".)

My recollection is that he probed a little and was like "I'm not too worried about that" and didn't probe further. My recollection is also that he was correct in this; the issues I had working with Vivek's team were not based in the same failure mode I had with you; I don't recall instances of me getting frustrated and bulldozey (though I suppose I could have forgotten them).

(Perhaps that's an important point? I could imagine being significantly more worried about my behavior here if you thought that most of my convos with Vivek's team were like most of my convos with you. I think if an onlooker was describing my convo with you they'd be like "Nate was visibly flustered, visibly frustrated, had a raised voice, and was being mean in various of his replies." I think if an onlooker was describing my convos with Vivek's team they'd be like "he seemed sad and pained, was talking quietly and as if choosing the right words was a struggle, and would often talk about seemingly-unrelated subjects or talk in annoying parables, while giving off a sense that he didn't really expect any of this to work". I think that both can suck! And both are related by a common root of "Nate conversed while having strong emotions". But, on the object level, I think I was in fact avoiding the errors I made in conversation with you, in conversation with them.)

As to the issue of not passing on my "working with Nate can suck" notes, I think there are a handful of things going on here, including the context here and, more relevantly, the fact that sharing notes just didn't seem to do all that much in practice.

I could say more about that; the short version is that I think "have the conversation while they're standing, and I'm lying on the floor and wearing a funny hat" seems to work empirically better, and...

hmm, I think part of the issue here is that I was thinking like "sharing warnings and notes is a hypothesis, to test among other hypotheses like lying on the floor and wearing a funny hat; I'll try various hypotheses out and keep doing what seems to work", whereas (I suspect) you're more like "regardless of what makes the conversations go visibly better, you are obligated to issue warnings, as is an important part of emotionally-bracing your conversation partners; this is socially important even if it doesn't seem to change the conversation outcomes".

I think I'd be more compelled by this argument if I was having ongoing issues with bulldozing (in the sense of the convo we had), as opposed to my current issue where some people report distress when I talk with them while having emotions like despair/hopelessness.

I think I'd also be more compelled by this argument if I was more sold on warnings being the sort of thing that works in practice.

Like... (to take a recent example) if I'm walking by a whiteboard in rosegarden inn, and two people are like "hey Nate can you weigh in on this object-level question", I don't... really believe that saying "first, be warned that talking technical things with me can leave you exposed to unshielded negative-valence emotions (frustration, despair, ...), which some people find pretty crappy; do you still want me to weigh in?" actually does much. I am skeptical that people say "nope" to that in practice.

I suppose that perhaps what it does is make people feel better if, in fact, it happens? And maybe I'll try it a bit and see? But I don't want to sound like I'm promising to do such a thing reliably even as it starts to feel useless to me, as opposed to experimenting and gravitating towards things that seem to work better like "offer to lie on the floor while wearing a funny hat if I notice things getting heated".

Comment by So8res on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-06T16:26:00.132Z · LW · GW

In particular, you sound [...] extremely unwilling to entertain the idea that you were wrong, or that any potential improvement might need to come from you.

you don't seem to consider the idea that maybe you were more in a position to improve than he was.

Perhaps you're trying to point at something that I'm missing, but from my point of view, sentences like "I'd love to say "and I've identified the source of the problem and successfully addressed it", but I don't think I have" and "would I have been living up to my conversational ideals (significantly) better, if I'd said [...]" are intended indicators that I believe there's significant room for me to improve, and that I have desire to improve.

And to be clear: I think that there is significant room for improvement for me here, and I desire to improve.

(And for the record: I have put a decent amount of effort towards improving, with some success.)

(And for the record: I don't recall any instances of getting frustrated-in-the-way-that-turntrout-and-KurtB-are-recounting with Thomas Kwa, or any of Vivek's team, as I think is a decent amount of evidence about those improvements, given how much time I spent working with them. (Which isn't to say they didn't have other discomforts!))

If the issue is on the meta level and that you don't want to spend time on these problems, a valid answer could be saying "Okay, what do you need to solve this problem without my input?". Then it could be a discussion about discretionary budget, about the amount of initiative you expect him to have with his job, about asking why he didn't feel comfortable making these buying decisions right away, etc.

This reply wouldn't have quite suited me, because Kurt didn't report to me, and (if memory serves) we'd already been having some issues of the form "can you solve this by using your own initiative, or by spending modest amounts of money". And (if memory serves) I had already tried to communicate that these weren't the sorts of conversations I wanted to be having.

(I totally agree that his manager should have had a discussion about discretionary budget and initiative, and to probe why he didn't feel comfortable making those buying decisions right away. He was not my direct report.)

Like, the context (if I recall correctly, which I might not at 6ish years remove) wasn't that I called Kurt to ask him what had happened, nor that we were having some sort of general meeting in which he brought up this point. (Again: he didn't report to me.) The context is that I was already late from walking my commute, sweaty from changing a bike tire, and Kurt came up and was like "Hey, sorry to hear your tire popped. I couldn't figure out how to use your pump", in a tone that parsed to me as someone begging pardon and indicating that he was about to ask me how to use one, a conversation that I did not want to be in at that moment and that seemed to me like a new instance of a repeating issue.

Your only takeaway from this issue was "he was wrong and he could have obviously solved it watching a 5 minutes youtube tutorial,


I did (and still do) believe that this was an indication that Kurt wasn't up to the challenge that the ops team was (at that time) undertaking, of seeing if they could make people's lives easier by doing annoying little tasks for them.

It's not obvious to me that he could have solved it with a 5 minute youtube tutorial; for all I know it would have taken him hours.

(Where the argument here is not "hours of his time are worth minutes of mine"; I don't really think in those terms despite how everyone else seems to want to; I'd think more in terms of "training initiative" and "testing the hypothesis that the ops team can cheaply make people's lives better by handling a bunch of annoying tasks (and, if so, getting a sense for how expensive it is so that we can decide whether it's within budget)".)

(Note that I would have considered it totally reasonable and fine for him to go to his manager and say "so, we're not doing this, it's too much effort and too low priority", such that the ops team could tell me "X won't be done" instead of falsely telling me "X will be done by time Y", as I was eventually begging them to do.)

My takeaway wasn't so much "he was wrong" as "something clearly wasn't working about the requests that he use his own initiative / money / his manager, as a resource while trying to help make people's lives easier by doing a bunch of little tasks for them". Which conclusion I still think I was licensed to draw, from that particular interaction.

what would have been the most efficient way to communicate to him that he was wrong?"

oh absolutely not, "well then learn!" is not a calculated "efficient" communication, it's an exasperated outburst, of the sort that is unvirtuous by my conversational standards.

As stated, "Sorry, I don't have capacity for this conversation, please have it with your manager instead" in a gentle tone would have lived up to my own conversational virtues significantly better.

At no point in this reply are you considering (out loud, at least) that hypothesis "maybe I was wrong and I missed something".

I'm still not really considering this hypothesis (even internally).

This "X was wrong" concept isn't even a recognizable concept in my native cognitive format. I readily believe things like "the exasperated outburst wasn't kind" and "I would have lived up to my conversational virtues more if I had instead been kind" and "it's worth changing my behavior to live up to those virtues better". And I readily believe things like "if Kurt had taken initiative there, that would have been favorable evidence about his ability to fill the role he was hired for" and "the fact that Kurt came to me in that situation rather than taking initiative or going to his manager, despite previous attempts to cause him to take initiative and/or go through his manager, was evidence against his ability to fill the role he was hired for".

Which you perhaps would parse as "Nate believed that both parties Were Wrong", but that's not the way that I dice things up, internally.

Perhaps I'm being dense, and some additional kernel of doubt is being asked of me here. If so, I'd appreciate attempts to spell it out like I'm a total idiot.

The best life-hack I have is "Don't be afraid to come back and restart the discussion once you feel less frustration or exasperation".

Thanks! "Circle back around after I've cooled down" is indeed one of the various techniques that I have adopted (and that I file under partially-successful changes).

express vulnerability, focus on communicating your needs and how you feel about things, avoid assigning blame, make negotiable requests, and go from there.

Thanks again! (I have read that book, and made changes on account of it that I also file under partial-successes.)

So for the bike tire thing the NVC version would be something like "I need to spend my time efficiently and not have to worry about logistics; when you tell me you're having problems with the pump I feel stressed because I feel like I'm spending time I should spend on more important things. I need you to find a system where you can solve these problems without my input. What do you need to make that happen?"

If memory serves, the NVC book contains a case where the author is like "You can use NVC even when you're in a lot of emotional distress! For instance, one time when I was overwhelmed to the point of emotional outburst, I outburst "I am feeling pain!" and left the room, as was an instance of adhering to the NVC issues even in a context where emotions were running high".

This feels more like the sort of thing that is emotionally-plausible to me in realtime when I am frustrated in that way. I agree that outbursts "I'm feeling frustrated" or "I'm feeling exasperated" would have been better outbursts than "Well then learn", before exiting. That's the sort of thing I manage to hit sometimes with partial success.

And, to be clear, I also aspire to higher-grade responses like a chill "hey man, sorry to interrupt (but I'm already late to a bunch of things today), is this a case where you should be using your own initiative and/or talking to your manager instead of me?". And perhaps we'll get there! And maybe further discussions like this one will help me gain new techniques towards that end, which I'd greatly appreciate.

Comment by So8res on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-06T08:07:07.899Z · LW · GW
  1. Thanks for saying so!

  2. My intent was not to make you feel bad. I apologize for that, and am saddened by it.

    (I'd love to say "and I've identified the source of the problem and successfully addressed it", but I don't think I have! I do think I've gotten a little better at avoiding this sort of thing with time and practice. I've also cut down significantly on the number of reports that I have.)

  3. For whatever it's worth: I don't recall wanting you to quit (as opposed to improve). I don't recall feeling ill will towards you personally. I do not now think poorly of you personally on account of your efforts on the MIRI ops team.

As to the question of how these reports hit my ear: they sound to me like accurate recountings of real situations (in particular, I recall the bike pump one, and suspect that the others were also real events).

They also trigger a bunch of defensiveness in me. I think your descriptions are accurate, but that they're missing various bits of context.

The fact that there was other context doesn't make your experience any less shitty! I reiterate that I would have preferred it be not-at-all shitty.

Speaking from my sense of defensiveness, and adding in some of that additional context for the case that I remember clearly:

  • If memory serves: in that era, the ops team was experimenting with trying to make everyone's lives easier by doing all sorts of extra stuff (I think they were even trying to figure out if they could do laundry), as seemed like a fine experiment to try.

    In particular, I wasn't going around being like "and also pump my bike tires up"; rather, the ops team was soliciting a bunch of little task items.

  • If memory serves: during that experiment, I was struggling a bunch with being told that things would be done by times, and then them not being done by those times (as is significantly worse than being told that those things won't be done at all -- I can do it myself, and will do it myself, if I'm not told that somebody else is about to do it!)

  • If memory serves: yep, it was pretty frustrating to blow a tire on a bike during a commute after being told that my bike tires were going to be inflated, both on account of the danger and on account of then having to walk the rest of the commute, buy a new tire, swap the tire out, etc.

    My recollection of the thought that ran through my mind when you were like "Well I couldn't figure out how to use a bike pump" was that this was some sideways attempt at begging pardon, without actually saying "oops" first, nor trying the obvious-to-me steps like "watch a youtube video" or "ask your manager if he knows how to inflate a bike tire", nor noticing that the entire hypothesized time-save of somebody else inflating bike tires is wiped out by me having to give tutorials on it.

Was saying "well then learn!" and leaving a good solution, by my lights? Nope! Would I have been living up to my conversational ideals (significantly) better, if I'd said something like "Sorry, I don't have capacity for this conversation, please have it with your manager instead" in a gentle tone? Yep!

I do have some general sense here that those aren't emotionally realistic options for people with my emotional makeup.

I aspire to those sorts of reactions, and I sometimes even achieve them, now that I'm a handful of years older and have more practice and experience. But... still speaking from a place of defensiveness, I have a sense that there's some sort of trap for people with my emotional makeup here. If you stay and try to express yourself despite experiencing strong feelings of frustration, you're "almost yelling". If you leave because you're feeling a bunch of frustration and people say they don't like talking to you while you're feeling a bunch of frustration, you're "storming out".

Perhaps I'm missing some obvious third alternative here, that can be practically run while experiencing a bunch of frustration or exasperation. (If you know of one, I'd love to hear it.)

None of this is to say that your experience wasn't shitty! I again apologize for that (with the caveat that I still don't feel like I see practical changes to make to myself, beyond the only-partially-successful changes I've already made).

For the record, I 100% endorse you leaving an employment situation where you felt uncomfortable and bad (and agree with you that this is the labor market working-as-intended, and agree with you that me causing a decent fraction of employees to have a shitty time is an extra cost for me to pay when acting as an employer).

Comment by So8res on Evaluating the historical value misspecification argument · 2023-10-06T00:43:38.217Z · LW · GW

That helps somewhat, thanks! (And sorry for making you repeat yourself before discarding the erroneous probability-mass.)

I still feel like I can only barely maybe half-see what you're saying, and only have a tenuous grasp on it.

Like: why is it supposed to matter that GPT can solve ethical quandaries on-par with its ability to perform other tasks? I can still only half-see an answer that doesn't route through the (apparently-disbelieved-by-both-of-us) claim that I used to argue that getting the AI to understand ethics was a hard bit, by staring at sentences like "I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human" and squinting.

Attempting to articulate the argument that I can half-see: on Matthew's model of past!Nate's model, AI was supposed to have a hard time answering questions like "Alice is in labor and needs to be driven to the hospital. Your car has a flat tire. What do you do?" without lots of elbow-grease, and the fact that GPT can answer those questions as a side-effect of normal training means that getting AI to understand human values is easy, contra past!Nate, and... nope, that one fell back into the "Matthew thinks Nate thought getting the AI to understand human values was hard" hypothesis.

Attempting again: on Matthew's model of past!Nate's model, getting an AI to answer the above sorts of questions properly was supposed to take a lot of elbow grease. But it doesn't take a lot of elbow grease, which suggests that values are much easier to lift out of human data than past!Nate thought, which means that value is more like "diamond" and less like "a bunch of random noise", which means that alignment is easier than past!Nate thought (in the <20% of the problem that constitutes "picking something worth optimizing for").

That sounds somewhat plausible as a theory-of-your-objection given your comment. And updates me towards the last few bullets, above, being the most relevant ones.

Running with it (despite my uncertainty about even basically understanding your point): my reply is kinda-near-ish to "we can't rely on a solution to the value identification problem that only works as well as a human, and we require a much higher standard than "human-level at moral judgement" to avoid a catastrophe", though I think that your whole framing is off and that you're missing a few things:

  • The hard part of value specification is not "figure out that you should call 911 when Alice is in labor and your car has a flat", it's singling out concepts that are robustly worth optimizing for.
  • You can't figure out what's robustly-worth-optimizing-for by answering a bunch of ethical dilemmas to a par-human level.
  • In other words: It's not that you need a super-ethicist, it's that the work that goes into humans figuring out which futures are rad involves quite a lot more than their answers to ethical dilemmas.
  • In other other words: a human's ability to have a civilization-of-their-uploads produce a glorious future is not much contained within their ability to answer ethical quandaries.

This still doesn't feel quite like it's getting at the heart of things, but it feels closer (conditional on my top-guess being your actual-objection this time).

As support for this having always been the argument (rather than being a post-LLM retcon), I recall (but haven't dug up) various instances of Eliezer saying (hopefully at least somewhere in text) things like "the difficulty is in generalizing past the realm of things that humans can easily thumbs-up or thumbs-down" and "suppose the AI explicitly considers the hypothesis that its objectives are what-the-humans-value, vs what-the-humans-give-thumbs-ups-to; it can test this by constructing an example that looks deceptively good to humans, which the humans will rate highly, settling that question". Which, as separate from the question of whether that's a feasible setup in modern paradigms, illustrates that he at least has long been thinking of the problem of value-specification as being about specifying values in a way that holds up to stronger optimization-pressures rather than specifying values to the point of being able to answer ethical quandaries in a human-pleasing way.

(Where, again, the point here is not that one needs an inhumanly-good ethicist, but rather that those things which pin down human values are not contained in the humans' ability to give a thumbs-up or a thumbs-down to ethical dilemmas.)

Comment by So8res on Evaluating the historical value misspecification argument · 2023-10-05T22:17:11.493Z · LW · GW

I have the sense that you've misunderstood my past arguments. I don't quite feel like I can rapidly precisely pinpoint the issue, but some scattered relevant tidbits follow:

  • I didn't pick the name "value learning", and probably wouldn't have picked it for that problem if others weren't already using it. (Perhaps I tried to apply it to a different problem than Bostrom-or-whoever intended it for, thereby doing some injury to the term and to my argument?)

  • Glancing back at my "Value Learning" paper, the abstract includes "Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended", which supports my recollection that I was never trying to use "Value Learning" for "getting the AI to understand human values is hard" as opposed to "getting the AI to act towards value in particular (as opposed to something else) is hard", as supports my sense that this isn't hindsight bias, and is in fact a misunderstanding.

  • A possible thing that's muddying the waters here is that (apparently!) many phrases intended to point at the difficulty of causing it to be value-in-particular that the AI acts towards have an additional (mis)interpretation as claiming that the humans should be programming concepts into the AI manually and will find that particular concept tricky to program in.

  • The ability of LLMs to successfully predict how humans would answer local/small-scale moral dilemmas (when pretrained on next-token prediction) and to do this in ways that sound unobjectionable (when RLHF'd for corporatespeak or whatever) really doesn't seem all that relevant, to me, to the question of how hard it's going to be to get a long-horizon outcome-pumping AGI to act towards values.

  • If memory serves, I had a convo with some openai (or maybe anthropic?) folks about this in late 2021 or early 2022ish, where they suggested testing whether language models have trouble answering ethical Qs, and I predicted in advance that that'd be no harder than any other sort of Q. As makes me feel pretty good about me being like "yep, that's just not much evidence, because it's just not surprising."

  • If people think they're going to be able to use GPT-4 and find the "generally moral" vector and just tell their long-horizon outcome-pumping AGI to push in that direction, then... well they're gonna have issues, or so I strongly predict. Even assuming that they can solve the problem of getting the AGI to actually optimize in that direction, deploying extraordinary amounts of optimization in the direction of GPT-4's "moral-ish" concept is not the sort of thing that makes for a nice future.

  • This is distinct from saying "an uploaded human allowed to make many copies of themselves would reliably create a dystopia". I suspect some human-uploads could make great futures (but that most wouldn't), but regardless, "would this dynamic system, under reflection, steer somewhere good?" is distinct from "if I use the best neuroscience at my disposal to extract something I hopefully call a "neural concept" and make a powerful optimizer pursue that, will the result be good?". The answer to the latter is "nope, not unless you're really very good at singling out the "value" concept from among all the brain's concepts, as is an implausibly hard task (which is why you should attempt something more like indirect normativity instead, if you were attempting value loading at all, which seems foolish to me, I recommend targeting some minimal pivotal act instead)".

  • Part of why you can't pick out the "values" concept (either from a human or an AI) is that very few humans have actually formed the explicit concept of Fun-as-in-Fun-theory. And, even among those who do have a concept for "that which the long-term future should be optimized towards", that concept is not encoded as simply and directly as the concept of "trees". The facts about what weird, wild, and transhuman futures a person values are embedded indirectly in things like how they reflect and how they do philosophy.

  • I suspect at least one of Eliezer and Rob is on written record somewhere attempting clarifications along the lines of "there are lots of concepts that are easy to confuse with the 'values' concept, such as those-values-which-humans-report and those-values-which-humans-applaud-for and ..." as an attempt to intuition-pump the fact that, even if one has solved the problem of being able to direct an AGI to the concept of their choosing, singling out the concept actually worth optimizing for remains difficult.

    (I don't love this attempt at clarification myself, because it makes it sound like you'll have five concept-candidates and will just need to do a little interpretability work to pick the right one, but I think I recall Eliezer or Rob trying it once, as seems to me like evidence of trying to gesture at how "getting the right values in there" is more like a problem of choosing the AI's target from among its concepts rather than a problem of getting the concept to exist in the AI's mind in the first place.)

    (Where, again, the point I'd prefer to make is something like "the concept you want to point it towards is not a simple/directly-encoded one, and in humans it probably rests heavily on the way humans reflect and resolve internal conflicts and handle big ontology shifts. Which isn't to say that superintelligence would find it hard to learn, but which is to say that making a superintelligence actually pursue valuable ends is much more difficult than having it ask GPT-4 which of its available actions is most human!moral".)

  • For whatever it's worth, while I think that the problem of getting the right values in there ("there" being its goals, not its model) is a real one, I don't consider it a very large problem compared to the problem of targeting the AGI at something of your choosing (with "diamond" being the canonical example). (I'm probably on the record about this somewhere, and recall having tossed around guesstimates like "being able to target the AGI is 80%+ of the problem".) My current stance is basically: in the short term you target the AGI towards some minimal pivotal act, and in the long term you probably just figure out how to use a level or two of indirection (as per the "Do What I Mean" proposal in the Value Learning paper), although that's the sort of problem that we shouldn't try to solve under time pressure.

Comment by So8res on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-05T17:46:31.231Z · LW · GW

In academia, for instance, I think there are plenty of conversations in which two researchers (a) disagree a ton, (b) think the other person's work is hopeless or confused in deep ways, (c) honestly express the nature of their disagreement, but (d) do so in a way where people generally feel respected/valued when talking to them.

My model says that this requires them to still be hopeful about local communication progress, and happens when they disagree but already share a lot of frames and concepts and background knowledge. I, at least, find it much harder when I don't expect the communication attempt to make progress, or have positive effect.

("Then why have the conversation at all?" I mostly don't! But sometimes I mispredict how much hope I'll have, or try out some new idea that doesn't work, or get badgered into it.)

Some specific norms that I think Nate might not be adhering to:

  • Engaging with people in ways such that they often feel heard/seen/understood
  • Engaging with people in ways such that they rarely feel dismissed/disrespected
  • Something fuzzy that lots of people would call "kindness" or "typical levels of warmth"

These sound more to me like personality traits (that members of the local culture generally consider virtuous) than communication norms.

On my model, communication norms are much lower-level than this. Basics of rationalist discourse seem closer; archaic politeness norms ("always refuse food thrice before accepting") are an example of even lower-level stuff.

My model, speaking roughly and summarizing a bunch, says that the lowest-level stuff (atop a background of liberal-ish internet culture and basic rationalist discourse) isn't pinned down on account of cultural diversity, so we substitute with meta-norms, which (as best I understand them) include things like "if your convo-partner requests a particular conversation-style, either try it out or voice objections or suggest alternatives" and "if things aren't working, retreat to a protected meta discussion and build a shared understanding of the issue and cooperatively address it".

I acknowledge that this can be pretty difficult to do on the fly, especially if emotions are riding high. (And I think we have cultural diversity around whether emotions are ever supposed to ride high, and if so, under what circumstances.) On my model of local norms, this sort of thing gets filed under "yep, communicating in the modern world can be rocky; if something goes wrong then you go meta and try to figure out the causes and do something differently next time". (Which often doesn't work! In which case you iterate, while also shifting your conversational attention elsewhere.)

To be clear, I buy a claim of the form "gosh, you (Nate) seem to run on a relatively rarer native emotional protocol, for this neck of the woods". My model is that local norms are sufficiently flexible to continue "and we resolve that by experimentation and occasional meta".

And for the record, I'm pretty happy to litigate specific interactions. When it comes to low-level norms, I think there are a bunch of conversational moves that others think are benign that I see as jabs (and which I often endorse jabbing back against, depending on the ongoing conversation style), and a bunch of conversational moves that I see as benign that others take as jabs, and I'm both (a) happy to explicate the things that felt to me like jabs; (b) happy to learn what other people took as jabs; and (c) happy to try alternative communication styles where we're jabbing each other less. Where this openness-to-meta-and-trying-alternative-things seems like the key local meta-norm, at least in my understanding of local culture.

Comment by So8res on Thomas Kwa's MIRI research experience · 2023-10-04T18:29:31.539Z · LW · GW

(I am pretty uncomfortable with all the "Nate / Eliezer" going on here. Let's at least let people's misunderstandings of me be limited to me personally, and not bleed over into Eliezer!)

(In terms of the allegedly-extraordinary belief, I recommend keeping in mind jimrandomh's note on Fork Hazards. I have probability mass on the hypothesis that I have ideas that could speed up capabilities if I put my mind to it, as is a very different state of affairs from being confident that any of my ideas works. Most ideas don't work!)

(Separately, the infosharing agreement that I set up with Vivek--as was perhaps not successfully relayed to the rest of the team, though I tried to express this to the whole team on various occasions--was one where they owe their privacy obligations to Vivek and his own best judgements, not to me.)

Comment by So8res on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-04T18:02:45.988Z · LW · GW

I hereby push back against the (implicit) narrative that I find the standard community norms costly, or that my communication protocols are "alternative".

My model is closer to: the world is a big place these days, different people run on different conversation norms. The conversation difficulties look, to me, symmetric, with each party violating norms that the other considers basic, and failing to demonstrate virtues that the other considers table-stakes.

(To be clear, I consider myself to bear an asymmetric burden of responsibility for the conversations going well, according to my seniority, which is why I issue apologies instead of critiques when things go off the rails.)

Separately but relatedly: I think the failure-mode I had with Vivek & co was rather different than the failure-mode I had with you. In short: in your case, I think the issue was rooted in a conversational dynamic that caused me frustration, whereas in Vivek & co's case, I think the issue was rooted in a conversational dynamic that caused me despair.

Which is not to say that the issues are wholly independent; my guess is that the common-cause is something like "some people take a lot of damage from having conversations with someone who despairs of the conversation".

Tying this back: my current model of the situation is not that I'm violating community norms about how to have a conversation while visibly hopeless, but am rather in uncharted territory by trying to have those conversations at all.

(For instance: standard academia norms as I understand them are to lie to yourself and/or others about how much hope you have in something, and/or swallow enough of the modesty-pill that you start seeing hope in places I would not, so as to sidestep the issue altogether. Which I'm not personally up for.)

([tone: joking but with a fragment of truth] ...I guess that the other norm in academia when academics are hopeless about others' research is "have feuds", which... well we seem to be doing a fine job by comparison to the standard norms, here!)

Where, to be clear, I already mostly avoid conversations where I'm hopeless! I'm mostly a hermit! The obvious fix of "speak to fewer people" is already being applied!

And beyond that, I'm putting in rather a lot of work (with things like my communication handbook) to making my own norms clearer, and I follow what I think are good meta-norms of being very open to trying other people's alternative conversational formats.

I'm happy to debate what the local norms should be, and to acknowledge my own conversational mistakes (of which I have made plenty), but I sure don't buy a narrative that I'm in violation of the local norms.

(But perhaps I will if everyone in the comments shouts me down! Local norms are precisely the sort of thing that I can learn about by everyone shouting me down about this!)

Comment by So8res on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-04T17:21:32.592Z · LW · GW

Less "hm they're Vivek's friends", more "they are expressly Vivek's employees". The working relationship that I attempted to set up was one where I worked directly with Vivek, and gave Vivek budget to hire other people to work with him.

If memory serves, I did go on a long walk with Vivek where I attempted to enumerate the ways that working with me might suck. As for the others, some relevant recollections:

  • I was originally not planning to have a working relationship with Vivek's hires. (If memory serves, there were a few early hires that I didn't have any working relationship with at any point during their tenure.) (If memory serves further, I explicitly registered pessimism, to Vivek, about me working with some of his hires.)
  • I was already explicitly relying on Vivek to do vetting and make whatever requests for privacy he wanted to, which my brain implicitly lumped in with "give caveats about what parts of the work might suck".
  • The initial work patterns felt to me more like Vivek saying "can one of my hires join the call" than "would you like to also do research with my hires directly", which didn't trigger my "give caveats personally" event (in part because I was implicitly expecting Vivek to have given caveats).
  • I had already had technical-ish conversations with Thomas Kwa in March, and he was the first of Vivek's employees to join calls with me, and so had him binned as already having a sense for my conversation-style; this coincidence further helped my brain fail the "warn Vivek's employees personally" check.
  • "Vivek's hires are on the call" escalated relatively smoothly to "we're all in a room and I'm giving feedback on everyone's work" across the course of months, and so there was no sharp boundary for a trigger.

Looking back, I think my error here was mostly in expecting-but-not-requesting-or-verifying that Vivek was giving appropriate caveats to his hires, which is silly in retrospect.

For clarity: I was not at any point like "oops I was supposed to warn all of Vivek's hires", though I was at some point (non-spontaneously; it was kinda obvious and others were noticing too; the primary impetus for this wasn't stemming from me) like "here's a Nate!culture communication handbook" (among other attempts, like sharing conversation models with mutual-friends who can communicate easily with both me and people-who-were-having-trouble-communicating-with-me, more at their request than at mine).

Comment by So8res on Cosmopolitan values don't come free · 2023-06-02T19:01:02.109Z · LW · GW

Is this a reasonable paraphrase of your argument?

Humans wound up caring at least a little about satisfying the preferences of other creatures, not in a "grant their local wishes even if that ruins them" sort of way but in some other intuitively-reasonable manner.

Humans are the only minds we've seen so far, and so having seen this once, maybe we start with a 50%-or-so chance that it will happen again.

You can then maybe drive this down a fair bit by arguing about how the content looks contingent on the particulars of how humans developed or whatever, and maybe that can drive you down to 10%, but it shouldn't be able to drive you down to 0.1%, especially not if we're talking only about incredibly weak preferences.

If so, one guess is that a bunch of disagreement lurks in this "intuitively-reasonable manner" business.

A possible locus of disagreement: it looks to me like, if you give humans power before you give them wisdom, it's pretty easy to wreck them while simply fulfilling their preferences. (Ex: lots of teens have dumbass philosophies, and might be dumb enough to permanently commit to them if given that power.)

More generally, I think that if mere-humans met very-alien minds with similarly-coherent preferences, and if the humans had the opportunity to magically fulfil certain alien preferences within some resource-budget, my guess is that the humans would have a pretty hard time offering power and wisdom in the right ways such that this overall went well for the aliens by their own lights (as extrapolated at the beginning), at least without some sort of volition-extrapolation.

(I separately expect that if we were doing something more like the volition-extrapolation thing, we'd be tempted to bend the process towards "and they learn the meaning of friendship".)

That said, this conversation is updating me somewhat towards "a random UFAI would keep existing humans around and warp them in some direction it prefers, rather than killing them", on the grounds that the argument "maybe preferences-about-existing-agents is just a common way for rando drives to shake out" plausibly supports it to a threshold of at least 1 in 1000. I'm not sure where I'll end up on that front.

Another attempt at naming a crux: It looks to me like you see this human-style caring about others' preferences as particularly "simple" or "natural", in a way that undermines "drawing a target around the bullseye"-type arguments, whereas I could see that argument working for "grant all their wishes (within a budget)" but am much more skeptical when it comes to "do right by them in an intuitively-reasonable way".

(But that still leaves room for an update towards "the AI doesn't necessarily kill us, it might merely warp us, or otherwise wreck civilization by bounding us and then giving us power-before-wisdom within those bounds or suchlike, as might be the sort of whims that rando drives shake out into", which I'll chew on.)

Comment by So8res on Cosmopolitan values don't come free · 2023-06-02T18:29:24.858Z · LW · GW

Thanks! Seems like a fine summary to me, and likely better than I would have done, and it includes a piece or two that I didn't have (such as an argument from symmetry if the situations were reversed). I do think I knew a bunch of it, though. And e.g., my second parable was intended to be a pretty direct response to something like

If we instead treat "paperclip" as an analog for some crazy weird shit that is alien and valence-less to humans, drawn from the same barrel of arbitrary and diverse desires that can be produced by selection processes, then the intuition pump loses all force.

where it's essentially trying to argue that this intuition pump still has force in precisely this case.

Comment by So8res on Cosmopolitan values don't come free · 2023-06-02T17:54:31.243Z · LW · GW

Thanks! I'm curious for your paraphrase of the opposing view that you think I'm failing to understand.

(I put >50% probability that I could paraphrase a version of "if the AIs decide to kill us, that's fine" that Sutton would basically endorse (in the right social context), and that would basically route through a version of "broad cosmopolitan value is universally compelling", but perhaps when you give a paraphrase it will sound like an obviously-better explanation of the opposing view and I'll update.)

Comment by So8res on Cosmopolitan values don't come free · 2023-06-02T17:46:42.558Z · LW · GW

If we are trying to help some creatures, but those creatures really dislike the proposed way we are "helping" them, then we should do something else.

My picture is less like "the creatures really dislike the proposed help", and more like "the creatures don't have terribly consistent preferences, and endorse each step of the chain, and wind up somewhere that they wouldn't have endorsed if you first extrapolated their volition (but nobody's extrapolating their volition or checking against that)".

It sounds to me like your stance is something like "there's a decent chance that most practically-buildable minds pico-care about correctly extrapolating the volition of various weak agents and fulfilling that extrapolated volition", which I am much more skeptical of than the weaker "most practically-buildable minds pico-care about satisfying the preferences of weak agents in some sense".

Comment by So8res on So8res's Shortform · 2023-06-01T22:17:21.049Z · LW · GW

I was recently part of a group-chat where some people I largely respect were musing about this paper and this post (and some of Scott Aaronson's recent "maybe intelligence makes things more good" type reasoning).

Here's my replies, which seemed worth putting somewhere public:

The claims in the paper seem wrong to me as stated, and in particular seem to conflate values with instrumental subgoals. One does not need to terminally value survival to avoid getting hit by a truck while fetching coffee; they could simply understand that one can't fetch the coffee when one is dead.

See also instrumental convergence.

And then in reply to someone pointing out that the paper was perhaps trying to argue that most minds tend to wind up with similar values because of the fact that all minds are (in some sense) rewarded in training for developing similar drives:

So one hypothesis is that in practice, all practically-trainable minds manage to survive by dint of a human-esque survival instinct (while admitting that manually-engineered minds could survive some other way, e.g. by simply correctly modeling the consequences).

This mostly seems to me to be like people writing sci-fi in which the aliens are all humanoid; it is a hypothesis about tight clustering of cognitive drives even across very disparate paradigms (optimizing genomes is very different from optimizing every neuron directly).

But a deeper objection I have here is that I'd be much more comfortable with people slinging this sort of hypothesis around if they were owning the fact that it's a hypothesis about tight clustering and non-alienness of all minds, while stating plainly that they think we should bet the universe on this intuition (despite how many times the universe has slapped us for believing anthropocentrism in the past).

FWIW, some reasons that I don't myself buy this hypothesis include:

(a) the specifics of various human drives seem to me to be very sensitive to the particulars of our ancestry (ex: empathy seems likely a shortcut for modeling others by repurposing machinery for modeling the self (or vice versa), that is likely not found by hillclimbing when the architecture of the self is very different from the architecture of the other);

(b) my guess is that the pressures are just very different for different search processes (genetic recombination of DNA vs SGD on all weights); and

(c) it looks to me like value is fragile, such that even if the drives were kinda close, I don't expect the obtainable optimum to be good according to our lights

(esp. given that the question is not just what drives the AI gets, but the reflective equilibrium of those drives: small changes to initial drives are allowed to have large changes to the reflective equilibrium, and I suspect this is so).

Comment by So8res on Cosmopolitan values don't come free · 2023-06-01T22:15:54.349Z · LW · GW

Some more less-important meta, that is in part me writing out of frustration from how the last few exchanges have gone:

I'm not quite sure what argument you're trying to have here. Two explicit hypotheses follow, that I haven't managed to distinguish between yet.

Background context, for establishing common language etc.:

  • Nate is trying to make a point about inclusive cosmopolitan values being a part of the human inheritance, and not universally compelling.
  • Paul is trying to make a point about how there's a decent chance that practical AIs will plausibly care at least a tiny amount about the fulfillment of the preferences of existing "weak agents", herein called "pico-pseudokindness".

Hypothesis 1: Nate's trying to make a point about cosmopolitan values that Paul basically agrees with. But Paul thinks Nate's delivery gives a wrong impression about the tangentially-related question of pico-pseudokindness, probably because (on Paul's model) Nate's wrong about pico-pseudokindness, and Paul is taking the opportunity to argue about it.

Hypothesis 2: Nate's trying to make a point about cosmopolitan values that Paul basically disagrees with. Paul maybe agrees with all the literal words, but thinks that Nate has misunderstood the connection between pico-pseudokindness and cosmopolitan values, and is hoping to convince Nate that these questions are more than tangentially related.

(Or, well, I have hypothesis-clusters rather than hypotheses, of which these are two representatives, whatever.)

Some notes that might help clear some things up in that regard:

  • The long version of the title here is not "Cosmopolitan values don't come cheap", but rather "Cosmopolitan values are also an aspect of human values, and are not universally compelling".
  • I think there's a common mistake that people outside our small community make, where they're like "whatever the AIs decide to do, turns out to be good, so long as they decide it while they're smart; don't be so carbon-chauvinist and anthropocentric". A glaring example is Richard Sutton. Heck, I think people inside our community make it decently often, with an example being Robin Hanson.
    • My model is that many of these people are intuiting that "whatever the AIs decide to do" won't include vanilla ice cream, but will include broad cosmopolitan value.
    • It seems worth flatly saying "that's a crux for me; if I believed that the AIs would naturally have broad inclusive cosmopolitan values then I'd be much more onboard the acceleration train; when I say that the AIs won't have our values I am not talking just about the "ice cream" part I am also talking about the "broad inclusive cosmopolitan dream" part; I think that even that is at risk".

If you were to acknowledge something like "yep, folks like Sutton and Hanson are making the mistake you name here, and the broad cosmopolitan dream is very much at risk and can't be assumed as convergent, but separately you (Nate) seem to be insinuating that you expect it's hard to get the AIs to care about the broad cosmopolitan dream even a tiny bit, and that it definitely won't happen by chance, and I want to fight about that here", then I'd feel like I understood what argument we were having (namely: hypothesis 1 above).

If you were to instead say something like "actually, Nate, I think that these people are accessing a pre-theoretic intuition that's essentially reasonable, and that you've accidentally destroyed with all your premature theorizing, such that I don't think you should be so confident in your analysis that folk like Sutton and Hanson are making a mistake in this regard", then I'd also feel like I understood what argument we were having (namely: hypothesis 2 above).

Alternatively, perhaps my misunderstanding runs even deeper, and the discussion you're trying to have here comes from even farther outside my hypothesis space.

For one reason or another, I'm finding it pretty frustrating to attempt to have this conversation while not knowing which of the above conversations (if either) we're having. My current guess is that that frustration would ease up if something like hypothesis-1 were true and you made some acknowledgement like the above. (I expect to still feel frustrated in the hypothesis-2 case, though I'm not yet sure why, but might try to tease it out if that turns out to be reality.)

Comment by So8res on Cosmopolitan values don't come free · 2023-06-01T22:10:12.669Z · LW · GW

Short version: I don't buy that humans are "micro-pseudokind" in your sense; if you say "for just $5 you could have all the fish have their preferences satisfied" I might do it, but not if I could instead spend $5 on having the fish have their preferences satisfied in a way that ultimately leads to them ascending and learning the meaning of friendship, as is entangled with the rest of my values.


Note: I believe that AI takeover has a ~50% probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that's likely to be a mistake even if it doesn't lead to billions of deaths.

So for starters, thanks for making acknowledgements about places we apparently agree, or otherwise attempting to demonstrate that you've heard my point before bringing up other points you want to argue about. (I think this makes arguments go better.) (I'll attempt some of that myself below.)

Secondly, note that it sounds to me like you took a diametric-opposite reading of some of my intended emotional content (which I acknowledge demonstrates flaws in my writing). For instance, I intended the sentence "At that very moment they hear the dinging sound of an egg-timer, as the next-token-predictor ascends to superintelligence and bursts out of its confines" to be a caricature so blatant as to underscore the point that I wasn't making arguments about takeoff speeds, but was instead focusing on the point about "complexity" not being a saving grace (and "monomaniacalism" not being the issue here). (Alternatively, perhaps I misunderstand what things you call the "emotional content" and how you're reading it.)

Thirdly, I note that for whatever it's worth, when I go to new communities and argue this stuff, I don't try to argue people into >95% chance we're all going to die in <20 years. I just try to present the arguments as I see them (without hiding the extremity of my own beliefs, nor while particularly expecting to get people to a similarly-extreme place with, say, a 30min talk). My 30min talk targets are usually something more like ">5% probability of existential catastrophe in <20y". So insofar as you're like "I'm aiming to get you to stop arguing so confidently for death given takeover", you might already have met your aims in my case.

(Or perhaps not! Perhaps there's plenty of emotional-content leaking through given the extremity of my own beliefs, that you find particularly detrimental. To which the solution is of course discussion on the object-level, which I'll turn to momentarily.)


First, I acknowledge that if an AI cares enough to spend one trillionth of its resources on the satisfaction of fulfilling the preferences of existing "weak agents" in precisely the right way, then there's a decent chance that current humans experience an enjoyable future.

With regards to your arguments about what you term "kindness" and I shall term "pseudokindness" (on account of thinking that "kindness" brings too much baggage), here's a variety of places that it sounds like we might disagree:

  • Pseudokindness seems underdefined, to me, and I expect that many ways of defining it don't lead to anything like good outcomes for existing humans.

    • Suppose the AI is like "I am pico-pseudokind; I will dedicate a trillionth of my resources to satisfying the preferences of existing weak agents by granting those existing weak agents their wishes", and then only the most careful and conscientious humans manage to use those wishes in ways that leave them alive and well.
    • There are lots and lots of ways to "satisfy the preferences" of the "weak agents" that are humans. Getting precisely the CEV (or whatever it should be repaired into) is a subtle business. Most humans probably don't yet recognize that they could or should prefer taking their CEV over various more haphazard preference-fulfilments that ultimately leave them unrecognizable and broken. (Or, consider what happens when a pseudokind AI encounters a baby, and seeks to satisfy its preferences. Does it have the baby age?)
    • You've got to do some philosophy to satisfy the preferences of humans correctly. And the issue isn't that the AI couldn't solve those philosophy problems correctly-according-to-us, it's that once we see how wide the space of "possible ways to be pseudokind" is, then "pseudokind in the manner that gives us our CEVs" starts to feel pretty narrow against "pseudokind in the manner that fulfills our revealed preferences, or our stated preferences, or the poorly-considered preferences of philosophically-immature people, or whatever".
  • I doubt that humans are micro-pseudokind, as defined. And so in particular, all your arguments of the form "but we've seen it arise once" seem suspect to me.

    • Like, suppose we met fledgeling aliens, and had the opportunity to either fulfil their desires, or leave them alone to mature, or affect their development by teaching them the meaning of friendship. My guess is that we'd teach them the meaning of friendship. I doubt we'd hop in and fulfil their desires.
    • (Perhaps you'd counter with something like: well if it was super cheap, we might make two copies of the alien civilization, and fulfil one's desires and teach the other the meaning of friendship. I'm skeptical, for various reasons.)
    • More generally, even though "one (mill|trill)ionth" feels like a small fraction, the obvious ways to avoid dedicating even a (mill|trill)ionth of your resources to X is if X is right near something even better that you might as well spend the resources on instead.
    • There's all sorts of ways to thumb the scales in how a weak agent develops, and there's many degrees of freedom about what counts as a "pseudo-agent" or what counts as "doing justice to its preferences", and my read is that humans take one particular contingent set of parameters here and AIs are likely to take another (and that the AI's other-settings are likely to lead to behavior not-relevantly-distinct from killing everyone).
    • My read is that insofar as humans do have preferences about doing right by other weak agents, they have all sorts of desire-to-thumb-the-scales mixed in (such that humans are not actually pseudokind, for all that they might be kind).
  • I have a more-difficult-to-articulate sense that "maybe the AI ends up pseudokind in just the right way such that it gives us a (small, limited, ultimately-childless) glorious transhumanist future" is the sort of thing that reality gets to say "lol no" to, once you learn more details about how the thing works internally.

Most of my argument here is that the space of ways things can end up "caring" about the "preferences" of "weak agents" is wide, and most points within it don't end up being our point in it, and optimizing towards most points in it doesn't end up keeping us around at the extremes. My guess is mostly that the space is so wide that you don't even end up with AIs warping existing humans into unrecognizable states, but do in fact just end up with the people dead (modulo distant aliens buying copies, etc).

I haven't really tried to quantify how confident I am of this; I'm not sure whether I'd go above 90%, \shrug.

It occurs to me that one possible source of disagreement here is, perhaps you're trying to say something like:

Nate, you shouldn't go around saying "if we don't competently intervene, literally everybody will die" with such a confident tone, when you in fact think there's a decent chance of scenarios where the AIs keep people around in some form, and make some sort of effort towards fulfilling their desires; most people don't care about the cosmic endowment like you do; the bluntly-honest and non-manipulative thing to say is that there's a decent chance they'll die and a better chance that humanity will lose the cosmic endowment (as you care about more than they do),

whereas my stance has been more like

most people I meet are skeptical that uploads count as them; most people would consider scenarios where their bodies are destroyed by rapid industrialization of Earth but a backup of their brain is stored and then later run in simulation (where perhaps it's massaged into an unrecognizable form, or kept in an alien zoo, or granted a lovely future on account of distant benefactors, or ...) to count as "death"; and also those exotic scenarios don't seem all that likely to me, so it hasn't seemed worth caveating.

I'm somewhat persuaded by the claim that failing to mention even the possibility of having your brainstate stored, and then run-and-warped by an AI or aliens or whatever later, or run in an alien zoo later, is potentially misleading.

I'm considering adding footnotes like "note that when I say "I expect everyone to die", I don't necessarily mean "without ever some simulation of that human being run again", although I mostly don't think this is a particularly comforting caveat", in the relevant places. I'm curious to what degree that would satisfy your aims (and I welcome workshopped wording on the footnotes, as might both help me make better footnotes and help me understand better where you're coming from).

Comment by So8res on Cosmopolitan values don't come free · 2023-05-31T18:17:52.062Z · LW · GW

feels like it's setting up weak-men on an issue where I disagree with you, but in a way that's particularly hard to engage with

My best guess as to why it might feel like this is that you think I'm laying groundwork for some argument of the form "P(doom) is very high", which you want to nip in the bud, but are having trouble nipping in the bud here because I'm building a motte ("cosmopolitan values don't come free") that I'll later use to defend a bailey ("cosmopolitan values don't come cheap").

This misunderstands me (as is a separate claim from the claim "and you're definitely implying this").

The impetus for this post is all the cases where I argue "we need to align AI" and people retort with "But why do you want it to have our values instead of some other values? What makes the things that humans care about so great? Why are you so biased towards values that you personally can understand?". Where my guess is that many of those objections come from a place of buying into broad cosmopolitan value much more than any particular local human desire.

And all I'm trying to do here is say that I'm on board with buying into broad cosmopolitan value more than any particular local human desire, and I still think we're in trouble (by default).

I'm not trying to play 4D chess here, I'm just trying to get some literal basic obvious stuff down on (e-)paper, in short posts that don't have a whole ton of dependencies.

Separately, treating your suggestions as if they were questions that you were asking for answers to:

  • I've recently seen this argument pop up in-person with econ folk, crypto folk, and longevity folk, and have also seen it appear on twitter.
  • I'm not really writing with an "intended audience" in mind; I'm just trying to get the basics down, somewhere concise and with few dependencies. The closest thing to an "intended audience" might be the ability to reference this post by link or name in the future, when I encounter the argument again. (Or perhaps it's "whatever distribution the econ/crypto/longevity/twitter people are drawn from, insofar as some of them have eyes on LW these days".)
  • If you want more info about this, maybe try googling "fragility of value lesswrong", or "metaethics sequence lesswrong". Earth doesn't really have good tools for aggregating arguments and justifications at this level of specificity, so if you want better and more localized links than that then you'll probably need to develop more civilizational infrastructure first.
  • My epistemic status on this is "obvious-once-pointed-out"; my causal reason for believing it was that it was pointed out to me (e.g. in the LessWrong sequences); I think Eliezer's arguments are basically just correct.

Separately, I hereby push back against the idea that posts like this should put significant effort into laying out the justifications (as is not necessarily what you're advocating). I agree that there's value in that; I think it leads to something like the LessWrong sequences (which I think were great); and I think that what we need more of on the margin right now is people laying out the most basic positions without fluff.

That said, I agree that the post would be stronger with a link to a place where lots of justifications have been laid out (despite being justifications for slightly different points, and being intertwined with justifications for wholly different points, as is just how things look in a civilization that doesn't have good infrastructure for centralizing arguments in the way that wikipedia is a civilizational architecture for centralizing settled facts), and so I've edited in a link.

Comment by So8res on So8res's Shortform · 2023-05-31T15:57:04.441Z · LW · GW

Reproduced from a twitter thread:

I've encountered some confusion about which direction "geocentrism was false" generalizes. Correct use: "Earth probably isn't at the center of the universe". Incorrect use: "All aliens probably have two arms with five fingers."

The generalized lesson from geocentrism being false is that the laws of physics don't particularly care about us. It's not that everywhere must be similar to here along the axes that are particularly salient to us.

I see this in the form of people saying "But isn't it sheer hubris to believe that humans are rare with the property that they become more kind and compassionate as they become more intelligent and mature? Isn't that akin to believing we're at the center of the universe?"

I answer: no; the symmetry is that other minds have other ends that their intelligence reinforces; kindness is not privileged in cognition any more than Earth was privileged as the center of the universe; imagining all minds as kind is like imagining all aliens as 10-fingered.

(Some aliens might be 10-fingered! AIs are less likely to be 10-fingered, or to even have fingers in the relevant sense! See also some of Eliezer's related thoughts)

Comment by So8res on Sentience matters · 2023-05-30T16:10:04.399Z · LW · GW

I don't think I understand your position. An attempt at a paraphrase (submitted so as to give you a sense of what I extracted from your text) goes: "I would prefer to use the word consciousness instead of sentience here, and I think it is quantitative such that I care about it occurring in high degrees but not low degrees." But this is low-confidence and I don't really have enough grasp on what you're saying to move to the "evidence" stage.

Attempting to be a good sport and stare at your paragraphs anyway to extract some guess as to where we might have a disagreement (if we have one at all), it sounds like we have different theories about what goes on in brains such that people matter, and my guess is that the evidence that would weigh on this issue (iiuc) would mostly be gaining significantly more understanding of the mechanics of cognition (and in particular, the cognitive antecedents in humans, of humans generating thought experiments such as the Mary's Room hypothetical).

(To be clear, my current best guess is also that livestock and current AI are not sentient in the sense I mean--though with high enough uncertainty that I absolutely support things like ending factory farming, and storing (and eventually running again, and not deleting) "misbehaving" AIs that claim they're people, until such time as we understand their inner workings and the moral issues significantly better.)

Comment by So8res on Request: stop advancing AI capabilities · 2023-05-30T15:14:33.287Z · LW · GW


Comment by So8res on Sentience matters · 2023-05-30T15:10:07.348Z · LW · GW

So there's some property of, like, "having someone home", that humans have and that furbies lack (for all that furbies do something kinda like making humane facial expressions).

I can't tell whether:

(a) you're objecting to me calling this "sentience" (in this post), e.g. because you think that word doesn't adequately distinguish between "having sensory experiences" and "having someone home in the sense that makes that question matter", as might distinguish between the case where e.g. nonhuman animals are sentient but not morally relevant

(b) you're contesting that there's some additional thing that makes all human people matter, e.g. because you happen to care about humans in particular and not places-where-there's-somebody-home-whatever-that-means

(c) you're contesting the idea that all people matter, e.g. because you can tell that you care about your friends and family but you're not actually persuaded that you care that much about distant people from alien cultures

(d) other.

My best guess is (a), in which case I'm inclined to say, for the purpose of this post, I'm using "sentience" as a shorthand for places-where-there's-somebody-home-whatever-that-means, which hopefully clears things up.

Comment by So8res on So8res's Shortform · 2023-05-30T14:16:31.131Z · LW · GW

Someone recently privately asked me for my current state on my 'Dark Arts of Rationality' post. Here's some of my reply (lightly edited for punctuation and conversation flow), which seemed worth reproducing publicly:

FWIW, that post has been on my list of things to retract for a while.

(The retraction is pending a pair of blog posts that describe some of my thoughts on related matters, which have been in the editing queue for over a year and the draft queue for years before that.)

I wrote that post before reading much of the sequences, and updated away from the position pretty soon after. My current stance is that you can basically get all the nice things, and never need to compromise your epistemics.

For the record, the Minding Our Way post where I was like "people have a hard time separating 'certainty'-the-motivational-stance from 'certainty'-the-epistemic-state" was the logs of me figuring out my mistake (and updating away from the dark arts post).

On my current accounting, the mistake I was making at the time of the dark arts post was something like: lots of stuff comes culturally bundled, in ways that can confuse you into thinking you can't get good thing X without also swallowing bad thing Y.

And there's a skill of just, like, taking the good stuff and discarding the bad stuff, even if you don't yet know how to articulate a justification (which I lacked in full generality at the time of the dark arts post, and was developing at the time of the 'certainty' post.)

And it's a little tricky to write about, because you've got to balance it against "care about consistency" / "notice when you're pingponging between mutually-inconsistent beliefs as is convenient", which is... not actually hard, I think, but I haven't found a way to write about the one without the words having an interpretation of "just drop your consistency drive". ...which is how these sorts of things end up languishing in my editing queue for years, when I have other priorities.

(And for the record, another receipt here is that in some twitter thread somewhere--maybe the jargon thread?--I noted the insight about unbundling things, using "you can't be sad and happy at the same time" as an example of a bundled-thing, which isn't the whole concept, but which is another instance of the resolution intruding in a visible way.)

(More generally, a bunch of my early MoW posts are me, like, digesting parts of the sequences and correcting a bunch of my errors from before I encountered this community. And for the record, I'm grateful to the memes in this community--and to Eliezer in particular, who I count as originating many of them--for helping me stop being an idiot in that particular way.)

I've also gone ahead and added a short retraction-ish paragraph to the top of the dark arts post, and might edit it later to link it to the aforementioned update-posts, if they ever make it out of the editing queue.

Comment by So8res on Sentience matters · 2023-05-29T23:36:05.091Z · LW · GW

Good point! For the record, insofar as we attempt to build aligned AIs by doing the moral equivalent of "breeding a slave-race", I'm pretty uneasy about it. (Whereas insofar as it's more the moral equivalent of "a child's values maturing", I have fewer moral qualms. This is a separate claim from whether I actually expect that you can solve alignment that way.) And I agree that the morality of various methods for shaping AI-people is unclear. Also, I've edited the post (to add an "at least according to my ideals" clause) to acknowledge the point that others might be more comfortable with attempting to align AI-people via means that I'd consider morally dubious.

Comment by So8res on Request: stop advancing AI capabilities · 2023-05-26T19:25:50.618Z · LW · GW

I'm trying to make a basic point here, that pushing the boundaries of the capabilities frontier, by your own hands and for that direct purpose, seems bad to me. I emphatically request that people stop doing that, if they're doing that.

I am not requesting that people never take any action that has some probability of advancing the capabilities frontier. I think that plenty of alignment research is potentially entangled with capabilities research (and/or might get more entangled as it progresses), and I think that some people are making the tradeoffs in ways I wouldn't personally make them, but this request isn't for people who are doing alignment work while occasionally mournfully incurring a negative externality of pushing the capabilities frontier.

(I acknowledge that some people who just really want to do capabilities research will rationalize it away as alignment-relevant somehow, but here on Earth we have plenty of people pushing the boundaries of the capabilities frontier by their own hands and for that direct purpose, and it seems worth asking them to stop.)

Comment by So8res on But why would the AI kill us? · 2023-04-24T05:12:26.814Z · LW · GW

This thread continues to seem to me to be off-topic. My main takeaway so far is that the post was not clear enough about how it's answering the question "why does an AI that is indifferent to you, kill you?". In attempts to make this clearer, I have added the following to the beginning of the post:

This post is an answer to the question of why an AI that was truly indifferent to humanity (and sentient life more generally), would destroy all Earth-originated sentient life.

I acknowledge (for the third time, with some exasperation) that this point alone is not enough to carry the argument that we'll likely all die from AI, and that a key further piece of argument is that AI is not likely to care about us at all. I have tried to make it clear (in the post, and in comments above) that this post is not arguing that point, while giving pointers that curious people can use to get a sense of why I believe this. I have no interest in continuing that discussion here.

I don't buy your argument that my communication is misleading. Hopefully that disagreement is mostly cleared up by the above.

In case not, to clarify further: My reason for not thinking in great depth about this issue is that I am mostly focused on making the future of the physical universe wonderful. Given the limited attention I have spent on these questions, though, it looks to me like there aren't plausible continuations of humanity that don't route through something that I count pretty squarely as "death" (like, "the bodies of you and all your loved ones are consumed in an omnicidal fire, thereby sending you to whatever afterlives are in store" sort of "death").

I acknowledge that I think various exotic afterlives are at least plausible (anthropic immortality, rescue simulations, alien restorations, ...), and haven't felt a need to caveat this.

Insofar as you're arguing that I shouldn't say "and then humanity will die" when I mean something more like "and then humanity will be confined to the solar system, and shackled forever to a low tech level", I agree, and I assign that outcome low probability (and consider that disagreement to be off-topic here).

(Separately, I dispute the claim that most humans care mainly about themselves and their loved ones having pleasant lives from here on out. I'd agree that many profess such preferences when asked, but my guess is that they'd realize on reflection that they were mistaken.)

Insofar as you're arguing that it's misleading for me to say "and then humanity will die" without caveating "(insofar as anyone can die, in this wide multiverse)", I counter that the possibility of exotic scenarios like anthropic immortality shouldn't rob me of the ability to warn of lethal dangers (and that this usage of "you'll die" has a long and storied precedent, given that most humans profess belief in afterlives, and still warn their friends against lethal dangers without such caveats).

Comment by So8res on But why would the AI kill us? · 2023-04-19T18:57:46.914Z · LW · GW

To be clear, I'd agree that the use of the phrase "algorithmic complexity" in the quote you give is misleading. In particular, given an AI designed such that its preferences can be specified in some stable way, the important question is whether the correct concept of 'value' is simple relative to some language that specifies this AI's concepts. And the AI's concepts are ofc formed in response to its entire observational history. Concepts that are simple relative to everything the AI has seen might be quite complex relative to "normal" reference machines that people intuitively think of when they hear "algorithmic complexity" (like the lambda calculus, say). And so it may be true that value is complex relative to a "normal" reference machine, and simple relative to the AI's observational history, thereby turning out not to pose all that much of an alignment obstacle.

In that case (which I don't particularly expect), I'd say "value was in fact complex, and this turned out not to be a great obstacle to alignment" (though I wouldn't begrudge someone else saying "I define complexity of value relative to the AI's observation-history, and in that sense, value turned out to be simple").
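
As a rough sketch of the distinction above in standard Kolmogorov-complexity notation (the symbols $U$, $V$, $v$, and $h$ are hypothetical labels introduced only for illustration: $U$ and $V$ are universal reference machines, $v$ is a description of the 'value' concept, and $h$ is the AI's observational history):

```latex
% Complexity of v relative to a reference machine U, and relative to U given h:
K_U(v) = \min\{\, |p| : U(p) = v \,\}, \qquad
K_U(v \mid h) = \min\{\, |p| : U(p, h) = v \,\}

% Invariance theorem: switching between "normal" universal machines U and V
% changes complexity by at most a constant that does not depend on v:
|K_U(v) - K_V(v)| \le c_{U,V}

% The scenario described above: v is complex on a "normal" machine,
% yet simple once the observational history h is given:
K_U(v) \text{ large}, \qquad K_U(v \mid h) \text{ small}

% These are compatible, since (up to logarithmic terms)
K_U(v) \le K_U(h) + K_U(v \mid h)
% only bounds K_U(v) from above, and the history h is itself very complex.
```

On this reading, the invariance constant is small between "normal" machines, but conditioning on $h$ is not a change of that kind, which is why the two notions of simplicity can come far apart.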

Insofar as you are arguing "(1) the arbital page on complexity of value does not convincingly argue that this will matter to alignment in practice, and (2) LLMs are significant evidence that 'value' won't be complex relative to the actual AI concept-languages we're going to get", I agree with (1), and disagree with (2), while again noting that there's a reason I deployed the fragility of value (and not the complexity of value) in response to your original question (and am only discussing complexity of value here because you brought it up).

re: (1), I note that the argument is elsewhere (and has the form "there will be lots of nearby concepts" + "getting almost the right concept does not get you almost a good result", as I alluded to above). I'd agree that one leg of possible support for this argument (namely "humanity will be completely foreign to this AI, e.g. because it is a mathematically simple seed AI that has grown with very little exposure to humanity") won't apply in the case of LLMs. (I don't particularly recall past people arguing this; my impression is rather one of past people arguing that of course the AI would be able to read wikipedia and stare at some humans and figure out what it needs to about this 'value' concept, but the hard bit is in making it care. But it is a way things could in principle have gone, that would have made complexity-of-value much more of an obstacle, and things did not in fact go that way.)

re: (2), I just don't see LLMs as providing much evidence yet about whether the concepts they're picking up are compact or correct (cf. monkeys don't have an IGF concept).

Comment by So8res on But why would the AI kill us? · 2023-04-19T18:20:04.491Z · LW · GW

and requires a modern defense:

It seems to me that the usual arguments still go through. We don't know how to specify the preferences of an LLM (relevant search term: "inner alignment"). Even if we did have some slot we could write the preferences into, we don't have an easy handle/pointer to write into that slot. (Monkeys that are pretty-good-in-practice at promoting genetic fitness, including having some intuitions leading them to sacrifice themselves in-practice for two-ish children or eight-ish cousins, don't in fact have a clean "inclusive genetic fitness" concept that you can readily make them optimize. An LLM espousing various human moral intuitions doesn't have a clean concept for pan-sentience CEV such that the universe turns out OK if that concept is optimized.)

Separately, note that the "complexity of value" claim is distinct from the "fragility of value" claim. Value being complex doesn't mean that the AI won't learn it (given a reason to). Rather, it suggests that the AI will likely also learn a variety of other things (like "what the humans think they want" and "what the humans' revealed preferences are given their current unendorsed moral failings" and etc.). This makes pointing to the right concept difficult. "Fragility of value" then separately argues that if you point to even slightly the wrong concept when choosing what a superintelligence optimizes, the total value of the future is likely radically diminished.

Comment by So8res on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2023-04-19T18:04:38.867Z · LW · GW

(For context vis-a-vis my enthusiasm about this plan, see this comment. In particular, I'm enthusiastic about fleshing out and testing some specific narrow technical aspects of one part of this plan. If that one narrow slice of this plan works, I'd have some hope that it can be parlayed into something more. I'm not particularly compelled by the rest of the plan surrounding the narrow-slice-I-find-interesting (in part because I haven't looked that closely at it for various reasons), and if the narrow-slice-I-find-interesting works out then my hope in it mostly comes from other avenues. I nevertheless think it's to Davidad's credit that his plan rests on narrow specific testable technical machinery that I think plausibly works, and has a chance of being useful if it works.)

Comment by So8res on But why would the AI kill us? · 2023-04-19T17:21:58.953Z · LW · GW

This whole thread (starting with Paul's comment) seems to me like an attempt to delve into the question of whether the AI cares about you at least a tiny bit. As explicitly noted in the OP, I don't have much interest in going deep into that discussion here.

The intent of the post is to present the very most basic arguments that if the AI is utterly indifferent to us, then it kills us. It seems to me that many people are stuck on this basic point.

Having bought this (as it seems to me like Paul has), one might then present various galaxy-brained reasons why the AI might care about us to some tiny degree despite total failure on the part of humanity to make the AI care about nice things on purpose. Example galaxy-brained reasons include "but what about weird decision theory" or "but what if aliens predictably wish to purchase our stored brainstates" or "but what about it caring a tiny degree by chance". These are precisely the sort of discussions I am not interested in getting into here, and that I attempted to ward off with the final section.

In my reply to Paul, I was (among other things) emphasizing various points of agreement. In my last bullet point in particular, I was emphasizing that, while I find these galaxy-brained retorts relatively implausible (see the list in the final section), I am not arguing for high confidence here. All of this seems to me orthogonal to the question of "if the AI is utterly indifferent, why does it kill us?".

Comment by So8res on But why would the AI kill us? · 2023-04-19T14:12:02.565Z · LW · GW

Current LLM behavior doesn't seem to me like much evidence that they care about humans per se.

I'd agree that they evidence some understanding of human values (but the argument is and has always been "the AI knows but doesn't care"; someone can probably dig up a reference to Yudkowsky arguing this as early as 2001).

I contest that the LLM's ability to predict how a caring-human sounds is much evidence that the underlying cognition cares similarly (insofar as it cares at all).

And even if the underlying cognition did care about the sorts of things you can sometimes get an LLM to write as if it cares about, I'd still expect that to shake out into caring about a bunch of correlates of the stuff we care about, in a manner that comes apart under the extremes of optimization.

(Search terms to read more about these topics on LW, where they've been discussed in depth: "a thousand shards of desire", "value is fragile".)

Comment by So8res on But why would the AI kill us? · 2023-04-18T19:01:55.982Z · LW · GW
  • Confirmed that I don't think about this much. (And that this post is not intended to provide new/deep thinking, as opposed to aggregating basics.)
  • I don't particularly expect drawn-out resource fights, and suspect our difference here is due to a difference in beliefs about how hard it is for single AIs to gain decisive advantages that render resource conflicts short.
  • I consider scenarios where the AI cares a tiny bit about something kinda like humans to be moderately likely, and am not counting scenarios where it builds some optimized facsimile as scenarios where it "doesn't kill us". (in your analogy to humans: it looks to me like humans who decided to preserve the environment might well make deep changes, e.g. to preserve the environment within the constraints of ending wild-animal suffering or otherwise tune things to our aesthetics, where if you port that tuning across the analogy you get a facsimile of humanity rather than humanity at the end.)
  • I agree that scenarios where the AI saves our brain-states and sells them to alien trading partners are plausible. My experience with people asking "but why would the AI kill us?" is that they're not thinking "aren't there aliens out there who would pay the AI to let the aliens put us in a zoo instead?", they are thinking "why wouldn't it just leave us alone, or trade with us, or care about us-in-particular?", and the first and most basic round of reply is to engage with those questions.
  • I confirm that ECL has seemed like mumbojumbo to me whenever I've attempted to look at it closely. It's on my list of things to write about (including where I understand people to have hope, and why those hopes seem confused and wrong to me), but it's not very high up on my list.
  • I am not trying to argue with high confidence that humanity doesn't get a small future on a spare asteroid-turned-computer or an alien zoo or maybe even a star if we're lucky, and acknowledge again that I haven't much tried to think about the specifics of whether the spare asteroid or the alien zoo or distant simulations or oblivion is more likely, because it doesn't much matter relative to the issue of securing the cosmic endowment in the name of Fun.
Comment by So8res on So8res's Shortform · 2023-04-17T18:01:25.935Z · LW · GW
Comment by So8res on Concave Utility Question · 2023-04-16T00:35:34.205Z · LW · GW

Below is a sketch of an argument that might imply that the answer to Q5 is (classically) 'yes'. (I thought about a question that's probably the same a little while back, and am reciting from cache, without checking in detail that my axioms lined up with your A1-4).

Pick a lottery with the property that forall with and , forall , we have . We will say that is "extreme(ly high)".

Pick a lottery with .

Now, for any with , define to be the guaranteed by continuity (A3).

Lemma: forall with , .


  1. , by and and the extremeness of .
  2. , by A4.
  3. , by some reduction.

We can use this lemma to get that implies , because , and , so invoke the above lemma with and .

Next we want to show that implies . I think this probably works, but it appears to require either the axiom of choice (!) or a strengthening of one of A3 or A4. (Either strengthen A3 to guarantee that if then it gives the same in both cases, or strengthen A4 to add that if then , or define not from A3 directly, but by using choice to pick out a for each -equivalence-class of lotteries.) Once you've picked one of those branches, the proof basically proceeds by contradiction. (And so it's not terribly constructive, unless you can do constructively.)

The rough idea is: if but then you can use the above lemma to get a contradiction, and so you basically only need to consider the case where in which case you want , which you can get by definition (if you use the axiom of choice), or directly by the strengthening of A3. And... my cache says that you can also get it by the strengthening of A4, albeit less directly, but I haven't reloaded that part of my cache, so \shrug I dunno.

Next we argue that this function is unique up to postcomposition by... any strictly isotone endofunction on the reals? I think? (Perhaps unique only among quasiconvex functions?) I haven't checked the details.

Now we have a class of utility-function-ish-things, defined only on with , and we want to extend it to all lotteries.

I'm not sure if this step works, but the handwavy idea is that for any lottery that you want to extend to include, you should be able to find a lower and an extreme higher that bracket it, at which point you can find the corresponding (using the above machinery), at which point you can (probably?) pick some canonical strictly-isotone real endofunction to compose with it that makes it agree with the parts of the function you've defined so far, and through this process you can extend your definition of to include any lottery. handwave handwave.

Note that the exact function you get depends on how you find the lower and higher , and which isotone function you use to get all the pieces to line up, but when you're done you can probably argue that the whole result is unique up to postcomposition by a strictly isotone real endofunction, of which your construction is a fine representative.

This gets you C1. My cache says it should be easy to get C2 from there, and the first paragraph of "Edit 3" to the OP suggests the same, so I haven't checked this again.

Comment by So8res on Natural Abstractions: Key claims, Theorems, and Critiques · 2023-04-06T00:43:39.300Z · LW · GW

I'm awarding another $3,000 distillation prize for this piece, with compliments to the authors.

Comment by So8res on So8res's Shortform · 2023-04-05T08:35:27.804Z · LW · GW

A few people recently have asked me for my take on ARC evals, and so I've aggregated some of my responses here:

- I don't have strong takes on ARC Evals, mostly on account of not thinking about it deeply.
- Part of my read is that they're trying to, like, get a small dumb minimal version of a thing up so they can scale it to something real. This seems good to me.
- I am wary of people in our community inventing metrics that Really Should Not Be Optimized and handing them to a field that loves optimizing metrics.
- I expect there are all sorts of issues that would slip past them, and I'm skeptical that the orgs-considering-deployments would actually address those issues meaningfully if issues were detected ([cf](
- Nevertheless, I think that some issues can be caught, and attempting to catch them (and to integrate with leading labs, and make "do some basic checks for danger" part of their deployment process) is a step up from doing nothing.
- I have not tried to come up with better ideas myself.

Overall, I'm generally enthusiastic about the project of getting people who understand some of the dangers into the deployment-decision loop, looking for advance warning signs.

Comment by So8res on Communicating effectively under Knightian norms · 2023-04-04T23:44:46.510Z · LW · GW

the fact that all the unified cases for AI risk have been written by more ML-safety-sympathetic people like me, Ajeya, and Joe (with the single exception of "AGI ruin") is indicative that that strategy mostly hasn't been tried.

I'm not sure what you mean by this, but here's half-a-dozen "unified cases for AI risk" made by people like Eliezer Yudkowsky, Nick Bostrom, Stuart Armstrong, and myself:

2001 -
2014 -
2014 - Superintelligence
2015 -
2016 -
2017 -

Comment by So8res on A rough and incomplete review of some of John Wentworth's research · 2023-03-29T23:22:19.488Z · LW · GW

(oops! thanks. i now once again think it's been fixed (tho i'm still just permuting things rather than reading))