Posts

Open Thread Summer 2024 2024-06-11T20:57:18.805Z
"AI Safety for Fleshy Humans" an AI Safety explainer by Nicky Case 2024-05-03T18:10:12.478Z
Goal oriented cognition in "a single forward pass" 2024-04-22T05:03:18.649Z
Express interest in an "FHI of the West" 2024-04-18T03:32:58.592Z
Structured Transparency: a framework for addressing use/mis-use trade-offs when sharing information 2024-04-11T18:35:44.824Z
LessWrong's (first) album: I Have Been A Good Bing 2024-04-01T07:33:45.242Z
How useful is "AI Control" as a framing on AI X-Risk? 2024-03-14T18:06:30.459Z
Open Thread Spring 2024 2024-03-11T19:17:23.833Z
Is a random box of gas predictable after 20 seconds? 2024-01-24T23:00:53.184Z
Will quantum randomness affect the 2028 election? 2024-01-24T22:54:30.800Z
Vote in the LessWrong review! (LW 2022 Review voting phase) 2024-01-17T07:22:17.921Z
AI Impacts 2023 Expert Survey on Progress in AI 2024-01-05T19:42:17.226Z
Originality vs. Correctness 2023-12-06T18:51:49.531Z
The LessWrong 2022 Review 2023-12-05T04:00:00.000Z
Open Thread – Winter 2023/2024 2023-12-04T22:59:49.957Z
Complex systems research as a field (and its relevance to AI Alignment) 2023-12-01T22:10:25.801Z
How useful is mechanistic interpretability? 2023-12-01T02:54:53.488Z
My techno-optimism [By Vitalik Buterin] 2023-11-27T23:53:35.859Z
"Epistemic range of motion" and LessWrong moderation 2023-11-27T21:58:40.834Z
Debate helps supervise human experts [Paper] 2023-11-17T05:25:17.030Z
How much to update on recent AI governance moves? 2023-11-16T23:46:01.601Z
AI Timelines 2023-11-10T05:28:24.841Z
How to (hopefully ethically) make money off of AGI 2023-11-06T23:35:16.476Z
Integrity in AI Governance and Advocacy 2023-11-03T19:52:33.180Z
What's up with "Responsible Scaling Policies"? 2023-10-29T04:17:07.839Z
Trying to understand John Wentworth's research agenda 2023-10-20T00:05:40.929Z
Trying to deconfuse some core AI x-risk problems 2023-10-17T18:36:56.189Z
How should TurnTrout handle his DeepMind equity situation? 2023-10-16T18:25:38.895Z
The Lighthaven Campus is open for bookings 2023-09-30T01:08:12.664Z
Navigating an ecosystem that might or might not be bad for the world 2023-09-15T23:58:00.389Z
Long-Term Future Fund Ask Us Anything (September 2023) 2023-08-31T00:28:13.953Z
Open Thread - August 2023 2023-08-09T03:52:55.729Z
Long-Term Future Fund: April 2023 grant recommendations 2023-08-02T07:54:49.083Z
Final Lightspeed Grants coworking/office hours before the application deadline 2023-07-05T06:03:37.649Z
Correctly Calibrated Trust 2023-06-24T19:48:05.702Z
My tentative best guess on how EAs and Rationalists sometimes turn crazy 2023-06-21T04:11:28.518Z
Lightcone Infrastructure/LessWrong is looking for funding 2023-06-14T04:45:53.425Z
Launching Lightspeed Grants (Apply by July 6th) 2023-06-07T02:53:29.227Z
Yoshua Bengio argues for tool-AI and to ban "executive-AI" 2023-05-09T00:13:08.719Z
Open & Welcome Thread – April 2023 2023-04-10T06:36:03.545Z
Shutting Down the Lightcone Offices 2023-03-14T22:47:51.539Z
Review AI Alignment posts to help figure out how to make a proper AI Alignment review 2023-01-10T00:19:23.503Z
Kurzgesagt – The Last Human (Youtube) 2022-06-29T03:28:44.213Z
Replacing Karma with Good Heart Tokens (Worth $1!) 2022-04-01T09:31:34.332Z
Apply to the ML for Alignment Bootcamp (MLAB) in Berkeley [Jan 3 - Jan 22] 2021-11-03T18:22:58.879Z
The LessWrong Team is now Lightcone Infrastructure, come work with us! 2021-10-01T01:20:33.411Z
Welcome & FAQ! 2021-08-24T20:14:21.161Z
Berkeley, CA – ACX Meetups Everywhere 2021 2021-08-23T08:50:51.898Z
The Death of Behavioral Economics 2021-08-22T22:39:12.697Z
Open and Welcome Thread – August 2021 2021-08-15T05:59:05.270Z

Comments

Comment by habryka (habryka4) on Index of rationalist groups in the Bay July 2024 · 2024-07-26T20:00:15.152Z · LW · GW

Thanks for doing this! This list seems roughly accurate. 

Comment by habryka (habryka4) on Universal Basic Income and Poverty · 2024-07-26T07:25:57.656Z · LW · GW

Note: I crossposted this for Eliezer, after asking him for permission, because I thought it was a good essay. It was originally written for Twitter, so is not centrally aimed at a LW audience, but I still think it's a good essay to have on the site.

Comment by habryka (habryka4) on Open Thread Summer 2024 · 2024-07-24T18:15:12.196Z · LW · GW

Yeah, it's a mod-internal alternative to the AI algorithm for the recommendations tab (it uses Google Vertex instead).

Comment by habryka (habryka4) on Closed Limelike Curves's Shortform · 2024-07-24T02:14:32.471Z · LW · GW

I mean, I think it would be totally reasonable for someone who is doing some decision theory or some epistemology work, to come up with new "dutch book arguments" supporting whatever axioms or assumptions they would come up with. 

I think I am more compelled that there is a history here of calling money pump arguments that happen to relate to probabilism "dutch books", but I don't think there is really any clear definition that supports this. I agree that there exists the dutch book theorem, and that that one importantly relates to probabilism, but I've just had dozens of conversations with academics, philosophers, and decision-theorists where, in the context of both decision-theory and epistemology questions, people brought up dutch books and money pumps interchangeably.

Comment by habryka (habryka4) on Closed Limelike Curves's Shortform · 2024-07-23T22:33:12.453Z · LW · GW

I've pretty consistently (by many different people) seen "Dutch Book arguments" used interchangeably with money pumps. My understanding (which is also the SEP's) is that "what is a money pump vs. a dutch book argument" is not particularly well-defined and the structure of the money pump arguments is basically the same as the structure of the dutch book arguments. 

This is evident from just the basic definitions: 

"A Dutch book is a set of bets that ensures a guaranteed loss, i.e. the gambler will lose money no matter what happens." 

Which is of course exactly what a money pump is (where you are the person offering the gambles and therefore make guaranteed money).

The money pump Wikipedia article also links to the Dutch book article, and the book/paper I linked describes dutch books as a kind of money pump argument. I have never heard anyone make a principled distinction between a money pump argument and a dutch book argument (and I don't see how you could get one without the other).

Indeed, the Oxford Reference says explicitly: 

money pump

A pattern of intransitive or cyclic preferences causing a decision maker to be willing to pay repeated amounts of money to have these preferences satisfied without gaining any benefit. [...] Also called a Dutch book [...]

(Edit: It's plausible that for weird historical reasons the exact same argument, when applied to probabilism would be called a "dutch book" and when applied to anything else would be called a "money pump", but I at least haven't seen anyone defend that distinction, and it doesn't seem to follow from any of the definitions)

Comment by habryka (habryka4) on Closed Limelike Curves's Shortform · 2024-07-23T20:56:02.970Z · LW · GW

Well, thinking harder about this, I do think your critiques of some of these are wrong. For example, it is the case that the VNM axioms frequently get justified by invoking dutch books (the most obvious case is the argument for transitivity, where the standard response is "well, if you have circular preferences I can charge you a dollar to have you end up where you started").
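
To make the mechanism concrete, here is a minimal sketch in Python of that dollar-extraction argument (the agent, its cyclic preferences, and the fee are all hypothetical illustration, not anyone's formalization):

    # A hypothetical agent with cyclic preferences A > B > C > A will pay a
    # small fee for each "upgrade"; after three trades it holds the same item
    # it started with and is strictly poorer.
    prefers = {"B": "A", "C": "B", "A": "C"}  # prefers[x] is liked strictly more than x
    FEE = 1.0  # dollars charged per trade

    def run_pump(start_item, wealth, rounds=3):
        item = start_item
        for _ in range(rounds):
            # The agent strictly prefers prefers[item] to item, so (by
            # hypothesis) it accepts the trade at a small cost.
            item, wealth = prefers[item], wealth - FEE
        return item, wealth

    print(run_pump("A", wealth=10.0))  # -> ('A', 7.0): same item, three dollars poorer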

Of course, justifying axioms is messy, and there isn't any particularly objective way of choosing axioms here, but in as much as informal argumentation happens, it tends to use a dutch book like structure. I've had many conversations with people with formal experience in academia and economics here, and this is definitely a normal way for dutch books to go. 

For a concrete example of this, see this recent book/paper: https://www.iffs.se/media/23568/money-pump-arguments.pdf 

Comment by habryka (habryka4) on jacobjacob's Shortform Feed · 2024-07-23T18:37:28.965Z · LW · GW

Huh, this is a good quote.

Comment by habryka (habryka4) on Closed Limelike Curves's Shortform · 2024-07-23T18:26:34.036Z · LW · GW

Or to let me know that some of the issues I mention were already on Wikipedia beforehand. I’d be happy to try to edit those.

None of these changes are new as far as I can tell (I checked the first three), so I think your basic critique falls through. You can check the edit history yourself by just clicking on the "View History" button and then pressing the "cur" button next to the revision entry you want to see the diff for. 

Like, indeed, the issues you point out are issues, but it is not the case that people reading this have made the articles worse. The articles were already bad, and "acting with considerable care" in a way that implies inaction would mean leaving inaccuracies uncorrected. 

I think people should edit these pages, and I expect them to get better if people give it a real try. I also think you could give it a try and likely make things better.

Edit: Actually, I think my deeper objection is that most of the critiques here (made by Sammy) are just wrong. For example, of course Dutch books/money pumps frequently get invoked to justify VNM axioms. See for example this.

Comment by habryka (habryka4) on Towards more cooperative AI safety strategies · 2024-07-22T09:42:07.500Z · LW · GW

I have spent like 40% of the last 1.5 years trying to reform EA. I think I had a small positive effect, but it's also been extremely tiring and painful, and I consider my duty with regards to this done. Buy-in for reform among leadership is very low, and people seem primarily interested in short-term power-seeking and ass-covering.

The memo I mentioned in another comment has a bunch of analysis; I'll send it to you tomorrow when I am at my laptop.

For some more fundamental analysis I also have this post, though it's only a small part of the picture: https://www.lesswrong.com/posts/HCAyiuZe9wz8tG6EF/my-tentative-best-guess-on-how-eas-and-rationalists

Comment by habryka (habryka4) on Towards more cooperative AI safety strategies · 2024-07-22T09:38:37.191Z · LW · GW

The leadership of these is mostly shared. There are many good parts of EA, and reform would be better than shutting down, but reform seems unlikely at this point.

My world model mostly predicts that effects on technological development and the long-term future dominate, so in as much as the non-AI related parts of EA are good or bad, I think what matters is their effect on that. Mostly the effect seems small, and quibbling over the sign doesn't super seem worth it.

I do think there is often an annoying motte and bailey going on where people try to critique EA for its negative effects on the important things, and those get redirected to "but you can't possibly be against bednets", and in as much as the bednet people are willingly participating in that (as seems likely the case for e.g. Open Phil's reputation), that seems bad.

Comment by habryka (habryka4) on quila's Shortform · 2024-07-21T22:28:00.942Z · LW · GW

As a moderator: I do think sunwillrise was being a bit obnoxious here. I think the norms they used here were fine for frontpage LW posts, but shortform is trying to do something that is more casual and more welcoming of early-stage ideas, and this kind of psychologizing I think has reasonably strong chilling-effects on people feeling comfortable with that. 

I don't think it's a huge deal, my best guess is I would just ask sunwillrise to comment less on quila's stuff in-particular, and if it becomes a recurring theme, to maybe more generally try to change how they comment on shortforms.

I do think the issue here is kind of subtle. I definitely notice an immune reaction to sunwillrise's original comment, but I can't fully put into words why I have that reaction, and I would also have that reaction if it was made as a comment on a frontpage post (but I would just be more tolerant of it). 

I think the fact that you don't expect this to happen is more due to you improperly generalizing from the community of LW-attracted people (including yourself), whose average psychological make-up appears to me to be importantly different from that of the broader public.

Like, I think my key issue here is that sunwillrise just started a whole new topic that quila had expressed no interest in talking about, which is the topic of "what are my biases on this topic, and if I am wrong, what would be the reason I am wrong?", which like, IDK, is a fine topic, but it is just a very different topic that doesn't really have anything to do with the object level. Like, whether quila is biased on this topic does not make a difference to the question of whether this policy-esque proposal would be a good idea, and I think quila (and most other readers) are usually more interested in discussing that than meta-level bias stuff.

There is also a separate thing, where making this argument in some sense assumes that you are right, which I think is a fine thing to do, but does often make good discussion harder. Like, I think for comments, it's usually best to focus on the disagreement, and not to invoke random other inferences about the world about what is true if you are right. There can be a place for that, especially if it helps elucidate your underlying world model, but I think in this case little of that happened.

Comment by habryka (habryka4) on The $100B plan with "70% risk of killing us all" w Stephen Fry [video] · 2024-07-21T20:23:15.793Z · LW · GW

Huh, the transcript had surprisingly few straightforwardly wrong things compared to what I am used to for videos like this, and it got the basics of the situation reasonably accurate. 

The one straightforwardly false quote I did catch was that it propagated the misunderstanding that OpenAI went back on some kind of promise to not work with militaries. As I've said in some other comments, OpenAI did prevent military users from using their API for a while, and then started allowing them to do that, but there was no promise or pledge attached to this; it was just a standard change in their terms of service.

Comment by habryka (habryka4) on Towards more cooperative AI safety strategies · 2024-07-21T20:10:32.477Z · LW · GW

Sure, sent a DM.

Comment by habryka (habryka4) on Towards more cooperative AI safety strategies · 2024-07-21T05:07:48.981Z · LW · GW

I mean, I also think there is continuity between the beliefs I held in my high-school essays and my present beliefs, but it's also enough time and distance that if you straightforwardly attribute claims to me that I made in my high-school essays, claims that I have explicitly disavowed and told you I do not believe, I will be very annoyed with you and will model you as not actually trying to understand what I believe.

Comment by habryka (habryka4) on Towards more cooperative AI safety strategies · 2024-07-21T02:00:18.583Z · LW · GW

Some things that feel incongruent with this: 

  • Eliezer talks a lot in the Arbital article on CEV about how useful it is to have a visibly neutral alignment target
  • Right now Eliezer is pursuing a strategy which does not meaningfully empower him at all (just halting AGI progress)
  • Eliezer complains a lot about various people using AI alignment as a guise for mostly just achieving their personal objectives (in-particular the standard AI censorship stuff being thrown into the same bucket)
  • Lots of conversations I've had with MIRI employees

I would be happy to take bets here about what people would say. 

Comment by habryka (habryka4) on An AI Race With China Can Be Better Than Not Racing · 2024-07-21T01:47:54.120Z · LW · GW

Huh, I do think the "correct" game theory is not sensitive in these respects (indeed, all LDTs cooperate in a 1-shot mirrored prisoner's dilemma). I agree that of course you want to be sensitive to some things, but the kind of sensitivity here seems silly.

Comment by habryka (habryka4) on An AI Race With China Can Be Better Than Not Racing · 2024-07-20T21:50:28.246Z · LW · GW

Yep, it's definitely possible to get cooperation in a pure CDT-frame, but it IMO is also clearly silly how sensitive the cooperative equilibrium is to things like this (and also doesn't track how I think basically any real-world decision-making happens).

Comment by habryka (habryka4) on Towards more cooperative AI safety strategies · 2024-07-20T20:08:14.518Z · LW · GW

I think they talked explicitly about planning to deploy the AI themselves back in the early days (2004-ish), then gradually transitioned to talking generally about what someone with a powerful AI could do.

I agree that very old MIRI (explicitly disavowed by present MIRI and mostly modeled as "one guy in a basement somewhere") looked a bit more like this, but I think making inferences from that to modern MIRI is about as confused as making inferences from people's high-school essays about what they will do when they become president. I don't think it has zero value in forecasting the future, but going and reading someone's high-school political science essay, and inferring they would endorse that position in the modern day, is extremely dubious.

My model of them would definitely think very hard about the signaling and coordination problems that come with people trying to build an AGI themselves, and then act on those. I think Eliezer's worldview here would totally output actions that include very legible precommitments about what the AI system would be used for, and would absolutely definitely not include the ability of whoever builds AGI to just take over the world with it. Eliezer has written a lot about this stuff and clearly takes considerations like that extremely seriously.

Comment by habryka (habryka4) on An AI Race With China Can Be Better Than Not Racing · 2024-07-20T20:01:59.839Z · LW · GW

Sure, you can think about this stuff in a CDT framework (especially over iterated games), though it is really quite hard. Remember, the default outcome in an n-round prisoner's dilemma in CDT is still constant defection, because you just argue inductively that you will definitely be defected on in the last round. So it being single shot isn't necessary.
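
As a toy rendering of that induction (hypothetical helper name and round count; just the verbal argument transcribed, not a full game-theoretic model):

    # Backward induction under CDT-style reasoning: in the final round
    # defection strictly dominates (there is no future to influence), and
    # given that, every earlier round is effectively one-shot as well.
    def cdt_move(round_index, total_rounds):
        if round_index == total_rounds:
            return "defect"  # last round: defection dominates
        # By induction, play from the next round onward is already "defect",
        # so cooperating now buys no future cooperation.
        assert cdt_move(round_index + 1, total_rounds) == "defect"
        return "defect"

    print([cdt_move(i, 10) for i in range(1, 11)])  # constant defection in all 10 rounds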

Of course, the whole problem with TDT-ish arguments is that we have very little principled foundation for how to reason when two actors are quite imperfect decision-theoretic copies of each other (like the U.S. and China almost definitely are). This makes technical analysis of the domains where the effects from this kind of stuff are large quite difficult.

Comment by habryka (habryka4) on Will quantum randomness affect the 2028 election? · 2024-07-19T22:49:40.965Z · LW · GW

(Also, this question is about 2028, so it's not particularly clear to me what effect even a successful assassination would have had on the 2028 election)

Comment by habryka (habryka4) on Towards more cooperative AI safety strategies · 2024-07-19T21:23:10.485Z · LW · GW

I mean, it really matters whether you are suggesting that someone else take that action or whether you are planning to take that action yourself. Asking the U.S. government to use AI to prevent anyone from building more powerful and more dangerous AI is not in any way a power-grabbing action, because it does not in any meaningful way make you more powerful (like, yes, you are part of the U.S. so I guess you end up with a bit more power as the U.S. ends up with more power, but that effect is pretty negligible). Even asking random AI capability companies to do that is also not a power-grabbing action, because you yourself do not end up in charge of those companies as part of that.

Yes, unilaterally deploying such a system yourself would be, but I have no idea what people are referring to when they say that MIRI was planning on doing that (maybe they were, but all I've seen them do is to openly discuss plans about what ideally someone with access to a frontier model should do in a way that really did not sound like it would end up with MIRI meaningfully in charge).

Comment by habryka (habryka4) on Towards more cooperative AI safety strategies · 2024-07-19T18:43:18.013Z · LW · GW

I don't super buy this. I don't think MIRI was trying to accumulate a lot of power. In my model of the world they were trying to design a blueprint for some institution or project that would mostly have highly conditional power, that they would personally not wield. 

In the metaphor of classical governance, I think what MIRI was doing was much more "design a blueprint for a governance agency" not "put themselves in charge of a governance agency". Designing a blueprint is not a particularly power-seeking move, especially if you expect other people to implement it.

Comment by habryka (habryka4) on AI #73: Openly Evil AI · 2024-07-19T02:49:50.046Z · LW · GW

Pliny the Prompter: gg

This was a troll: https://x.com/elder_plinius/status/1813183970298757285 

Comment by habryka (habryka4) on Towards more cooperative AI safety strategies · 2024-07-19T02:21:23.035Z · LW · GW

I would also be interested in more of your thoughts on this

I have a memo I thought I had shared with you at one point that I wrote for EA Coordination Forum 2023. It has a bunch of wrong stuff in it, and fixing it has been too difficult, but I could share it with you privately (with disclaimers on what is wrong). Feel free to DM me if I haven't. 

@habryka are you able to share details/examples RE the actions you've taken to get the EA community to shut down or disappear?

Sharing my memo at the coordination forum is one such action I have taken. I have also advocated for various people to be fired, and have urged a number of external and internal stakeholders to reconsider their relationship with EA. Most of this has been kind of illegible and flaily, with me not really knowing how to do anything in the space without ending up with a bunch of dumb collateral damage and reciprocal escalation.

Comment by habryka (habryka4) on Towards more cooperative AI safety strategies · 2024-07-18T19:52:46.187Z · LW · GW

This is the thing that feels most like talking past each other. You're treating this as a binary and it's really, really not a binary. Some examples:

Yeah, I think this makes sense. I wasn't particularly trying to treat it as just a binary, and I agree that there are levels of abstraction where it makes sense to model these things as one, and this also applies to the whole extended AI-Alignment/EA/Rationality ecosystem. 

I do feel like this lens loses a lot of its validity at the highest levels of abstraction (like, I think there is a valid sense in which you should model AI x-risk concerned people as part of big-tech, but also, if you do that, you kind of ignore the central dynamic that is going on with the x-risk concerned people, and maybe that's the right call sometimes, but I think in terms of "what will the future of humanity be" in making that simplification you have kind of lost the plot)

If I'm wrong about this, I'd love to know.

My best guess is you are underestimating the level of adversarialness going on, though I am also uncertain about this. I would be interested in sharing notes some time. 

As one concrete example, my guess is we both agree it would not make sense to model OpenAI as part of the same power base. Like, yeah, a bunch of EAs used to be on OpenAI's board, but even during that period, they didn't have much influence on OpenAI. I think basically all throughout it made most sense to model these as separate communities/institutions/groups with regards to power-seeking.

I also personally do straightforwardly think that most of the efforts of the extended EA-Alignment ecosystem are bad, and would give up a large chunk of my resources to reduce their influence on the world. Not because I am in a competition between them (indeed, I think I do tend to get more power as they get more power), but because I think they genuinely have really bad consequences for the world. I also care a lot about cooperativeness, and so I don't tend to go around getting into conflicts with lots of collateral damage or reciprocal escalation, but also, I have definitely taken actions within the bounds of what seems reasonable that have aimed at getting the EA community to shut down or disappear (and will probably continue to do so).

Comment by habryka (habryka4) on Towards more cooperative AI safety strategies · 2024-07-18T19:10:13.675Z · LW · GW

I think you're making an important point here, and I agree that given the moral valence people here will be quite tempted to gerrymander themselves out of the relevant categories (also, pretending to be the underdog, or participating in bravery debates, is an extremely common pattern in conversations like this).

I do agree that a few years ago things would have been better modeled as a shared power base, but I think a lot of this has genuinely changed post-FTX.

I also think there are really crucial differences in how much different sub-parts of this ecosystem are structurally-power-seeking, and that those are important to model (and also importantly that some of the structural power-seeking-ness of some of these parts puts those parts into conflict with the others, in as much as they are not participating in the same power-seeking strategies).

Like, the way I have conceptualized most of my life's work so far has been to try to build neutral non-power-seeking institutions, that inform other people and help them make better decisions, and that generally try to actively avoid plans that route through "me and my friends get powerful and then solve our problems" because I think this kind of plan will almost inevitably end up just running into conflict with other power-seeking entities and then spend most of its resources on that.

And I think there are thousands of others who have similar intuitions about how to relate to the world, within the broader AI-Alignment/Rationality ecosystem, and I think those parts are genuinely not structurally power-seeking in the same way. And I agree they are all very enmeshed with parts that are power-seeking, and this makes distinguishing them harder, but I think there are really quite important differences. 

I don't actually know how much we disagree. I do think that modeling the AI Safety space as a single power-base is wrong and not really carving reality along structural lines. Like, I don't think the situation is "look, we often argue theological disagreements", I think the situation is often much more "these two things that care about safety are actively in-conflict with each other and are taking active steps to eradicate the other party" and at that point I just really don't think it makes sense to model these as one.

Comment by habryka (habryka4) on Towards more cooperative AI safety strategies · 2024-07-18T06:50:51.942Z · LW · GW

I don't know, I think I'll defend that Lightcone is genuinely not very structurally power-seeking, and neither is MIRI, and also that both of these organizations are not meaningfully part of some kind of shared power-base with most of the EA AI Alignment community in Berkeley (Lightcone is banned from receiving any kind of OpenPhil funding, for example). 

I think you would at least have to argue that there are two separate power-seeking institutions here, each seeking power for themselves, but I also do genuinely think that Lightcone is not a very structurally power-seeking organization (I feel a bit more confused about MIRI, though would overall still defend that).

Comment by habryka (habryka4) on Baking vs Patissing vs Cooking, the HPS explanation · 2024-07-18T02:42:57.490Z · LW · GW

Hmm, I am not fully sure what people meant in context, but I would guess it referred to the "margin of error" thing. Like, people have told me many times that they have much more frequently produced approximately inedible batches of baked goods than they have produced inedible batches of other food.

Comment by habryka (habryka4) on Baking vs Patissing vs Cooking, the HPS explanation · 2024-07-17T22:00:52.599Z · LW · GW

Also—baking cakes, less reliable than cooking? Well, all I can say is that this is pretty much the opposite of my experience…

I am not competent at either baking or cooking, but I've definitely heard this 10+ times over the years (and never the opposite).

Comment by habryka (habryka4) on Towards more cooperative AI safety strategies · 2024-07-17T21:33:58.674Z · LW · GW

I think this is a better pointer, but I think "Bay Area alignment community" is still a bit too broad. I think e.g. Lightcone and MIRI are very separate from Constellation and Open Phil and it doesn't make sense to put them into the same bucket. 

Comment by habryka (habryka4) on Algon's Shortform · 2024-07-17T16:45:04.035Z · LW · GW

Yeah, I think the hide author feature should replace everyone with single letters or something, or give you the option to do that. If someone wants to make a PR with that, that would be welcome; we might also get around to it otherwise at some point (but it might take a while).

Comment by habryka (habryka4) on Superbabies: Putting The Pieces Together · 2024-07-17T02:28:15.296Z · LW · GW

Promoted to curated: I think this topic is quite important and there has been very little writing that helps people get an overview of what is happening in the space, especially with some of the recent developments seeming quite substantial, and I think being surprising to many people who have been forecasting much longer genetic engineering timelines when I've talked to them over the last few years. 

I don't think this post is the perfect overview. It's more like a fine starting point and intro, and I think there is space for a more comprehensive overview, and I would curate that as well, but it's the best I know of right now.

Thanks a lot for writing this! 

(And for people who are interested in this topic I also recommend Sarah's latest post on the broader topic of multiplex gene editing)

Comment by habryka (habryka4) on Alexander Gietelink Oldenziel's Shortform · 2024-07-16T23:36:57.038Z · LW · GW

Oh, I thought this was relatively straightforward and has been discussed a bunch. There are two lines of argument I know for why superintelligent AI, even if unaligned, might not literally kill everyone, but keep some humans alive: 

  1. The AI might care a tiny bit about our values, even if it mostly doesn't share them
  2. The AI might want to coordinate with other AI systems that reached superintelligence to jointly optimize the universe. So in a world where there is only a 1% chance that we align AI systems to our values, even in unaligned worlds we might end up with AI systems that adopt our values as a 1% mixture in their utility function (and also, consequently, in those 1% of worlds where we succeed, we might still want to trade away 99% of the universe to the values that the counterfactual AI systems would have had)

Some places where the second line of argument has been discussed: 

  1. This is due to:

    • The potential for the AI to be at least a tiny bit "kind" (same as humans probably wouldn't kill all aliens). [1]
    • Decision theory/trade reasons

  2. Note that in this comment I’m not touching on acausal trade (with successful humans) or ECL. I think those are very relevant to whether AI systems kill everyone, but are less related to this implicit claim about kindness which comes across in your parables (since acausally trading AIs are basically analogous to the ants who don't kill us because we have power).

Comment by habryka (habryka4) on Alexander Gietelink Oldenziel's Shortform · 2024-07-16T20:30:32.186Z · LW · GW

As one relevant consideration, I think the topic of "will AI kill all humans" is a question whose answer relies in substantial part on TDT-ish considerations, and is something that a bunch of value systems I think reasonably care a lot about. Also I think what superintelligent systems will do will depend a lot on decision-theoretic considerations that seem very hard to answer from a CDT vs. EDT-ish frame.

Comment by habryka (habryka4) on Will quantum randomness affect the 2028 election? · 2024-07-14T17:41:41.746Z · LW · GW

Eh, I don't buy it, or like, I think it's just restating the underlying question. My best guess is wind direction is pretty strongly overdetermined (like, even on just the extremely dumb first order approximation you can often get to 95%+ confidence about wind direction, because places tend to have pretty consistent wind patterns).

But even granting that, it's still not settled, because there might be other reasons that would have overdetermined the outcome of the election. For example, it might be overdetermined that Trump dies before the election due to old age. We don't know that, but an omniscient observer probably would. To settle this, it's not enough to find one event that seemingly affects the result from the perspective of our present uncertainty; you need to confirm that the effects of that event on the variable you are measuring were not screened off via any other pathway.

And granting even that, while the question here was ambiguously phrased, the relevant variable measurement here was "which party will win the election" not "which president will win the election", so it's not particularly relevant. 

That said, it's still an interesting case of small variations having large effects.

Comment by habryka (habryka4) on A simple case for extreme inner misalignment · 2024-07-14T04:11:39.815Z · LW · GW

Yeah, I am not super happy with the UI for inline reacts in posts, both for reading and for writing them. It's been on my to-do list for a while to improve them.

Comment by habryka (habryka4) on A simple case for extreme inner misalignment · 2024-07-14T03:18:05.975Z · LW · GW

Yeah, the principled reason (though I am not like super confident of this) is that posts are almost always too big and have too many claims in them to make a single agree/disagree vote make sense. Inline reacts are the intended way for people to express agreement and disagreement on posts.

I am not super sure this is right, but I do want to avoid agreement/disagreement becoming disconnected from truth values, and I think applying them to elements that clearly don't have a single truth value weakens that connection.

Comment by habryka (habryka4) on A simple case for extreme inner misalignment · 2024-07-14T02:32:51.695Z · LW · GW

Early stage votes are pretty noisy (and I think have been getting noisier over time, which is somewhat of a proxy of polarization, which makes me sad). 

Comment by habryka (habryka4) on Alignment: "Do what I would have wanted you to do" · 2024-07-13T22:10:26.489Z · LW · GW

I think it's still relevant because it creates a rallying point around what to do after you made substantial progress aligning AGI, which helps coordination in the run up to it, but I agree that most effort should go into other approaches.

Comment by habryka (habryka4) on Alignment: "Do what I would have wanted you to do" · 2024-07-13T18:33:39.796Z · LW · GW

I think your most recent comment fails to disambiguate between "the output of the extrapolation process", which I agree will be nonempty (similarly to how my current set of values is nonempty), and "the coherent output of the extrapolation process", which I think might very well be empty, and in any case will mostly likely be very small in size (or measure) compared to the first one.

Hmm, this paragraph maybe points to some linguistic disagreement we have (and my guess is causing confusion in other cases). 

I feel like you are treating "coherent" as a binary, when I am treating it more as a sliding scale. Like, I think various embedded agency issues prevent an agent from being fully coherent (for example, a classical bayesian formulation of coherence requires logical omniscience and is computationally impossible), but it's also clearly the case that when I notice a dutch-book against me and update in a way that avoids future dutch-bookings, I have in a meaningful way (but not in a way I could formalize) become more coherent. 

So what I am saying is something like "CEV will overall increase the degree of coherence of your values". I totally agree that it will not get you all the way (whatever that means), and I also think we don't have a formalization of coherence that we can talk about fully formally (though I think we have some formal tools that are useful). 

This gives me some probability we don't disagree that much, but my sense is you are throwing out the baby with the bathwater in your response to Roger, and that that points to a real disagreement. 

Like, yes, I think for many overdetermined reasons we will not get something that looks like a utility function out of CEV (because computable utility functions over world-histories aren't even a thing that can meaningfully exist in the real world). But it seems to me like something like "The Great Reflection" would be extremely valuable and should absolutely be the kind of thing we aim for with an AI, since I sure have updated a lot on what I reflectively endorse, and would like to go down further that path by learning more true things and getting smarter in ways that don't break me.

Comment by habryka (habryka4) on Alignment: "Do what I would have wanted you to do" · 2024-07-13T18:14:07.193Z · LW · GW

I think you misunderstood what I meant by "collapse to nothingness". I wasn't referring to you collapsing into nothingness under CEV. I meant your logical argument outputting a contradiction (where the contradiction would be that you prefer to have no preferences right now).

The thing I am saying is that I am pretty confident you don't have meta preferences that when propagated will cause you to stop wanting things, because like, I think it's just really obvious to both of us that wanting things is good. So in as much as that is a preference, you'll take it into account in a reasonable CEV set up.

We clearly both agree that there are ways to scale you up that are better or worse by your values. CEV is the process of doing our best to choose the better ways. We probably won't find the very best way, but there are clearly ways through reflection space that are better than others and that we endorse more going down.

You might stop earlier than I do, or might end up in a different place, but that doesn't change the validity of the process that much, and clearly doesn't result in you suddenly having no wants or preferences anymore (because why would you want that, and if you are worried about that, you can just make a hard commitment at the beginning to never change in ways that cause that).

And yeah, maybe some reflection process will cause us to realize that actually everything is meaningless in a way that I would genuinely endorse. That seems fine but it isn't something I need to weigh from my current vantage point. If it's true, nothing I do matters anyways, but also it honestly seems very unlikely because I just have a lot of things I care about and I don't see any good arguments that would cause me to stop caring about them.

Comment by habryka (habryka4) on Alignment: "Do what I would have wanted you to do" · 2024-07-13T17:56:48.721Z · LW · GW

It's not a gotcha, I just really genuinely don't get how the model you are explaining doesn't just collapse into nothingness.

Like, you currently clearly think that some of your preferences are more stable under reflection. And you have guesses and preferences over the type of reflection that makes your preferences better by your own lights. So seems like you want to apply one to the other. Doing that intellectual labor is the core of CEV.

If you really have no meta level preferences (though I have no idea what that would mean since it's part of everyday life to balance and decide between conflicting desires) then CEV outputs something at least as coherent as you are right now, which is plenty coherent given that you probably acquire resources and have goals. My guess is you can do a bunch better. But I don't see any way for CEV to collapse into nothingness. It seems like it has to output something at least as coherent as you are now.

So when you say "there is no coherence" that just seems blatantly contradicted by you standing before me and having coherent preferences, and not wanting to collapse into a puddle of incoherence.

Comment by habryka (habryka4) on Alignment: "Do what I would have wanted you to do" · 2024-07-13T17:26:53.853Z · LW · GW

I mean, send me all the money in your bank account right now then. You seem to claim you have no coherent preferences or are incapable of telling which of your preferences are ones you endorse, so seems like you wouldn't mind.

(Or insert any of the other standard reductio arguments here. You clearly care about some stuff. In as much as you do, you have a speck of coherence in you. If you don't, I don't know how to help you in any way, and it seems like we don't have any trade opportunities, and like, maybe I should just take your stuff because you don't seem to mind)

Comment by habryka (habryka4) on Alignment: "Do what I would have wanted you to do" · 2024-07-13T16:42:09.294Z · LW · GW

Marcello, commenting 16 years ago during Eliezer's Metaethics Sequence, pointed out that there is no particular reason to expect extrapolation to be coherent at all, because of butterfly effects and the manner in which "the mood you were in upon hearing them and so forth could influence which of the arguments you came to trust." (By the way, Marcello's objection, as far as I know, has never been convincingly or even concretely addressed by CEV-proponents. Even the official CEV document and all the writing that came after it mentioned its concerns in a few paragraphs and just... moved on without any specific explanations)

I don't understand the writing in italics here. Eliezer and others responded pretty straightforwardly: 

It seems to me that if you build a Friendly AI, you ought to build it to act where coherence exists and not act where it doesn't.

Or orthonormal says more precisely: 

We can consider a reference class of CEV-seeking procedures; one (massively-underspecified, but that's not the point) example is "emulate 1000 copies of Paul Christiano living together comfortably and immortally and discussing what the AI should do with the physical universe; once there's a large supermajority in favor of an enactable plan (which can include further such delegated decisions), the AI does that".

I agree that this is going to be chaotic, in the sense that even slightly different elements of this reference class might end up steering the AI to different basins of attraction.

I assert, however, that I'd consider it a pretty good outcome overall if the future of the world were determined by a genuinely random draw from this reference class, honestly instantiated. (Again with the massive underspecification, I know.)

CEV may be underdetermined and many-valued, but that doesn't mean paperclipping is as good an answer as any.

Comment by habryka (habryka4) on Alignment: "Do what I would have wanted you to do" · 2024-07-13T16:37:21.408Z · LW · GW

Maybe I am missing some part of this discussion, but I don't get the last paragraph. It's clear there are a lot of issues with CEV, but I also have no idea what the alternative to something like CEV as a point of comparison is supposed to be.  In as much as I am a godshatter of wants, and I want to think about my preferences, I need to somehow come to a conclusion about how to choose between different features, and the basic shape of CEV feels like the obvious (and approximately only) option that I see in front of me. 

I agree there is no "canonical" way to scale me up, but that doesn't really change the need for some kind of answer to the question of "what kind of future do I want and how good could it be?".

How does "instruction-following AI" have anything to do with this? Like, OK, now you have an AI that in some sense follows your instructions. What are you going to do with it? 

My best guess is you are going to do something CEV like, where you figure out what you want, and you have it help you reflect on your preferences and then somehow empower you to realize more of them. Ideally it would fully internalize that process so it doesn't need to rely on your slow biological brain and weak body, though of course you want to be very careful with that since changes to values under reflection seem very sensitive to small changes in initial conditions.

There also seems to me to be relatively broad consensus on LW that you should not aim for CEV as a first thing to do with an AGI. It's a thing you will do eventually, and aiming for it early does indeed seem doomed, but like, that's not really what the concept or process is about. It's about setting a target for what you want to eventually allow AI systems to help you with.

The Arbital article is also very clear about this: 

CEV is meant to be the literally optimal or ideal or normative thing to do with an autonomous superintelligence, if you trust your ability to perfectly align a superintelligence on a very complicated target. (See below.)

CEV is rather complicated and meta and hence not intended as something you’d do with the first AI you ever tried to build. CEV might be something that everyone inside a project agreed was an acceptable mutual target for their second AI. (The first AI should probably be a Task AGI.)

Comment by habryka (habryka4) on AI #72: Denying the Future · 2024-07-12T19:40:29.047Z · LW · GW

It's also empty on the original blog, so it's at least not an import error.

Comment by habryka (habryka4) on Thoughts to niplav on lie-detection, truthfwl mechanisms, and wealth-inequality · 2024-07-12T00:56:09.487Z · LW · GW

Hmm, this sure is a kind of weird edge-case of the coauthor system for dialogues. 

I do think there really should be some indication that niplav hasn't actually responded here, but also, I don't want to just remove them and via that remove their ability to respond in the dialogue. I'll leave it as is for now and hope this comment is enough to clear things up, but if people are confused I might temporarily remove niplav.

Comment by habryka (habryka4) on plex's Shortform · 2024-07-09T18:06:39.739Z · LW · GW

These datapoints just feel like the result of random fluctuations. Both Writer and Eliezer mostly drove people to participate on the LK-99 stuff, where lots of people were confidently wrong. In general you can see that basically all the top referrers have negative income: 

Among the top 10, Eliezer and Writer are somewhat better than the average (and yaboi is a huge outlier, which I'd guess is explained by them doing something quite different from the other people). 

Comment by habryka (habryka4) on Habryka's Shortform Feed · 2024-07-09T17:01:28.988Z · LW · GW

I really think the above was meant to imply that the non-disparagement agreements were merely unclear on whether they were covered by a non-disclosure clause (and I would be happy to take bets on how a randomly selected reader would interpret it).

My best guess is Sam was genuinely confused about this and that there are non-disparagement agreements with Anthropic that clearly are not covered by such clauses.

Comment by habryka (habryka4) on Reflections on Less Online · 2024-07-09T02:24:30.462Z · LW · GW

I attended two LWCW weekends all the way back in 2013 and 2014!

Despite that, I don't actually think they were that big of an inspiration for at least my input into LessOnline. Other conferences and events that Lightcone organized were bigger influences, in particular two private events we ran in 2021 and 2022 (Sanity and Survival Summit and Palmcone).