Morality is Scary 2021-12-02T06:35:06.736Z
(USA) N95 masks are available on Amazon 2021-01-18T10:37:40.296Z
Anti-EMH Evidence (and a plea for help) 2020-12-05T18:29:31.772Z
A tale from Communist China 2020-10-18T17:37:42.228Z
Everything I Know About Elite America I Learned From ‘Fresh Prince’ and ‘West Wing’ 2020-10-11T18:07:52.623Z
Tips/tricks/notes on optimizing investments 2020-05-06T23:21:53.153Z
Have epistemic conditions always been this bad? 2020-01-25T04:42:52.190Z
Against Premature Abstraction of Political Issues 2019-12-18T20:19:53.909Z
What determines the balance between intelligence signaling and virtue signaling? 2019-12-09T00:11:37.662Z
Ways that China is surpassing the US 2019-11-04T09:45:53.881Z
List of resolved confusions about IDA 2019-09-30T20:03:10.506Z
Don't depend on others to ask for explanations 2019-09-18T19:12:56.145Z
Counterfactual Oracles = online supervised learning with random selection of training episodes 2019-09-10T08:29:08.143Z
AI Safety "Success Stories" 2019-09-07T02:54:15.003Z
Six AI Risk/Strategy Ideas 2019-08-27T00:40:38.672Z
Problems in AI Alignment that philosophers could potentially contribute to 2019-08-17T17:38:31.757Z
Forum participation as a research strategy 2019-07-30T18:09:48.524Z
On the purposes of decision theory research 2019-07-25T07:18:06.552Z
AGI will drastically increase economies of scale 2019-06-07T23:17:38.694Z
How to find a lost phone with dead battery, using Google Location History Takeout 2019-05-30T04:56:28.666Z
Where are people thinking and talking about global coordination for AI safety? 2019-05-22T06:24:02.425Z
"UDT2" and "against UD+ASSA" 2019-05-12T04:18:37.158Z
Disincentives for participating on LW/AF 2019-05-10T19:46:36.010Z
Strategic implications of AIs' ability to coordinate at low cost, for example by merging 2019-04-25T05:08:21.736Z
Please use real names, especially for Alignment Forum? 2019-03-29T02:54:20.812Z
The Main Sources of AI Risk? 2019-03-21T18:28:33.068Z
What's wrong with these analogies for understanding Informed Oversight and IDA? 2019-03-20T09:11:33.613Z
Three ways that "Sufficiently optimized agents appear coherent" can be false 2019-03-05T21:52:35.462Z
Why didn't Agoric Computing become popular? 2019-02-16T06:19:56.121Z
Some disjunctive reasons for urgency on AI risk 2019-02-15T20:43:17.340Z
Some Thoughts on Metaphilosophy 2019-02-10T00:28:29.482Z
The Argument from Philosophical Difficulty 2019-02-10T00:28:07.472Z
Why is so much discussion happening in private Google Docs? 2019-01-12T02:19:19.332Z
Two More Decision Theory Problems for Humans 2019-01-04T09:00:33.436Z
Two Neglected Problems in Human-AI Safety 2018-12-16T22:13:29.196Z
Three AI Safety Related Ideas 2018-12-13T21:32:25.415Z
Counterintuitive Comparative Advantage 2018-11-28T20:33:30.023Z
A general model of safety-oriented AI development 2018-06-11T21:00:02.670Z
Beyond Astronomical Waste 2018-06-07T21:04:44.630Z
Can corrigibility be learned safely? 2018-04-01T23:07:46.625Z
Multiplicity of "enlightenment" states and contemplative practices 2018-03-12T08:15:48.709Z
Online discussion is better than pre-publication peer review 2017-09-05T13:25:15.331Z
Examples of Superintelligence Risk (by Jeff Kaufman) 2017-07-15T16:03:58.336Z
Combining Prediction Technologies to Help Moderate Discussions 2016-12-08T00:19:35.854Z
[link] Baidu cheats in an AI contest in order to gain a 0.24% advantage 2015-06-06T06:39:44.990Z
Is the potential astronomical waste in our universe too small to care about? 2014-10-21T08:44:12.897Z
What is the difference between rationality and intelligence? 2014-08-13T11:19:53.062Z
Six Plausible Meta-Ethical Alternatives 2014-08-06T00:04:14.485Z
Look for the Next Tech Gold Rush? 2014-07-19T10:08:53.127Z
Outside View(s) and MIRI's FAI Endgame 2013-08-28T23:27:23.372Z


Comment by Wei_Dai on Why do we need a NEW philosophy of progress? · 2022-01-26T05:10:46.947Z · LW · GW

The failures of communism must also have soured a lot of people on "progress", given that it fit really well into the old philosophy of progress and then turned out really badly. (See this related comment.)

How can we make moral and social progress at least as fast as we make scientific, technological and industrial progress? How do we prevent our capabilities from outrunning our wisdom?

This seems to be the key to everything else, but it may just be impossible. It seems pretty likely that moral and social progress are inherently harder problems, given that you can't run controlled experiments or get fast feedback from reality (as you can when making scientific, technological and industrial progress).

Comment by Wei_Dai on Lives of the Cambridge polymath geniuses · 2022-01-26T04:50:43.607Z · LW · GW

Jason Crawford's recent post on 19th-century philosophy of progress seems relevant. Some quotes from it:

  • deep belief in the power of human reason
  • had forecast progress in morality and society just as much as in science, technology and industry
  • progress was inevitable
  • the conviction that “the Idea or the Dialectic or Natural Law, functioning through the conscious purposes or the unconscious activities of men, could be counted on to safeguard mankind against future hazards”

From this it doesn't seem surprising that smart people would have initially seen something like communism as the next step in the inevitable moral and social progress of humanity, powered by reason. Combine this high prior that communism would be good with lack of strong evidence of communism's problems (it probably looked pretty good from the outside, and any unfavorable info that did leak out, you couldn't be sure wasn't anti-communist propaganda)... and you almost don't need to invoke human irrationality to explain them being enamored with communism.

Maybe a more distal cause is that the Enlightenment was too successful, in that the values it settled upon through "reason", like freedom and democracy, turned out to work pretty well (relative to the old norms), which made people trust reason and progress too much, when in retrospect, the Enlightenment philosophers seem to have just gotten lucky. (Or maybe there are some deeper explanations than "luck" for why they were right, but it sure doesn't seem to have much to do with their explicit reasoning.)

Comment by Wei_Dai on The ignorance of normative realism bot · 2022-01-18T21:58:43.104Z · LW · GW

We can say similar stuff about other a priori domains like modality, logic, and philosophy as a whole. [...] Whether there are, ultimately, important differences here is a question beyond the scope of this post (I, personally, expect at least some).

I would be interested in your views on metaphilosophy and how it relates to your metaethics.

If we restrict our attention to the subset of philosophy we call metaethics, then it seems to me that meta-metaethical realism is pretty likely (i.e., there are metanormative facts, or facts about the nature of normativity/morality) and therefore metaethical realism is at least pretty plausible. In other words, perhaps there are normative facts in the same way that there are metanormative facts, even though I don't understand the nature of these facts, e.g., whether they're "non-naturalist" or "interventionist". I think this line of thinking provides a major source of support for moral realism within my metaethical uncertainty, so I'm curious if you have any arguments against it.

Comment by Wei_Dai on General alignment plus human values, or alignment via human values? · 2022-01-16T02:21:19.061Z · LW · GW

In contrast, something like a threat doesn’t count, because you know that the outcome if the threat is executed is not something you want; the problem comes because you don’t know how to act in a way that both disincentivizes threats and also doesn’t lead to (too many) threats being enforced. In particular, the problem is not that you don’t know which outcomes are bad.

I see, but I think at least part of the problem with threats is that I'm not sure what I care about, which greatly increases my "attack surface". For example, if I knew that negative utilitarianism is definitely wrong, then threats to torture some large number of simulated people wouldn't be effective on me (e.g., under total utilitarianism, I could use the resources demanded by the attacker to create more than enough happy people to counterbalance whatever they threaten to do).

Alignment is 100x more likely to be an existentially risky problem at all (think of this as the ratio between probabilities of existential catastrophe by the given problem assuming no intervention from longtermists).

This seems really extreme, if I'm not misunderstanding you. (My own number is like 1x-5x.) Assuming your intent alignment risk is 10%, your AI persuasion risk is only 1/1000?

Putting on my “what would I do” hat, I’m imagining that the AI doesn’t know that it was specifically optimized to be persuasive, but it does know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments.

Given that humans are liable to be persuaded by bad counterarguments too, I'd be concerned that the AI will always "know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments." Since it's not safe to actually look at the counterarguments found by your own AI, it's not really helping at all. (Or it makes things worse if the user isn't very cautious and does look at their AI's counterarguments and gets persuaded by them.)

I totally expect them to ask AI for help with such games. I don’t expect (most of) them to lock in their values such that they can’t change their mind.

I think most people don't think very long term and aren't very rational. They'll see some people within their group do AI-enabled value lock-in, get a lot of status reward for it, and emulate that behavior in order to not fall behind and become low status within the group. (This might be a gradual process resembling "purity spirals" of the past, i.e., people ask AI to do more and more things that have the effect of locking in their values, or a sudden wave of explicit value lock-ins.)

I expect AIs will be able to do the sort of philosophical reasoning that we do, and the question of whether we should care about simulations seems way way easier than the question about which simulations of me are being run, by whom, and what they want.

This seems plausible to me, but I don't see how one can have enough confidence in this view that one isn't very worried about the opposite being true and constituting a significant x-risk.

Comment by Wei_Dai on Zvi’s Thoughts on the Survival and Flourishing Fund (SFF) · 2022-01-16T00:39:15.867Z · LW · GW

some types of bad (or bad on some people’s preferences) outcomes from markets can be thought of as missing components of the objective function that those markets are systematically optimizing for.

This framing doesn't make a lot of sense to me. From my perspective, markets are unlike AI in that there isn't a place in a market's "source code" where you can set or change an objective function. A market is just a group of people, each pursuing their own interests, conducting individual voluntary trades. Bad outcomes of markets come not from wrong objective functions given by some designers, but are instead caused by game theoretic dynamics that make it difficult or impossible for a group of people pursuing their own interests to achieve Pareto efficiency. (See The Second Best for some pointers in this direction.)

Can you try to explain your perspective to someone like me, or point me to any existing writings on this?

To us it seems very likely that both kinds of bad outcomes occur at some rate, and the goal of the AI Objectives Institute is to reduce rates of both market and regulatory failures.

There is a big literature in economics on both market and government/regulatory failures. How familiar are you with it, and how does your approach compare with the academic mainstream on these topics?

Comment by Wei_Dai on General alignment plus human values, or alignment via human values? · 2022-01-08T22:36:23.243Z · LW · GW

To be clear, my original claim was for hypothetical scenarios where the failure occurs because the AI didn’t know human values, rather than cases where the AI knows what the human would want but still a failure occurs.

I'm not sure I understand the distinction that you're drawing here. (It seems like my scenarios could also be interpreted as failures where AI don't know enough human values, or maybe where humans themselves don't know enough human values.) What are some examples of what your claim was about?

I do still think they are not as important as intent alignment.

As in, the total expected value lost through such scenarios isn't as large as the expected value lost through the risk of failing to solve intent alignment? Can you give some ballpark figures of how you see each side of this inequality?

Mostly I’d hope that AI can tell what philosophy is optimized for persuasion

How? How would you train an AI to distinguish between philosophy optimized for persuasion, and correct or well-intentioned philosophy that just happens to be very persuasive?

or at least is capable of presenting counterarguments persuasively as well.

You mean every time you hear a philosophical argument, you ask your AI to produce some counterarguments optimized for persuasion? If so, won't your friends be afraid to send you any arguments they think of, for fear of your AI superhumanly persuading you to the opposite conclusion?

And I don’t expect a large number of people to explicitly try to lock in their values.

A lot of people are playing status games where faith/loyalty to their cause/ideology gains them a lot of status points. Why wouldn't they ask their AI for help with this? Or do you imagine them asking for something like "more faith", but AIs understand human values well enough to not interpret that as "lock in values"?

It seems odd to me that it’s sufficiently competent to successfully reason about simulations enough that an acausal threat can actually be made, but then not competent at reasoning about exotic philosophical cases, and I don’t particularly expect this to happen.

The former seems to just require that the AI is good at reasoning about mathematical/empirical matters (e.g., are there many simulations of me actually being run in some universe or set of universes) which I think AIs will be good at by default, whereas dealing with the threats seems to require reasoning about hard philosophical problems like decision theory and morality. For example, how much should I care about my copies in the simulations or my subjective future experience, versus the value that would be lost in the base reality if I were to give in to the simulators' demands? Should I make a counterthreat? Are there any thoughts I or my AI should avoid having, or computations we should avoid doing?

I don’t expect AIs to have clean crisp utility functions of the form “maximize paperclips” (at least initially).

I expect that AIs (or humans) who are less cautious or who think their values can be easily expressed as utility functions will do this first, thereby gaining an advantage over everyone else and maybe forcing the rest to follow.

I expect this to be way less work than the complicated plans that the AI is enacting, so it isn’t a huge competitiveness hit.

I don't think it's so much that the coordination involving humans is a lot of work, but rather that we don't know how to do it in a way that doesn't cause a lot of waste, similar to a democratically elected administration implementing a bunch of policies only to be reversed by the next administration that takes power, or lawmakers pursuing pork barrel projects that collectively make almost everyone worse off, or being unable to establish and implement easy policies (see COVID again). (You may well have something in mind that works well in the context of intent aligned AI, but I have a prior that says this class of problems is very difficult in general so I'd need to see more details before I update.)

Comment by Wei_Dai on Morality is Scary · 2022-01-08T20:42:19.614Z · LW · GW

This seems interesting and novel to me, but (of course) I'm still skeptical.

I gave the relevant example of relatively well-understood values, preference for lower x-risks.

Preference for lower x-risk doesn't seem "well-understood" to me, if we include in "x-risk" things like value drift/corruption, premature value lock-in, and other highly consequential AI-enabled decisions (potential existential mistakes) that depend on hard philosophical questions. I gave some specific examples in this recent comment. What do you think about the problems on that list? (Do you agree that they are serious problems, and if so how do you envision them being solved or prevented in your scenario?)

Comment by Wei_Dai on Selfless Dating · 2022-01-08T17:40:57.357Z · LW · GW
  • How many "first dates" did you have to go through before you found a suitable partner for selfless dating?
  • How long on average did it take for you to decide that someone wasn't a suitable partner for selfless dating and break up with them?
  • Did you have to break up with someone who would have made a fine partner for "hunting rabbit" (conventional dating/romance) just because they weren't willing/able to "hunt stag" (selfless dating)? If so, what gave you the conviction that this would be a good idea?
  • Did you or would you suggest explaining what selfless dating is and what your expectations are on your first date with someone?
  • What were some problems you encountered with selfless dating (after you found your current partner) and how did you overcome them?
  • Do you have any additional evidence/arguments that you weren't just very lucky and that selfless dating is actually +EV for your readers (or some identifiable subset of your readers)?
Comment by Wei_Dai on General alignment plus human values, or alignment via human values? · 2022-01-08T07:12:01.639Z · LW · GW
  1. Your AI should tell you that it’s worried about your friend being compromised, make sure you have an understanding of the consequences, and then go with your decision.

I think unless we make sure the AI can distinguish between "correct philosophy" or "well-intentioned philosophy" and "philosophy optimized for persuasion", each human will become either compromised (if they're not very cautious and read such messages) or isolated from the rest of humanity with regard to philosophical discussion (if they are cautious and discard such messages). This doesn't seem like an ok outcome to me. Can you explain more why you aren't worried?

  2. Seems fine. Maybe your AI warns you about the risks before helping.

I can imagine that if you subscribe to a metaethics in which a person can't be wrong about morality (i.e., some version of anti-realism), then you might think it's fine to lock in whatever values one currently thinks one ought to have. Is this your reason for "seems fine", or something else? (If so, I think nobody should be so certain about metaethics at this point.)

  3. Seems like an important threat that you (and your AI) should try to resolve.

If the AI isn't very good at dealing with "exotic philosophical cases" then it's not going to be of much help with this problem, and a lot of humans (including me) probably aren't very good at thinking about this either, so we probably end up with a lot of humans succumbing to such acausal attacks.

  4. Mostly I would hope that this situation doesn’t arise, because none of the humans can come up with utility functions in this way, and the AIs that are aligned with humans have other ways of cooperating that don’t require eliciting a utility function over universe histories.

Do you have any suggestions for this? Or some other reason to think that AIs that are aligned with different humans will find ways to cooperate (as efficiently as merging utility functions probably will be), without either a full understanding of human values or risking permanent loss of some parts of their complex values?

  5. Idk, seems pretty unclear, but I’d hope that these situations can’t come up thanks to laws that prevent people from enforcing such threats.

Agreed that's a possible good outcome, but seems far from a sure thing. Such laws would have to be more intrusive than anything people are currently used to, since attackers can create simulated suffering within the "privacy" of their own computers or minds. I suppose if such threats become a serious problem that causes a lot of damage, people might agree to trade off their privacy for security. The law might then constitute a risk in itself, as the implementation mechanism might be subverted/misused to create a form of totalitarianism.

Another issue is if there are powerful unaligned AIs or rogue states who think they can use such threats to asymmetrically gain advantage, they won't agree to such laws.

(4) can be solved through governance (laws, regulations, norms, etc)

I think COVID shows that we often can't do this even when it's relatively trivial (or can only do it with a huge time delay). For example COVID could have been solved at very low cost (relative to the actual human and economic damage it inflicted) if governments had stockpiled enough high filtration elastomeric respirators for everyone, mailed them out at the start of the pandemic, and mandated their use. (Some EAs are trying to convince some governments to do this now, in preparation for the next pandemic. I'm not sure how much success they're having.)

Comment by Wei_Dai on General alignment plus human values, or alignment via human values? · 2022-01-05T05:24:02.074Z · LW · GW

Generally with these sorts of hypotheticals, it feels to me like it either (1) isn’t likely to come up, or (2) can be solved by deferring to the human, or (3) doesn’t matter very much.

What do you think about the following examples:

  1. AI persuasion - My AI receives a message from my friend containing a novel moral argument relevant to some decision I'm about to make, but it's not sure if it's safe to show the message to me, because my friend may have been compromised by a hostile AI and is now in turn trying to compromise me.
  2. User-requested value lock-in - I ask my AI to help me become more faithful to some religion/cause/morality.
  3. Acausal attacks - I (or my AI) become concerned that I may be living in a simulation and will be punished unless I do what my simulators want.
  4. Bargaining/coordination - Many AIs are merging together for better coordination and economy of scale, for example by setting the utility function of the merged AI to a weighted average of their individual utility functions, so I have to come up with a utility function (or whatever the merger will be based on) if I want to join the bargain. If I fail to do this, I risk falling behind in subsequent military or economic competition.
  5. Threats - Someone (in the same universe) communicates to me that unless I do what they want (i.e., hand most of my resources to them), they'll create a vast amount of simulated suffering.
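To make the merger in (4) concrete, here is a minimal toy sketch. Everything in it is hypothetical and purely illustrative: real proposals would merge utility functions over universe histories, not over toy outcome dictionaries, and the agents, utilities and bargaining weights below are made up.

```python
# Toy sketch of scenario 4: two AIs merge by adopting a weighted
# average of their principals' utility functions. The utility
# functions and weights here are hypothetical.

def u_alice(outcome):
    # Alice's elicited utility: she cares about total happiness.
    return outcome["happiness"]

def u_bob(outcome):
    # Bob's elicited utility: he cares about scientific progress.
    return outcome["science"]

def merged_utility(outcome, weights=(0.6, 0.4)):
    """Utility function of the merged AI: a weighted average of the
    individual utilities, with weights set by relative bargaining power."""
    w_a, w_b = weights
    return w_a * u_alice(outcome) + w_b * u_bob(outcome)

# The merged AI then optimizes the averaged utility. Anyone unable to
# state a utility function in time simply isn't represented in it.
outcomes = [
    {"happiness": 10, "science": 0},
    {"happiness": 4, "science": 12},
]
best = max(outcomes, key=merged_utility)
```

The point of the sketch is just the structural risk: joining the bargain requires committing to some explicit utility function up front, so whatever parts of your values you couldn't formalize by then are permanently left out of the merged objective.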
Comment by Wei_Dai on Risks from AI persuasion · 2022-01-02T10:34:56.043Z · LW · GW

Differentially make progress on alignment, decreasing the difficulty gap between training a model to be persuasive versus training a model to give a correct explanation. Currently, it is much easier to scale the former (just ask labellers if they were persuaded) than the latter (you need domain experts to check that the explanation was actually correct).

AFAICT, the biggest difficulty gap is (and probably will be) in philosophy, since asking labellers whether they were persuaded is just as easy for philosophical arguments as for any other kind, but we have little idea (both compared to other areas, and in an absolute sense) what constitutes "philosophical truth" or what makes an explanation "correct" in philosophy. So I see solving these metaphilosophical problems as crucial to defending against AI persuasion. Do you agree, and if so why no mention of metaphilosophy in this otherwise fairly comprehensive post on AI persuasion?

Comment by Wei_Dai on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-19T18:09:58.554Z · LW · GW

Thanks, very helpful to understand your motivations for that section better.

In the new formulation, the goal is to define the utility in terms of the answers to questions about the future that seem like they should be easy for the AI to answer because they are a combination of (i) easy predictions about humans that it is good at, (ii) predictions about the future that any power-seeking AI should be able to answer.

Not sure about the following, but it seems the new formulation requires that the AI answer questions about humans in a future that may have very low probability according to the AI's current beliefs (i.e., the current human through a delegation chain eventually delegates to a future human existing in a possible world with low probability). The AI may well not be able to answer questions about such a future human, because it wouldn't need that ability to seek power (it only needs to make predictions about high probability futures). Or to put it another way, the future human may exist in a world with strange/unfamiliar (from the AI's perspective) features that make it hard for the AI to predict correctly.

Ideally I’d even like all of the humans involved in the process to be indistinguishable from the “real” humans, so that no human ever looks at their situation and thinks “I guess I’m one of the humans responsible for figuring out the utility function, since this isn’t the kind of world that my AI would actually bring into existence rather than merely reasoning about hypothetically.”

How do you envision extracting or eliciting from the future human H_limit an opinion about what the current human should do, given that H_limit's mind is almost certainly entirely focused on their own life and problems? One obvious way I can think of is to make a copy of H_limit, put the copy in a virtual environment, tell them about H's situation, then ask them what to do. But that seems to run into the same kind of issue, as the copy is now aware that they're not living in the real world.

Comment by Wei_Dai on Morality is Scary · 2021-12-16T04:29:55.021Z · LW · GW

Well, I linked my toy model of partiality before. Are you asking about something more concrete?

Yeah, I mean aside from how much you care about various other people, what concrete things do you want in your utopia?

Comment by Wei_Dai on 25 Min Talk on MetaEthical.AI with Questions from Stuart Armstrong · 2021-12-16T02:54:10.534Z · LW · GW

What do you see as advantages and disadvantages of this design compared to something like Paul Christiano's 2012 formalization of indirect normativity? (One thing I personally like about Paul's design is that it's more agnostic about meta-ethics, and I worry about your stronger meta-ethical assumptions, which I'm not very convinced about. See metaethical policing for my general views on this.)

How worried are you about this kind of observation? People's actual moral views seem at best very under-determined by their "fundamental norms", with their environment and specifically what status games they're embedded in playing a big role. If many people are currently embedded in games that cause them to want to freeze their morally relevant views against further change and reflection, how will your algorithm handle that?

Comment by Wei_Dai on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-16T01:26:13.218Z · LW · GW

Can you talk about the advantages or other motivations for the formulation of indirect normativity in this paper (section "Indirect normativity: defining a utility function"), compared to your 2012 formulation? (It's not clear to me what problems with that version you're trying to solve here.)

Comment by Wei_Dai on Zvi’s Thoughts on the Survival and Flourishing Fund (SFF) · 2021-12-15T23:22:48.767Z · LW · GW

In that case, perhaps copy/paste a longer description of the organization in a footnote, so the reader can figure out what the organization is trying to do, without having to look them up?

Comment by Wei_Dai on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-15T21:57:05.622Z · LW · GW

Ok, this all makes sense now. I guess when I first read that section I got the impression that you were trying to do something more ambitious. You may want to consider adding some clarification that you're not describing a scheme designed to block only manipulation while letting helpful arguments through, or that "letting helpful arguments through" would require additional ideas outside of that section.

Comment by Wei_Dai on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-15T08:26:45.253Z · LW · GW

ETA: I think I misunderstood your comment and there’s actually a more basic miscommunication. I’m imagining the counterfactual over different ads that the AI considered running, before settling on the paperclip-maximizing one (having realized that the others wouldn’t lead to me loving paperclips). I’m not imagining the counterfactual over different values that AI might have.

Oh I see. Why doesn't this symmetrically cause you to filter out good arguments for changing your values (told to you by a friend, say) as well as bad ones?

Comment by Wei_Dai on Ngo's view on alignment difficulty · 2021-12-15T05:21:55.182Z · LW · GW

One hope I have in this vein is that human genes don't contain any "metaphilosophical secret sauce" (instead all the secret sauce is in the culture) so we are able to build a competent philosopher just by doing (something like) fine-tuning GPT-n with a bunch of philosophy papers and/or human feedback. Then we use the artificial (black box) philosopher as part of an aligned AI or to help solve alignment problems.

Unfortunately, I expect that even in the scenario where this ends up working, the artificial philosophers will probably write thousands of increasingly hard-to-follow papers on each philosophical problem, exploring all the possible arguments/counterarguments before reaching some consensus among themselves. And because we won't have a white-box understanding of metaphilosophy, we will just have to hope that they learned to do philosophy the "right way", whatever that actually is.

Comment by Wei_Dai on We'll Always Have Crazy · 2021-12-15T05:11:45.315Z · LW · GW

However, it does not seem to me to be the case that the multiple-orders-of-magnitude increase in the ready availability of information has led to an overall reduction in the frequency of disagreement, or a meaningful decrease in the heat/intensity/urgency of that disagreement.

The book The Status Game helped resolve a lot of similar questions/confusions for me. You can take a look at Morality is Scary where I quote a relevant section from it, and also this comment with another quote. Rob Henderson's idea of luxury beliefs comes from a similar vein.

In short, I think the answer is that there's a demand for using "crazy beliefs"/disagreements for tribal identification, signaling and playing status games. When available information makes some beliefs too obviously crazy to serve that role, the demand doesn't disappear but is just driven to some other topic that isn't quite as obviously settled yet.

Comment by Wei_Dai on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-15T03:59:53.503Z · LW · GW

ETA: This comment was based on a misunderstanding of the paper. Please see the ETA in Paul's reply below.

From the section on Avoiding subtle manipulation:

But from my perspective in advance, there are many possible ads I could have watched. Because I don’t understand how the ads interact with my values, I don’t have very strong preferences about which of them I see. If you asked me-in-the-present to delegate to me-in-the-future, I would be indifferent between all of these possible copies of myself who watched different ads. And if I look across all of those possible copies of me, I will see that almost all of them actually think the paperclip outcome is pretty bad, there’s just this one copy (the one who sees the actual ad that happens to exist in the real world) who comes up with a weird conclusion.

What if in most possible worlds, most unaligned AIs do a multiverse negotiation/merger, adopt a weighted average of their utility functions, so most of the ads that possible copies of you see are promoting this same merged utility function? (The fact that you're trying to filter out manipulation this way gives them extra incentive to do this merger.)

Comment by Wei_Dai on Ngo and Yudkowsky on alignment difficulty · 2021-12-15T01:25:46.652Z · LW · GW

If instead you are thinking about humans, it seems like you totally could be corrigible if you tried, and it seems like you might totally have tried if you had been raised in the right way (e.g. if your parents had lovingly but strictly trained you to be corrigible-in-way-X.)

Are there any examples of this in history, where being corrigible-in-way-X wasn't being constantly incentivized/reinforced via a larger game (e.g., a status game) that the human was embedded in? An apparently corrigible human can be modeled as optimizing for survival and social status as terminal values, and using "being corrigible" as an instrumental strategy only as long as that strategy remains effective. In other words, it's unclear that such a human is better described as "corrigible" than as "deceptive" (in the AI alignment sense).

(Humans probably have hard-coded drives for survival and social status, so it may actually be harder to train humans than AIs to be actually corrigible. My point above is just that humans don't seem to be a good example of corrigibility being easy or possible.)

Comment by Wei_Dai on Zvi’s Thoughts on the Survival and Flourishing Fund (SFF) · 2021-12-14T19:51:04.082Z · LW · GW

Was following the principle of not linking to things I consider negative.

What's the thinking behind this? (Would putting the link in parentheses or a footnote work for your purposes? I'm just thinking of the amount of time being wasted by your readers trying to find out what the institute is about.)

Their principle is to bring AI ‘under democratic control’ and then use it as a tool to force AI to enforce their political agenda

Ok, thanks. I guess one of their webpages does mention "through democratic consultation" but that didn't jump out as very salient to me until now.

Comment by Wei_Dai on Zvi’s Thoughts on the Survival and Flourishing Fund (SFF) · 2021-12-14T18:54:21.269Z · LW · GW

Suggest linking to this, as I had to spend a few minutes searching for it. Also, would be interested in why you think it's strongly net negative. (It doesn't seem like a great use of money to me, but not obviously net negative either, so I'm interested in the part of your model that's generating this conclusion.)

Comment by Wei_Dai on Considerations on interaction between AI and expected value of the future · 2021-12-14T01:02:01.088Z · LW · GW

future reflection can be expected to correct those mistakes.

I'm pretty worried that this won't happen, because these aren't "innocent" mistakes. Copying from a comment elsewhere:

Why did the Malagasy people have such a silly belief? Why do many people have very silly beliefs today? (Among the least politically risky ones to cite, someone I’ve known for years who otherwise is intelligent and successful, currently believes, or at least believed in the recent past, that 2⁄3 of everyone will die as a result of taking the COVID vaccines.) I think the unfortunate answer is that people are motivated to or are reliably caused to have certain false beliefs, as part of the status games that they’re playing. I wrote about one such dynamic, but that’s probably not a complete account.

From another comment on why reflection might not fix the mistakes:

many people are not motivated to do “rational reflection on morality” or examine their value systems to see if they would “survive full logical and empirical information”. In fact they’re motivated to do the opposite, to protect their value systems against such reflection/examination. I’m worried that alignment researchers are not worried enough that if an alignment scheme causes the AI to just “do what the user wants”, that could cause a lock-in of crazy value systems that wouldn’t survive full logical and empirical information.

One crucial question is, assuming AI will enable value lock-in when humans want it, will they use that as part of their signaling/status games? In other words, will they try to obtain higher status within their group by asking their AIs to lock in their morally relevant empirical or philosophical beliefs? A lot of people in the past used visible attempts at value lock-in (constantly going to church to reinforce their beliefs, avoiding talking with any skeptics/heretics, etc.) for signaling. Will that change when real lock-in becomes available?

Comment by Wei_Dai on Morality is Scary · 2021-12-12T20:41:11.542Z · LW · GW

And this sounds silly to us, because we know that “kicking the sunrise” is impossible, because Sun is a planet, it is far away, and your kicking has no impact on it.

I think a lot of contemporary cultures back then would have found "kicking the sunrise" to be silly, because it was obviously impossible even given what they knew at the time, i.e., you can only kick something if you physically touch it with your foot, and nobody has ever even gotten close to touching the sun, and it's even more impossible while you're asleep.

So, we should distinguish between people having different moral feelings, and having different models of the world. If you actually believed that kicking the Sun is possible and can have astronomical consequences, you would probably also perceive people sleeping westwards as criminally negligent, possibly psychopathic.

Why did the Malagasy people have such a silly belief? Why do many people have very silly beliefs today? (Among the least politically risky ones to cite, someone I've known for years who otherwise is intelligent and successful, currently believes, or at least believed in the recent past, that 2/3 of everyone will die as a result of taking the COVID vaccines.) I think the unfortunate answer is that people are motivated to or are reliably caused to have certain false beliefs, as part of the status games that they're playing. I wrote about one such dynamic, but that's probably not a complete account.

Comment by Wei_Dai on The Anthropic Trilemma · 2021-12-12T19:53:03.368Z · LW · GW

I'm not really aware of any significant progress in the last 12 years. I've mostly given up working on this problem, or on most object-level philosophical problems, due to the slow pace of progress and perceived opportunity costs. (Spending time on ensuring a future where progress on such problems can continue to be made, e.g., fighting against x-risk and value/philosophical lock-in or drift, seems a better bet even for the part of me that really wants to solve philosophical problems.) It seems like there's been a decline in other LWers' interest in the problem, maybe for similar reasons?

Comment by Wei_Dai on Morality is Scary · 2021-12-08T06:43:52.542Z · LW · GW

A utilitarian superintelligence would probably kill me and everyone I love, because we are made of atoms that could be used for minds that are more hedonic

This seems like a reasonable concern about some types of hedonic utilitarianism. To be clear, I'm not aware of any formulation of utilitarianism that doesn't have serious issues, and I'm also not aware of any formulation of any morality that doesn't have serious issues.

But, if there was a utilitarian TAI project along with a half-decent chance to do something better (by my lights), I would actively oppose the utilitarian project. From my perspective, such a project is essentially enemy combatants.

Just to be clear, this isn't in response to something I wrote, right? (I'm definitely not advocating any kind of "utilitarian TAI project" and would be quite scared of such a project myself.)

Moreover, from my perspective this kind of thing is hacks trying to work around the core issue, namely that I am not a utilitarian (along with the vast majority of people).

So what are you (and they) then? What would your utopia look like?

Comment by Wei_Dai on On the limits of idealized values · 2021-12-08T04:06:39.139Z · LW · GW

an epistemic challenge (why would we expect our normative beliefs to correlate with the non-natural normative facts?) that realists have basically no answer to except “yeah idk but maybe this is a problem for math and philosophy too?”

i think this doesn’t help at all (the basic questions about how the non-natural realm interacts with the natural one remain unanswered—and this is a classic problem for non-physicalist theories of consciousness as well), but that it gets its appeal centrally via running through people’s confusion/mystery relationship with phenomenal consciousness, which muddies the issue enough to make it seem like the move might help.

It seems that you have a tendency to take "X-ists don't have an answer to question Y" as strong evidence for "Y has no answer, assuming X" and therefore "not X", whereas I take it as only weak evidence, because it seems pretty likely that even if Y has an answer given X, humans are just not smart enough to have found it yet. This may be the main crux that explains our disagreement over meta-ethics (where I'm much more of an agnostic).

but my general feeling is that the process of stepping away from the Joe and looking at the world as a whole tends to reduce its investment in what happens to Joe in particular

This doesn't feel very motivating to me (i.e., why should I imagine idealized me being this way), absent some kind of normative force that I currently don't know about (i.e., if there was a normative fact that I should idealize myself in this way). So I'm still in a position where I'm not sure how idealization should handle status issues (among other questions/confusions about it).

Comment by Wei_Dai on Leaving Orbit · 2021-12-07T05:00:51.692Z · LW · GW

It’s important to be able to randomly exit conversations. Otherwise, people won’t add as much useful stuff to conversations in the first place (lest they be trapped).

I used to think the opposite. I'm no longer so sure but it's at least not clear which position is right. Yes, if people couldn't randomly exit, that has a cost in terms of some people being more reluctant to start/join conversations in the first place, but doesn't the same apply for many rationalist norms? It also has benefits in terms of attracting people who like knowing that a conversation won't just randomly end without them knowing why, and in terms of providing valuable info to the audience about why a conversation ended.

I really wish we could do an experiment to gather some empirical data about this, perhaps by implementing this 12-year-old feature request.

Comment by Wei_Dai on Where do selfish values come from? · 2021-12-06T23:26:57.218Z · LW · GW

Here's a related objection that may be easier to see as a valid objection than counterfactual mugging: Suppose you're about to be copied, then one of the copies will be given a choice, "A) 1 unit of pleasure to me, or B) 2 units of pleasure to the other copy." An egoist (with perception-determined utility function) before being copied would prefer that their future self/copy choose B, but if that future self/copy is an egoist (with perception-determined utility function) it would choose A instead. So before being copied, the egoist would want to self-modify to become some other kind of agent.

Comment by Wei_Dai on Where do selfish values come from? · 2021-12-06T22:44:56.775Z · LW · GW

Yes, I gave this explanation as #1 in the list in the OP, however as I tried to explain in the rest of the post, this explanation leads to other problems (that I don't know how to solve).

Comment by Wei_Dai on Speaking of Stag Hunts · 2021-12-06T22:37:16.248Z · LW · GW

Hire a team of well-paid moderators for a three-month high-effort experiment of responding to every bad comment with a fixed version of what a good comment making the same point would have looked like. Flood the site with training data.

Maybe we can start with a smaller experiment, like a group of (paid or volunteer) moderators do this for just one post? I sometimes wish that someone would point out all the flaws in my comments so I can tell what I can improve on, but I'm not sure if that won't be so unpleasant that I'd stop wanting to participate (or there would be some other negative consequence). Doing a small experiment seems like a good first step to finding out.

Assuming such experiments go well, however, I'm still worried about possible longer-term unintended consequences of having a "high standards" culture. One that I think is fairly likely is that the standards will be selectively/unevenly enforced against comments/posts that the "standards enforcers" disagree with, making it even more costly than it already is to make posts/comments that go against the consensus beliefs around here. I frequently see such selective enforcement/moderation in other "high standards" spaces, and am worried about the same thing happening here.

Comment by Wei_Dai on Morality is Scary · 2021-12-06T21:31:09.532Z · LW · GW

I don't think that's a viable alternative, given that I don't believe that egoism is certainly right (surely the right way to treat moral uncertainty can't be to just pick something and "adopt it"?), plus I don't even know how to adopt egoism if I wanted to:

Comment by Wei_Dai on "Solving" selfishness for UDT · 2021-12-06T21:30:48.172Z · LW · GW

Stuart, did you ever get around to doing that? I can't seem to find the sequel to this post.

Comment by Wei_Dai on Morality is Scary · 2021-12-05T10:50:24.799Z · LW · GW

I’m leaning towards the more ambitious version of the project of AI alignment being about corrigible anti-goodharting, with the AI optimizing towards good trajectories within scope of relatively well-understood values

Please say more about this? What are some examples of "relatively well-understood values", and what kind of AI do you have in mind that can potentially safely optimize "towards good trajectories within scope" of these values?

Comment by Wei_Dai on Morality is Scary · 2021-12-05T00:52:33.552Z · LW · GW

Bargaining assumes we can access the utility function. In reality, even if we solve the value learning problem in the single user case, once you go to the multi-user case it becomes a mechanism design problem: users have incentives to lie / misrepresent their utility functions. A perfect solution might be impossible, but I proposed mitigating this by assigning each user a virtual “AI lawyer” that provides optimal input on their behalf into the bargaining system. In this case they at least have no incentive to lie to the lawyer, and the outcome will not be skewed in favor of users who are better in this game, but we don’t get the optimal bargaining solution either.

Assuming each lawyer has the same incentive to lie as its client, it has an incentive to misrepresent some preferable-to-death outcomes as "worse-than-death", in order to force those outcomes out of the set of "feasible agreements" in the hope of getting a more preferred outcome as the actual outcome. At equilibrium, this incentive is balanced by the marginal increase in the probability of "everyone dies" becoming the outcome (due to the feasible agreements becoming a null set) caused by the lie. So the probability of "everyone dies" in this game has to be non-zero.

(It's the same kind of problem as in the AI race or tragedy of commons: people not taking into account the full social costs of their actions as they reach for private benefits.)

Of course in actuality everyone dying may not be a realistic consequence of failure to reach agreement, but if the real consequence is better than that, and the AI lawyers know this, they would be more willing to lie since the perceived downside of lying would be smaller, so you end up with a higher chance of no agreement.
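The incentive structure here can be illustrated with a toy split-the-pie model. (This is entirely my own construction for illustration, not a mechanism anyone above proposed; all the numbers are made up.) Each lawyer reports a minimum acceptable share, the mechanism splits the surplus above the reported minimums evenly, and incompatible reports yield the disagreement outcome:

```python
def outcome(report_a: float, report_b: float):
    """Return (payoff_a, payoff_b) for a pie of size 1, given each party's
    reported minimum acceptable share ("anything less is worse than death")."""
    if report_a + report_b > 1.0:
        # Reported claims exceed the pie: no feasible agreement,
        # both parties get the disagreement payoff.
        return (0.0, 0.0)
    # Split the surplus above the reported minimums equally.
    surplus = 1.0 - report_a - report_b
    return (report_a + surplus / 2, report_b + surplus / 2)

# Honest thresholds leave plenty of room for agreement.
print(outcome(0.25, 0.25))  # (0.5, 0.5)

# Exaggerating your threshold shifts the split in your favor...
print(outcome(0.5, 0.25))   # (0.625, 0.375)

# ...but if both sides exaggerate, the feasible set is empty and
# everyone gets the disagreement outcome.
print(outcome(0.75, 0.75))  # (0.0, 0.0)
```

Because unilateral exaggeration pays off, an equilibrium in which both parties report honestly is unstable, and any equilibrium with exaggerated reports puts positive probability (under uncertainty about the other side's report) on the no-agreement outcome.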

Comment by Wei_Dai on Morality is Scary · 2021-12-04T22:46:03.594Z · LW · GW

I don’t think people determine their values through either process. I think that they already have values, which are to a large extent genetic and immutable. Instead, these processes determine what values they pretend to have for game-theory reasons. So, the big difference between the groups is which “cards” they hold and/or what strategy they pursue, not an intrinsic difference in values.

This is not a theory that's familiar to me. Why do you think this is true? Have you written more about it somewhere or can link to a more complete explanation?

But also, if we do model values as the result of some long process of reflection, and you’re worried about the AI disrupting or insufficiently aiding this process, then this is already a single-user alignment issue and should be analyzed in that context first. The presumed differences in moralities are not the main source of the problem here.

This seems reasonable to me. (If this was meant to be an argument against something I said, there may have been another miscommunication, but I'm not sure it's worth tracking that down.)

Comment by Wei_Dai on Morality is Scary · 2021-12-04T21:56:54.278Z · LW · GW

I’m moderately sure what my values are, to some approximation. More importantly, I’m even more sure that, whatever my values are, they are not so extremely different from the values of most people [...]

Maybe you're just not part of the target audience of my OP then... but from my perspective, if I determine my values through the kind of process described in the first quote, and most people determine their values through the kind of process described in the second quote, it seems quite likely that the values end up being very different.

[...] that I should wage some kind of war against the majority instead of trying to arrive at a reasonable compromise.

The kind of solution I have in mind is not "waging war" but, for example, solving metaphilosophy and building an AI that can encourage philosophical reflection in humans or enhance people's philosophical abilities.

And, in the unlikely possibility that most people (including me) will turn out to be some kind of utilitarians after all, it’s not a problem: value aggregation will then produce a universe which is pretty good for utilitarians.

What if you turn out to be some kind of utilitarian but most people don't (because you're more like the first group in the OP and they're more like the second group), or most people will eventually turn out to be some kind of utilitarian in a world without AI, but in a world with AI, this will happen?

Comment by Wei_Dai on Morality is Scary · 2021-12-04T20:42:18.662Z · LW · GW

Second… Yes, for a utilitarian this doesn’t mean “much”. But, tbh, who cares? I am not a utilitarian. The vast majority of people are not utilitarians. Maybe even literally no one is an (honest, not self-deceiving) utilitarian. From my perspective, disappointing the imaginary utilitarian is (in itself) about as upsetting as disappointing the imaginary paperclip maximizer.

I'm not a utilitarian either, because I don't know what my values are or should be. But I do assign significant credence to the possibility that something in the vicinity of utilitarianism is the right set of values (for me, or period). Given my uncertainties, I want to arrange the current state of the world so that, to the extent possible, whatever I end up deciding my values are, through things like reason, deliberation, and doing philosophy, the world will ultimately not turn out to be a huge disappointment according to those values. Unfortunately, your proposed solution isn't very reassuring to this kind of view.

It's quite possible that I (and people like me) are simply out of luck, and there's just no feasible way to do what we want to do, but it sounds like you think I shouldn't even want what I want, or at least that you don't want something like this. Is it because you're already pretty sure what your values are or should be, and therefore think there's little chance that millennia from now you'll end up deciding that utilitarianism (or NU, or whatever) is right after all, and regret not doing more in 2021 to push the world in the direction of [your real values, whatever they are]?

Comment by Wei_Dai on Morality is Scary · 2021-12-04T20:15:06.216Z · LW · GW

I could make up some mechanisms for this, but probably you don’t need me for that.

I'm interested in your view on this, plus what we can potentially do to push the future in this direction.

Comment by Wei_Dai on Morality is Scary · 2021-12-04T18:37:39.864Z · LW · GW

But the future we’re discussing here is one where humans retain autonomy (?), and in that case, they’re allowed to change their mind over time, especially if humanity has access to a superintelligent aligned AI.

What if the humans ask the aligned AI to help them be more moral, and part of what they mean by "more moral" is having fewer doubts about their current moral beliefs? This is what a "status game" view of morality seems to predict, for the humans whose status games aren't based on "doing philosophy", which seems to be most of them.

Comment by Wei_Dai on Morality is Scary · 2021-12-04T17:56:49.002Z · LW · GW

Yes, I still prefer this (assuming my own private utopia) over paperclips.

For a utilitarian, this doesn't mean much. What's much more important is something like, "How close is this outcome to an actual (global) utopia (e.g., with optimized utilitronium filling the universe), on a linear scale?" For example, my rough expectation (without having thought about it much) is that your "lower bound" outcome is about midway between paperclips and actual utopia on a logarithmic scale. In one sense, this is much better than paperclips, but in another sense (i.e., on the linear scale), it's almost indistinguishable from paperclips, and a utilitarian would only care about the latter and therefore be nearly as disappointed by that outcome as paperclips.
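To make the log-vs-linear point concrete, here is a bit of toy arithmetic. (The specific magnitudes are invented assumptions chosen purely for illustration, not claims about actual stakes.)

```python
import math

# Hypothetical stakes: paperclips = 1 unit of value, full utopia = 1e20 units.
paperclips = 1.0
utopia = 1e20

# "Midway on a logarithmic scale" is the geometric mean of the endpoints.
lower_bound = math.sqrt(paperclips * utopia)  # 1e10

# On the log scale, this outcome sits exactly halfway to utopia...
log_fraction = math.log10(lower_bound / paperclips) / math.log10(utopia / paperclips)
print(log_fraction)  # 0.5

# ...but on the linear scale it captures a vanishing share of utopia's value,
# which is what a (linear-scale) utilitarian cares about.
linear_fraction = lower_bound / utopia
print(linear_fraction)  # 1e-10
```

So an outcome that looks "halfway decent" on a log scale delivers only one ten-billionth of the utilitarian value of utopia, i.e., it is almost indistinguishable from paperclips on the scale that matters to a utilitarian.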

Comment by Wei_Dai on Should we postpone getting a booster due to Omicron, till there are Omicron-specific boosters? · 2021-12-04T17:40:48.240Z · LW · GW

Thanks. I've seen a number of experts suggest that people get booster shots ASAP, but without any explicit reasoning attached. To push back a bit on this: it looks like Omicron will soon become the dominant variant almost everywhere, so subsequent variants will probably branch off it. It might therefore be worth taking additional precautions during the Omicron peak, and then getting an Omicron-specific booster when it comes out, to be better protected against future Omicron-branch variants. As mentioned in another comment, I'm waiting for some additional data to come out before making this decision.

Comment by Wei_Dai on Should we postpone getting a booster due to Omicron, till there are Omicron-specific boosters? · 2021-12-04T17:34:47.226Z · LW · GW

My current thinking is to wait at least a couple of weeks, for data to come out regarding how effective current 2-shot vaccines are against Omicron, and how much the current boosters help on top of that.

Comment by Wei_Dai on Morality is Scary · 2021-12-03T20:35:24.115Z · LW · GW

I'm suggesting that maybe some of us lucked into a status game where we use "reason" and "deliberation" and "doing philosophy" to compete for status, and that somehow "doing philosophy" etc. is a real thing that eventually leads to real answers about what values we should have (which may or may not depend on who we are). Of course I'm far from certain about this, but at least part of me wants to act as if it's true, because what other choice does it have?

Comment by Wei_Dai on Omicron Variant Post #1: We’re F***ed, It’s Never Over · 2021-12-03T07:56:15.349Z · LW · GW

A trio of vaccine experts made the same point in

Although we believe it is likely that the current vaccines will continue to protect against severe disease caused by omicron, the possible need for a booster shot targeting a vaccine-resistant variant could be a reason to hold off for now on a booster targeting the original variant. For one thing, if omicron proves resistant to vaccine-induced protection against serious disease, then a booster dose with the current vaccine may not help. It’s also possible that repeatedly “training” the immune system to fight the original variant could reduce the effectiveness of a variant-specific booster. This phenomenon, called “original antigenic sin,” has been observed with influenza and human papillomavirus vaccines. In other words, for those not in immediate need of a boost, there may be a significant advantage to waiting until a booster more closely aligned with circulating variants becomes available; boosting on the original antigen could be counterproductive.

Comment by Wei_Dai on Morality is Scary · 2021-12-03T07:39:20.098Z · LW · GW

They are prevented from simulating other pre-existing people without their consent

Why do you think this will be the result of the value aggregation (or a lower bound on how good the aggregation will be)? For example, if there is a big block of people who all want to simulate person X in order to punish that person, and only X and a few other people object, why won't the value aggregation be "nobody pre-existing except X (and Y and Z etc.) can be simulated"?

Comment by Wei_Dai on Morality is Scary · 2021-12-03T03:34:31.271Z · LW · GW

The point I was trying to make with the quote is that many people are not motivated to do "rational reflection on morality" or examine their value systems to see if they would "survive full logical and empirical information". In fact they're motivated to do the opposite, to protect their value systems against such reflection/examination. I'm worried that alignment researchers are not worried enough that if an alignment scheme causes the AI to just "do what the user wants", that could cause a lock-in of crazy value systems that wouldn't survive full logical and empirical information.

Comment by Wei_Dai on Morality is Scary · 2021-12-03T03:24:31.150Z · LW · GW

Upvoted for some interesting thoughts.

In a world where alignment has been solved to most everyone’s satisfaction, I think that the status-game / cultural narrative aspect of morality will necessarily have been taken into account. For example, imagine a post-Singularity world kind of like Scott Alexander’s Archipelago, where the ASI cooperates with each sub-community to create a customized narrative for the members to participate in. It might then slowly adjust this narrative (over decades? centuries?) to align better with human flourishing in other dimensions.

Can you say more about how you see us getting from here to there?