Wei Dai's Shortform 2024-03-01T20:43:15.279Z
Managing risks while trying to do good 2024-02-01T18:08:46.506Z
AI doing philosophy = AI generating hands? 2024-01-15T09:04:39.659Z
UDT shows that decision theory is more puzzling than ever 2023-09-13T12:26:09.739Z
Meta Questions about Metaphilosophy 2023-09-01T01:17:57.578Z
Why doesn't China (or didn't anyone) encourage/mandate elastomeric respirators to control COVID? 2022-09-17T03:07:39.080Z
How to bet against civilizational adequacy? 2022-08-12T23:33:56.173Z
AI ethics vs AI alignment 2022-07-26T13:08:48.609Z
A broad basin of attraction around human values? 2022-04-12T05:15:14.664Z
Morality is Scary 2021-12-02T06:35:06.736Z
(USA) N95 masks are available on Amazon 2021-01-18T10:37:40.296Z
Anti-EMH Evidence (and a plea for help) 2020-12-05T18:29:31.772Z
A tale from Communist China 2020-10-18T17:37:42.228Z
Everything I Know About Elite America I Learned From ‘Fresh Prince’ and ‘West Wing’ 2020-10-11T18:07:52.623Z
Tips/tricks/notes on optimizing investments 2020-05-06T23:21:53.153Z
Have epistemic conditions always been this bad? 2020-01-25T04:42:52.190Z
Against Premature Abstraction of Political Issues 2019-12-18T20:19:53.909Z
What determines the balance between intelligence signaling and virtue signaling? 2019-12-09T00:11:37.662Z
Ways that China is surpassing the US 2019-11-04T09:45:53.881Z
List of resolved confusions about IDA 2019-09-30T20:03:10.506Z
Don't depend on others to ask for explanations 2019-09-18T19:12:56.145Z
Counterfactual Oracles = online supervised learning with random selection of training episodes 2019-09-10T08:29:08.143Z
AI Safety "Success Stories" 2019-09-07T02:54:15.003Z
Six AI Risk/Strategy Ideas 2019-08-27T00:40:38.672Z
Problems in AI Alignment that philosophers could potentially contribute to 2019-08-17T17:38:31.757Z
Forum participation as a research strategy 2019-07-30T18:09:48.524Z
On the purposes of decision theory research 2019-07-25T07:18:06.552Z
AGI will drastically increase economies of scale 2019-06-07T23:17:38.694Z
How to find a lost phone with dead battery, using Google Location History Takeout 2019-05-30T04:56:28.666Z
Where are people thinking and talking about global coordination for AI safety? 2019-05-22T06:24:02.425Z
"UDT2" and "against UD+ASSA" 2019-05-12T04:18:37.158Z
Disincentives for participating on LW/AF 2019-05-10T19:46:36.010Z
Strategic implications of AIs' ability to coordinate at low cost, for example by merging 2019-04-25T05:08:21.736Z
Please use real names, especially for Alignment Forum? 2019-03-29T02:54:20.812Z
The Main Sources of AI Risk? 2019-03-21T18:28:33.068Z
What's wrong with these analogies for understanding Informed Oversight and IDA? 2019-03-20T09:11:33.613Z
Three ways that "Sufficiently optimized agents appear coherent" can be false 2019-03-05T21:52:35.462Z
Why didn't Agoric Computing become popular? 2019-02-16T06:19:56.121Z
Some disjunctive reasons for urgency on AI risk 2019-02-15T20:43:17.340Z
Some Thoughts on Metaphilosophy 2019-02-10T00:28:29.482Z
The Argument from Philosophical Difficulty 2019-02-10T00:28:07.472Z
Why is so much discussion happening in private Google Docs? 2019-01-12T02:19:19.332Z
Two More Decision Theory Problems for Humans 2019-01-04T09:00:33.436Z
Two Neglected Problems in Human-AI Safety 2018-12-16T22:13:29.196Z
Three AI Safety Related Ideas 2018-12-13T21:32:25.415Z
Counterintuitive Comparative Advantage 2018-11-28T20:33:30.023Z
A general model of safety-oriented AI development 2018-06-11T21:00:02.670Z
Beyond Astronomical Waste 2018-06-07T21:04:44.630Z
Can corrigibility be learned safely? 2018-04-01T23:07:46.625Z
Multiplicity of "enlightenment" states and contemplative practices 2018-03-12T08:15:48.709Z


Comment by Wei Dai (Wei_Dai) on 3b. Formal (Faux) Corrigibility · 2024-06-12T23:47:11.514Z · LW · GW

I now think that corrigibility is a single, intuitive property

My intuitive notion of corrigibility can be straightforwardly leveraged to build a formal, mathematical measure.

This formal measure is still lacking, and almost certainly doesn’t actually capture what I mean by “corrigibility.”

I don't know, maybe it's partially or mostly my fault for reading too much optimism into these passages... But I think it would have managed my expectations better to say something like "my notion of corrigibility heavily depends on a subnotion of 'don't manipulate the principals' values' which is still far from being well-understood or formalizable."

Switching topics a little, I think I'm personally pretty confused about what human values are and therefore what it means to not manipulate someone's values. Since you're suggesting relying less on formalization and more on "examples of corrigibility collected in a carefully-selected dataset", how would you go about collecting such examples?

(One concern is that you could easily end up with a dataset that embodies a hodgepodge of different ideas of what "don't manipulate" means and then it's up to luck whether the AI generalizes from that in a correct or reasonable way.)

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2024-06-11T23:16:19.406Z · LW · GW

Thanks, Alex. Any connections between this and CTMU? (I'm in part trying to evaluate CTMU by looking at whether it has useful implications for an area that I'm relatively familiar with.)

BTW, @jessicata, do you still endorse this post, and what other posts should I read to get up to date on your current thinking about decision theory?

Comment by Wei Dai (Wei_Dai) on 3b. Formal (Faux) Corrigibility · 2024-06-11T11:06:11.901Z · LW · GW

Additional work is surely needed in developing a good measure of the kind of value modification that we don’t like while still leaving room for the kind of growth and updating that we do like.

I flagged a similar problem in a slightly different context several years ago, but don't know of any significant progress on it.

A (perhaps overly) simple measure of value modification is to measure the difference between the Value distribution given some policy and the Value distribution under the null policy. This seems like a bad choice in that it discourages the AI from taking actions which help us update in ways that we reflectively desire, even when those actions are as benign as talking about the history of philosophy.

It also prevents the AI from taking action to defend the principals against value manipulation by others. (Even if the principals request such defense, I think?) Because the AI has to keep the principals' values as close as possible to what they would be under the null policy, in order to maximize (your current formalization of) corrigibility.

Actually, have you thought about what P(V|pi_0) would actually be? If counterfactually, the CAST AI adopted the null policy, what would that imply about the world in general and hence subsequent evolution of the principals' values?
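To make the objection above concrete, here is a toy sketch (my own construction with made-up numbers, not the post's actual formalism) of the "difference from the null policy" measure. Value-states are modeled as a discrete distribution, and the penalty is KL divergence from the distribution the null policy pi_0 would produce:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as dicts of probabilities."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

# Hypothetical distributions over the principal's value-state after a year.
null_policy   = {"status_quo": 0.90, "reflective_growth": 0.08, "manipulated": 0.02}
teach_history = {"status_quo": 0.60, "reflective_growth": 0.38, "manipulated": 0.02}
manipulate    = {"status_quo": 0.10, "reflective_growth": 0.05, "manipulated": 0.85}

# The naive measure penalizes any departure from pi_0 -- including the benign
# policy that merely helps the principal update in ways they'd endorse.
print(kl_divergence(teach_history, null_policy))  # positive: benign teaching is penalized
print(kl_divergence(manipulate, null_policy))     # larger, but only by degree
```

The sketch shows the two problems in one place: benign value growth scores strictly worse than doing nothing, and outright manipulation differs from it only quantitatively, not in kind.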

You've also said that the sim(...) part doesn't work, so I won't belabor the point, but I'm feeling a bit rug-pulled given the relatively optimistic tone in the earlier posts. I've been skeptical of earlier proposals targeting corrigibility, where the promise is that it lets us avoid having to understand human values. A basic problem I saw was, if we don't understand human values, how are we going to avoid letting our own AI or other AIs manipulate our values? Your work seems to suggest that this was a valid concern, and that there has been essentially zero progress to either solve or bypass this problem over the years.

Comment by Wei Dai (Wei_Dai) on AALWA: Ask any LessWronger anything · 2024-06-10T07:44:09.473Z · LW · GW

Can't claim to have put much thought into this topic, but here are my guesses of what the most cost-effective ways of throwing money at the problem of reducing existential risk might include:

  1. Research into human intelligence enhancement, e.g., tech related to embryo selection.
  2. Research into how to design/implement an international AI pause treaty, perhaps x-risk governance in general.
  3. Try to identify more philosophical talent across the world and pay them to make philosophical progress, especially in metaphilosophy. (I'm putting some of my own money into this.)
  4. Research into public understanding of x-risks, what people's default risk tolerances are, what arguments can or can't they understand, etc.
  5. Strategy think tanks that try to keep a big picture view of everything, propose new ideas or changes to what people/orgs should do, discuss these ideas with the relevant people, etc.
Comment by Wei Dai (Wei_Dai) on 1. The CAST Strategy · 2024-06-10T04:09:35.714Z · LW · GW

But I also think that if you gave me a year where I had lots of money, access, and was free from people trying to pressure me, I would have a good shot at pulling it off.

Want to explain a bit about how you'd go about doing this? Seems like you're facing some problems similar to those of ensuring that an AI is wise, benevolent, and stable, e.g., not knowing what wisdom really is, distribution shift between testing and deployment, adversarial examples/inputs.

This is indeed my overall suggested strategy, with CAST coming after a “well, if you’re going to try to build it anyway you might as well die with a bit more dignity by...” disclaimer.

I think this means you should be extra careful not to inadvertently make people too optimistic about alignment, which would make coordination to stop capabilities research even harder than it already is. For example you said that you "like" the visualization of 5 humans selected by various governments, without mentioning that you don't trust governments to do this, which seems like a mistake?

Comment by Wei Dai (Wei_Dai) on 1. The CAST Strategy · 2024-06-09T08:22:38.347Z · LW · GW

A visualization that I like is imagining a small group of, say, five humans selected by various governments for being wise, benevolent, and stable.

I think this might be a dealbreaker. I don't trust the world's governments to come up with 5 humans who are sufficiently wise, benevolent, and stable. (Do you really?) I'm not sure I can come up with 5 such people myself. None of the alternatives you talk about seem acceptable either.

I think maybe a combination of two things could change my mind, but both seem very hard and have close to nobody working on them:

  1. The AI is very good at helping the principals be wise and stable, for example by being super-competent at philosophy. (I think this may also require being less than maximally corrigible, but I'm not sure.) Otherwise what happens if, e.g., the principals or AI start thinking about distant superintelligences?
  2. There is some way to know that benevolence is actually the CEV of such a group, i.e., they're not just "deceptively aligned", or something like that, while not having much power.
Comment by Wei Dai (Wei_Dai) on mesaoptimizer's Shortform · 2024-06-07T23:07:36.560Z · LW · GW

Yeah it seems like a bunch of low hanging fruit was picked around that time, but that opened up a vista of new problems that are still out of reach. I wrote a post about this, which I don't know if you've seen or not.

(This has been my experience with philosophical questions in general, that every seeming advance just opens up a vista of new harder problems. This is a major reason that I switched my attention to trying to ensure that AIs will be philosophically competent, instead of object-level philosophical questions.)

Comment by Wei Dai (Wei_Dai) on The Standard Analogy · 2024-06-06T22:48:29.293Z · LW · GW

Thanks for your insightful answers. You may want to make a top-level post on this topic to get more visibility. If only a very small fraction of the world is likely to ever understand and take into account many important ideas/considerations about AI x-safety, that changes the strategic picture considerably, and people around here may not be sufficiently "pricing it in". I think I'm still in the process of updating on this myself.

Having more intelligence seems to directly or indirectly improve at least half of the items on your list. So doing an AI pause and waiting for (or encouraging) humans to become smarter still seems like the best strategy. Any thoughts on this?

And I guess this… just doesn’t seem to be the case (at least to an outsider like me)?

I may be too sensitive about unintentionally causing harm, after observing many others do this. I was also just responding to what you said earlier, where it seemed like I was maybe causing you personally to be too pessimistic about contributing to solving the problems.

you probably knew him personally?

No, I never met him and didn't interact online much. He does seem like a good example of what you're talking about.

Comment by Wei Dai (Wei_Dai) on Former OpenAI Superalignment Researcher: Superintelligence by 2030 · 2024-06-06T07:48:24.122Z · LW · GW

Some questions for @leopold.

  1. Anywhere I can listen to or read your debates with "doomers"?
  2. We share a strong interest in economics, but apparently not in philosophy. I'm curious if this is true, or you just didn't talk about it in the places I looked.
  3. What do you think about my worries around AIs doing philosophy? See this post or my discussion about it with Jan Leike.
  4. What do you think about my worries around AGI being inherently centralizing and/or offense-favoring and/or anti-democratic (aside from above problems, how would elections work when minds can be copied at little cost)? Seems like the free world "prevailing" on AGI might well be a Pyrrhic victory unless we can also solve these follow-up problems, but you don't address them.
  5. More generally, do you have a longer term vision of how your proposal leads to a good outcome for our lightcone, avoiding all the major AI-related x-risks and s-risks?
  6. Why are you not in favor of an AI pause treaty with other major nations? (You only talk about unilateral pause in the section "AGI Realism".) China is currently behind in chips and AI and it seems hard to surpass the entire West in a chips/AI race, so why would they not go for an AI pause treaty to preserve the status quo instead of risking a US-led intelligence explosion (not to mention x-risks)?
Comment by Wei Dai (Wei_Dai) on The Standard Analogy · 2024-06-05T23:37:13.248Z · LW · GW

In my view, the main good outcomes of the AI transition are 1) we luck out, AI x-safety is actually pretty easy across all the subproblems 2) there's an AI pause, humans get smarter via things like embryo selection, then solve all the safety problems.

I'm mainly pushing for #2, but also don't want to accidentally make #1 less likely. It seems like one of the main ways in which I could end up having a negative impact is to persuade people that the problems are definitely too hard and hence not worth trying to solve, and it turns out the problems could have been solved with a little more effort.

"it doesn’t seem like you have answers to (or even a great path forward on) these questions either despite your great interest in and effort spent on them, which bodes quite terribly for the rest of us" is a bit worrying from this perspective, and also because my "effort spent on them" isn't that great. As I don't have a good approach to answering these questions, I mainly just have them in the back of my mind while my conscious effort is mostly on other things.

BTW I'm curious what your background is and how you got interested/involved in AI x-safety. It seems rare for newcomers to the space (like you seem to be) to quickly catch up on all the ideas that have been developed on LW over the years, and many recently drawn to AGI instead appear to get stuck on positions/arguments from decades ago. For example, r/Singularity has 2.5M members and seems to be dominated by accelerationism. Do you have any insights about this? (How were you able to do this? How to help others catch up? Intelligence is probably a big factor which is why I'm hoping that humanity will automatically handle these problems better once it gets smarter, but many seem plenty smart and still stuck on primitive ideas about AI x-safety.)

Comment by Wei Dai (Wei_Dai) on The Standard Analogy · 2024-06-05T01:33:16.696Z · LW · GW

Simplicia: Hm, perhaps a crux between us is how narrow of a target is needed to realize how much of the future's value. I affirm the orthogonality thesis, but it still seems plausible to me that the problem we face is more forgiving, not so all-or-nothing as you portray it.

I agree that it's plausible. I even think a strong form of moral realism (denial of orthogonality thesis) is plausible. My objection is that humanity should figure out what is actually the case first (or have some other reasonable plan of dealing with this uncertainty), instead of playing logical Russian roulette like it seems to be doing. I like that Simplicia isn't being overconfident here, but is his position actually that "seems plausible to me that the problem we face is more forgiving" is sufficient basis for moving forward with building AGI? (Does any real person in the AI risk debate have a position like this?)

Comment by Wei Dai (Wei_Dai) on Introducing AI Lab Watch · 2024-05-27T19:17:44.113Z · LW · GW
  1. Publish important governance documents. (Seemed too basic to mention, but apparently not.)
Comment by Wei Dai (Wei_Dai) on What would stop you from paying for an LLM? · 2024-05-22T13:26:08.626Z · LW · GW

I also am not paying for any LLM. Between Microsoft's Copilot (formerly Bing Chat), LMSYS Chatbot Arena, and Codeium, I have plenty of free access to SOTA chatbots/assistants. (Slightly worried that I'm contributing to race dynamics or AI risk in general even by using these systems for free, but not enough to stop, unless someone wants to argue for this.)

Comment by Wei Dai (Wei_Dai) on Introducing AI Lab Watch · 2024-05-22T02:05:35.074Z · LW · GW

Unfortunately I don't have well-formed thoughts on this topic. I wonder if there are people who specialize in AI lab governance and have written about this, but I'm not personally aware of such writings. To brainstorm some ideas:

  1. Conduct and publish anonymous surveys of employee attitudes about safety.
  2. Encourage executives, employees, board members, advisors, etc., to regularly blog about governance and safety culture, including disagreements over important policies.
  3. Officially encourage (e.g. via financial rewards) internal and external whistleblowers. Establish and publish policies about this.
  4. Publicly make safety commitments and regularly report on their status, such as how much compute and other resources have been allocated/used by which safety teams.
  5. Make/publish a commitment to publicly report negative safety news, which can be used as basis for whistleblowing if needed (i.e. if some manager decides to hide such news instead).
Comment by Wei Dai (Wei_Dai) on OpenAI: Exodus · 2024-05-21T15:02:02.008Z · LW · GW

I'd like to hear from people who thought that AI companies would act increasingly reasonable (from an x-safety perspective) as AGI got closer. Is there still a viable defense of that position (e.g., that SamA being in his position / doing what he's doing is just uniquely bad luck, not reflecting what is likely to be happening / will happen at other AI labs)?

Also, why is there so little discussion of x-safety culture at other AI labs? I asked on Twitter and did not get a single relevant response. Are other AI company employees also reluctant to speak out? If so, that seems bad (every explanation I can think of seems bad, including default incentives + companies not proactively encouraging transparency).

Comment by Wei Dai (Wei_Dai) on Introducing AI Lab Watch · 2024-05-21T03:41:21.585Z · LW · GW

Suggest having a row for "Transparency", to cover things like whether the company encourages or discourages whistleblowing, does it report bad news about alignment/safety (such as negative research results) or only good news (new ideas and positive results), does it provide enough info to the public to judge the adequacy of its safety culture and governance, etc.

Comment by Wei Dai (Wei_Dai) on Stephen Fowler's Shortform · 2024-05-20T09:20:59.082Z · LW · GW

It's also notable that the topic of OpenAI nondisparagement agreements was brought to Holden Karnofsky's attention in 2022, and he replied with "I don’t know whether OpenAI uses nondisparagement agreements; I haven’t signed one." (He could have asked his contacts inside OAI about it, or asked the EA board member to investigate. Or even set himself up earlier as someone OpenAI employees could whistleblow to on such issues.)

If the point was to buy a ticket to play the inside game, then it was played terribly and negative credit should be assigned on that basis, and for misleading people about how prosocial OpenAI was likely to be (due to having an EA board member).

Comment by Wei Dai (Wei_Dai) on Stephen Fowler's Shortform · 2024-05-19T20:34:55.055Z · LW · GW

Agreed that it reflects badly on the people involved, although less on Paul since he was only a "technical advisor" and arguably less responsible for thinking through / due diligence on the social aspects. It's frustrating to see the EA community (on EAF and Twitter at least) and those directly involved all ignoring this.

("shouldn’t be allowed anywhere near AI Safety decision making in the future" may be going too far though.)

Comment by Wei Dai (Wei_Dai) on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-17T20:23:31.097Z · LW · GW

So these resignations don’t negatively impact my p(doom) in the obvious way. The alignment people at OpenAI were already powerless to do anything useful regarding changing the company direction.

How were you already sure of this before the resignations actually happened? I of course had my own suspicions that this was the case, but was uncertain enough that the resignations are still a significant negative update.

ETA: Perhaps worth pointing out here that Geoffrey Irving recently left Google DeepMind to be Research Director at UK AISI, but seemingly on good terms (since Google DeepMind recently reaffirmed its intention to collaborate with UK AISI).

Comment by Wei Dai (Wei_Dai) on Wei Dai's Shortform · 2024-05-17T20:13:48.628Z · LW · GW

Bad: AI developers haven't taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.

Turns out at least one scalable alignment team has been struggling for resources. From Jan Leike (formerly co-head of Superalignment at OpenAI):

Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.

Even worse, apparently the whole Superalignment team has been disbanded.

Comment by Wei Dai (Wei_Dai) on quila's Shortform · 2024-05-08T06:58:48.424Z · LW · GW

These may be among the ‘most direct’ or ‘simplest to imagine’ possible actions, but in the case of superintelligence, simplicity is not a constraint.

I think it is considered a constraint by some because they think that it would be easier/safer to use a superintelligent AI to do simpler actions, while alignment is not yet fully solved. In other words, if alignment was fully solved, then you could use it to do complicated things like what you suggest, but there could be an intermediate stage of alignment progress where you could safely use SI to do something simple like "melt GPUs" but not to achieve more complex goals.

Comment by Wei Dai (Wei_Dai) on Rapid capability gain around supergenius level seems probable even without intelligence needing to improve intelligence · 2024-05-08T06:35:39.797Z · LW · GW

Some evidence in favor of your explanation (being at least a correct partial explanation):

  1. von Neumann apparently envied Einstein's physics intuitions, while Einstein lacked von Neumann's math skills. This seems to suggest that they were "tuned" in slightly different directions.
  2. Neither of the two seems superhumanly accomplished in other areas (that a smart person/agent might have goals for), such as making money, moral/philosophical progress, changing culture/politics in their preferred direction.

(An alternative explanation for 2 is that they could have been superhuman in other areas but their terminal goals did not chain through instrumental goals in those areas, which in turn raises the question of what those terminal goals must have been for this explanation to be true and what that says about human values.)

I note that under your explanation, someone could surprise the world by tuning a not-particularly-advanced AI for a task nobody previously thought to tune AI for, or by inventing a better tuning method (either general or specialized), thus achieving a large capability jump in one or more domains. Not sure how worrisome this is though.

Comment by Wei Dai (Wei_Dai) on How do open AI models affect incentive to race? · 2024-05-07T04:38:50.813Z · LW · GW

A government might model the situation as something like "the first country/coalition to open up an AI capabilities gap of size X versus everyone else wins" because it can then easily win a tech/cultural/memetic/military/economic competition against everyone else and take over the world. (Or a fuzzy version of this to take into account various uncertainties.) Seems like a very different kind of utility function.

Comment by Wei Dai (Wei_Dai) on How do open AI models affect incentive to race? · 2024-05-07T03:42:14.869Z · LW · GW

Hmm, open models make it easier for a corporation to train closed models, but also make that activity less profitable, whereas for a government the latter consideration doesn't apply or has much less weight, so it seems much clearer that open models increase overall incentive for AI race between nations.

Comment by Wei Dai (Wei_Dai) on How do open AI models affect incentive to race? · 2024-05-07T03:04:21.086Z · LW · GW

I think open source models probably reduce profit incentives to race, but can increase strategic (e.g., national security) incentives to race. Consider that if you're the Chinese government, you might think that you're too far behind in AI and can't hope to catch up, and therefore decide to spend your resources on other ways to mitigate the risk of a future transformative AI built by another country. But then an open model is released, and your AI researchers catch up to near state-of-the-art by learning from it, which may well change your (perceived) tradeoffs enough that you start spending a lot more on AI research.

Comment by Wei Dai (Wei_Dai) on The formal goal is a pointer · 2024-05-03T03:27:57.512Z · LW · GW

What do you think of this post by Tammy?

It seems like someone could definitely be wrong about what they want (unless normative anti-realism is true and such a sentence has no meaning). For example consider someone who thinks it's really important to be faithful to God and goes to church every Sunday to maintain their faith and would use a superintelligent religious AI assistant to help keep the faith if they could. Or maybe they're just overconfident about their philosophical abilities and would fail to take various precautions that I think are important in a high-stakes reflective process.

Mostly that thing where we had a lying vs lie-detecting arms race and the liars mostly won by believing their own lies and that’s how we have things like overconfidence bias and self-serving bias and a whole bunch of other biases.

Are you imagining that the RL environment for AIs will be single-player, with no social interactions? If yes, how will they learn social skills? If no, why wouldn't the same thing happen to them?

Unless we do a very stupid thing like reading the AI’s thoughts and RL-punish wrongthink, this seems very unlikely to happen.

We already RL-punish AIs for saying things that we don't like (via RLHF), and in the future will probably punish them for thinking things we don't like (via things like interpretability). Not sure how to avoid this (given current political realities) so safety plans have to somehow take this into account.

Comment by Wei Dai (Wei_Dai) on Which skincare products are evidence-based? · 2024-05-03T03:07:55.519Z · LW · GW

Retinoids, which is a big family of compounds, but I would go with adapalene, which has a better safety/side-effect profile than anything else. It has less scientific evidence for anti-aging than other retinoids (and is not marketed for that purpose), but I've tried it myself (bought it for acne), and it has very obvious anti-wrinkle effects within like a week. You can get generic 0.1% adapalene gel on Amazon for 1.6oz/$12.

(I'm a little worried about long term effects, i.e. could the increased skin turnover mean faster aging in the long run, but can't seem to find any data or discussion about it.)

Comment by Wei Dai (Wei_Dai) on The formal goal is a pointer · 2024-05-02T05:09:40.370Z · LW · GW

I would honestly be pretty comfortable with maximizing SBF’s CEV.

Yikes, I'm not even comfortable maximizing my own CEV. One crux may be that I think a human's values may be context-dependent. In other words, current me-living-in-a-normal-society may have different values from me-given-keys-to-the-universe and should not necessarily trust that version of myself. (Similar to how earlier idealistic Mao shouldn't have trusted his future self.)

My own thinking around this is that we need to advance metaphilosophy and social epistemology, engineer better discussion rules/norms/mechanisms and so on, design a social process that most people can justifiably trust in (i.e., is likely to converge to moral truth or actual representative human values or something like that), then give AI a pointer to that, not any individual human's reflection process which may be mistaken or selfish or skewed.

TLDR: Humans can be powerful and overconfident. I think this is the main source of human evil. I also think this is unlikely to naturally be learned by RL in environments that don’t incentivize irrationality (like ours did).

Where is the longer version of this? I do want to read it. :) Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn't RL environments for AI cause the same or perhaps a different set of irrationalities?

Also, how does RL fit into QACI? Can you point me to where this is discussed?

Comment by Wei Dai (Wei_Dai) on The formal goal is a pointer · 2024-05-02T03:26:35.007Z · LW · GW

Luckily the de-facto nominees for this position are alignment researchers, who pretty strongly self-select for having cosmopolitan altruistic values.

But we could have said the same thing of SBF, before the disaster happened.

Due to very weird selection pressure, humans ended up really smart but also really irrational. [...] An AGI (at least, one that comes from something like RL rather than being conjured in a simulation or something else weird) will probably end up with a way higher rationality:intelligence ratio, and so it will be much less likely to destroy everything we value than an empowered human.

Please explain your thinking behind this?

Dealing with moral uncertainty is just part of expected utility maximization.

It's not, because some moral theories are not compatible with EU maximization, and of the ones that are, it's still unclear how to handle uncertainty between them.
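A toy illustration (my own, with hypothetical numbers) of why it's unclear how to handle uncertainty even between theories that are EU-compatible: the verdict of naively mixing theory-relative utilities by credence depends on an arbitrary choice of scale for each theory (the intertheoretic comparison problem).

```python
def expected_utility(action_utils, credences):
    """Mix theory-relative utilities of one action by credence in each theory."""
    return sum(credences[t] * u for t, u in action_utils.items())

credences = {"utilitarian": 0.5, "deontological": 0.5}

# Utilities of an action vs. doing nothing, on each theory's "native" scale...
act = {"utilitarian": 10.0, "deontological": -1.0}
alt = {"utilitarian": 0.0, "deontological": 0.0}
print(expected_utility(act, credences) > expected_utility(alt, credences))  # True

# ...but rescale the deontological utilities by 100 (an equally valid
# representation of the same preference ordering) and the verdict flips.
act_rescaled = {"utilitarian": 10.0, "deontological": -100.0}
print(expected_utility(act_rescaled, credences) > expected_utility(alt, credences))  # False
```

Since each theory's utility function is only defined up to positive affine transformation, nothing within expected utility theory itself fixes the relative scales, which is one reason moral uncertainty isn't "just" EU maximization.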

Comment by Wei Dai (Wei_Dai) on Ironing Out the Squiggles · 2024-05-02T03:09:46.118Z · LW · GW

the inductive bias doesn’t precisely match human vision, so it has different mistakes, but as you scale both architectures they become more similar. that’s exactly what you’d expect for any approximately Bayesian setup.

I can certainly understand that as you scale both architectures, they both make fewer mistakes in distribution. But do they also generalize out of the training distribution more similarly? If so, why? Can you explain this more? (I'm not getting your point from just "approximately Bayesian setup".)

They needed a giant image classification dataset which I don’t think even existed 5 years ago.

This is also confusing/concerning for me. Why would it be necessary or helpful to have such a large dataset to align the shape/texture bias with humans?

Comment by Wei Dai (Wei_Dai) on Ironing Out the Squiggles · 2024-05-01T22:11:05.843Z · LW · GW

Do you know if it is happening naturally from increased scale, or only correlated with scale (people are intentionally trying to correct the "misalignment" between ML and humans of shape vs texture bias by changing aspects of the ML system like its training and architecture, and simultaneously increasing scale)? I somewhat suspect the latter due to the existence of a benchmark that the paper seems to target ("humans are at 96% shape / 4% texture bias and ViT-22B-384 achieves a previously unseen 87% shape bias / 13% texture bias").

In either case, it seems kind of bad that it has taken a decade or two to get to this point from when adversarial examples were first noticed, and it's unclear whether other adversarial examples or "misalignment" remain in the vision transformer. If the first transformative AIs don't quite learn the right values due to having a different inductive bias from humans, it may not matter much that 10 years later the problem would be solved.

Comment by Wei Dai (Wei_Dai) on Why I’m not working on {debate, RRM, ELK, natural abstractions} · 2024-05-01T05:26:28.895Z · LW · GW

Traditionally, those techniques are focused on what the model is outputting, not what the model’s underlying motivations are. But I haven’t read all the literature. Am I missing something?

It's confusing to me as well, perhaps because different people (or even the same person at different times) emphasize different things within the same approach, but here's one post where someone said, "It is important that the overseer both knows which action the distilled AI wants to take as well as why it takes that action."

Comment by Wei Dai (Wei_Dai) on The formal goal is a pointer · 2024-05-01T04:02:47.534Z · LW · GW

Did SBF or Mao Zedong not have a pointer to the right values, or did they have the right pointer but make mistakes due to computational issues (i.e., would they have avoided causing the disasters that they did if they were smarter and/or had more time to think)? Both seem possible to me, so I'd like to understand how the QACI approach would solve (or rule out) both of these potential problems:

  1. If many humans don't have pointers to the right values, how do we make sure QACI gets its pointer from humans who do?
  2. How to make sure that AI will not make some catastrophic mistake while it's not smart enough to fully understand the values we give it, while still being confident enough in its guesses of what to do in the short term to do useful things?

Moral uncertainty is an area in philosophy with ongoing research, and assuming that AI will handle it correctly by default seems unsafe, similar to assuming that AI will have the right decision theory by default.

I see that Tamsin Leake also pointed out 2 above as a potential problem, but I don't see anything that looks like a potential solution in the QACI table of contents.

Comment by Wei Dai (Wei_Dai) on Ironing Out the Squiggles · 2024-05-01T02:29:23.881Z · LW · GW

Katja Grace notes that image synthesis methods have no trouble generating photorealistic human faces.

They're terrible at hands though (which has ruined many otherwise good images for me). That post used Stable Diffusion 1.5, but even the latest SD 3.0 (with versions 2.0, 2.1, XL, Stable Cascade in between) is still terrible at it.

Don't really know how relevant this is to your point/question about fragility of human values, but thought I'd mention it since it seems plausibly as relevant as AIs being able to generate photorealistic human faces.

Comment by Wei Dai (Wei_Dai) on Ironing Out the Squiggles · 2024-04-30T08:20:59.656Z · LW · GW

Adversarial examples suggest to me that by default ML systems don't necessarily learn what we want them to learn:

  1. They put too much emphasis on high frequency features, suggesting a different inductive bias from humans.
  2. They don't handle contradictory evidence in a reasonable way, i.e., giving a confident answer when high frequency features (pixel-level details) and low frequency features (overall shape) point to different answers.

The evidence from adversarial training suggests to me that AT is merely patching symptoms (e.g., making the ML system de-emphasize certain specific features) and not fixing the underlying problem. At least this is my impression from watching this video on Adversarial Robustness, specifically the chapters on Adversarial Arms Race and Unforeseen Adversaries.
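To make the first point concrete, here is a minimal FGSM-style sketch (a toy logistic-regression "model" with made-up numbers of my own, not anything from the post or the video): a tiny worst-case perturbation, aligned with the loss gradient rather than with anything a human would notice, flips the classification.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(x, y, w, b, eps):
    """One FGSM step: move x by eps in the direction that increases
    the logistic loss, i.e. x' = x + eps * sign(dL/dx)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    g = sigmoid(z) - y  # dL/dx_i = (sigmoid(z) - y) * w_i
    return [xi + eps * (1 if g * wi > 0 else -1 if g * wi < 0 else 0)
            for xi, wi in zip(x, w)]

# A point the toy model classifies correctly (score z > 0 means class 1)...
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
z = sum(wi * xi for wi, xi in zip(w, x)) + b          # 0.8: class 1
x_adv = fgsm(x, y, w, b, eps=0.5)
z_adv = sum(wi * xi for wi, xi in zip(w, x_adv)) + b  # -0.7: flipped to class 0
```

Adversarial training would fold points like `x_adv` back into the training set with the original label, which is the symptom-patching I mean: it de-emphasizes the exploited direction without obviously fixing whatever the model learned wrong in the first place.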

Aside from this, it's also unclear how to apply AT to your original motivation:

A function that tells your AI system whether an action looks good and is right virtually all of the time on natural inputs isn’t safe if you use it to drive an enormous search for unnatural (highly optimized) inputs on which it might behave very differently.

because in order to apply AT we need a model of what "attacks" the adversary is allowed to do (in this case the "attacker" is a superintelligence trying to optimize the universe, so we have to model it as being allowed to do anything?) and also ground-truth training labels.

For this purpose, I don't think we can use the standard AT practice of assuming that any data point within a certain distance of a human-labeled instance, according to some metric, has the same label as that instance. Suppose we instead let the training process query humans directly for training labels (i.e., how good some situation is) on arbitrary data points; that's slow/costly if the process isn't very sample-efficient (which modern ML isn't), and also scary given that human implementations of human values may themselves have adversarial examples. (The "perceptual wormholes" work and other evidence suggest that humans also aren't 100% adversarially robust.)

My own thinking is that we probably need to go beyond adversarial training for this, along the lines of solving metaphilosophy and then using that solution to find/fix existing adversarial examples and correctly generalize human values out of distribution.

Comment by Wei Dai (Wei_Dai) on A proposed method for forecasting transformative AI · 2024-04-27T13:22:49.179Z · LW · GW

I'm confused about how heterogeneity in data quality interacts with scaling. Surely training an LM on scientific papers would give different results from training it on web spam, but data quality is not an input to the scaling law... This makes me wonder whether your proposed forecasting method might have some kind of blind spot in this regard, for example failing to take into account that AI labs have probably already fed all the scientific papers they can into their training processes. If future LMs train on additional data that have little to do with science, could that keep reducing overall cross-entropy loss (as scientific papers become a smaller fraction of the overall corpus) but fail to increase scientific ability?
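The worry can be illustrated with a toy calculation (hypothetical loss numbers of my own, not real measurements): overall cross-entropy on a mixed corpus is just the data-weighted average of per-domain losses, so it can keep falling even while loss on the scientific slice stays completely flat.

```python
def overall_loss(frac_science, loss_science, loss_other):
    """Cross-entropy on a mixed corpus = data-weighted average of
    per-domain cross-entropies (in nats/token)."""
    return frac_science * loss_science + (1 - frac_science) * loss_other

# Earlier corpus: 10% science. Later corpus: 1% science, after adding
# lots of cheap non-science data that the model fits better with scale.
early = overall_loss(0.10, loss_science=2.0, loss_other=3.0)  # 2.9
later = overall_loss(0.01, loss_science=2.0, loss_other=2.5)  # 2.495
```

Overall loss drops from 2.9 to 2.495 with no change at all on the science slice, which is exactly the kind of blind spot I'm asking about.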

Comment by Wei Dai (Wei_Dai) on Eric Neyman's Shortform · 2024-04-27T06:38:14.906Z · LW · GW

Thank you for detailing your thoughts. Some differences for me:

  1. I'm also worried about unaligned AIs as a competitor to aligned AIs/civilizations in the acausal economy/society. For example, suppose there are vulnerable AIs "out there" that can be manipulated/taken over via acausal means, unaligned AI could compete with us (and with others with better values from our perspective) in the race to manipulate them.
  2. I'm perhaps less optimistic than you about commitment races.
  3. I have some credence on max good and max bad being not close to balanced, that additionally pushes me towards the "unaligned AI is bad" direction.

ETA: Here's a more detailed argument for 1, that I don't think I've written down before. Our universe is small enough that it seems plausible (maybe even likely) that most of the value or disvalue created by a human-descended civilization comes from its acausal influence on the rest of the multiverse. An aligned AI/civilization would likely influence the rest of the multiverse in a positive direction, whereas an unaligned AI/civilization would probably influence the rest of the multiverse in a negative direction. This effect may outweigh what happens in our own universe/lightcone so much that the positive value from unaligned AI doing valuable things in our universe as a result of acausal trade is totally swamped by the disvalue created by its negative acausal influence.

Comment by Wei Dai (Wei_Dai) on Eric Neyman's Shortform · 2024-04-26T11:05:36.839Z · LW · GW

Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable.

Why do you think these values are positive? I've been pointing out, and I see that Daniel Kokotajlo also pointed out in 2018 that these values could well be negative. I'm very uncertain but my own best guess is that the expected value of misaligned AI controlling the universe is negative, in part because I put some weight on suffering-focused ethics.

Comment by Wei Dai (Wei_Dai) on LLMs seem (relatively) safe · 2024-04-26T10:29:53.346Z · LW · GW

If something is both a vanguard and limited, then it seemingly can't stay a vanguard for long. I see a few different scenarios going forward:

  1. We pause AI development while LLMs are still the vanguard.
  2. The data limitation is overcome with something like IDA or Debate.
  3. LLMs are overtaken by another AI technology, perhaps based on RL.

In terms of relative safety, it's probably 1 > 2 > 3. Given that 2 might not happen in time, might not be safe if it does, or might still be ultimately outcompeted by something else like RL, I'm not getting very optimistic about AI safety just yet.

Comment by Wei Dai (Wei_Dai) on AI Regulation is Unsafe · 2024-04-25T04:02:04.971Z · LW · GW

The argument is that with 1970′s tech the soviet union collapsed, however with 2020 computer tech (not needing GenAI) it would not.

I note that China is still doing market economics, and nobody is trying (or even advocating, AFAIK) some very ambitious centrally planned economy using modern computers, so this seems like pure speculation? Has someone actually made a detailed argument about this, or at least has the agreement of some people with reasonable economics intuitions?

Comment by Wei Dai (Wei_Dai) on AI Regulation is Unsafe · 2024-04-25T03:54:39.752Z · LW · GW

I've arguably lived under totalitarianism (depending on how you define it), and my parents definitely have and told me many stories about it. I think AGI increases risk of totalitarianism, and support a pause in part to have more time to figure out how to make the AI transition go well in that regard.

Comment by Wei Dai (Wei_Dai) on Examples of Highly Counterfactual Discoveries? · 2024-04-25T03:17:38.640Z · LW · GW

Even if someone made a discovery decades earlier than it otherwise would have been, the long term consequences of that may be small or unpredictable. If your goal is to "achieve high counterfactual impact in your own research" (presumably predictably positive ones) you could potentially do that in certain fields (e.g., AI safety) even if you only counterfactually advance the science by a few months or years. I'm a bit confused why you're asking people to think in the direction outlined in the OP.

Comment by Wei Dai (Wei_Dai) on Changes in College Admissions · 2024-04-25T02:55:47.168Z · LW · GW

Some of my considerations for college choice for my kid, that I suspect others may also want to think more about or discuss:

  1. status/signaling benefits for the parents (This is probably a major consideration for many parents to push their kids into elite schools. How much do you endorse it?)
  2. sex ratio at the school and its effect on the local "dating culture"
  3. political/ideological indoctrination by professors/peers
  4. workload (having more/less time/energy to pursue one's own interests)

Comment by Wei Dai (Wei_Dai) on Richard Ngo's Shortform · 2024-04-25T01:12:07.099Z · LW · GW

I added this to my comment just before I saw your reply: Maybe it changes moment by moment as we consider different decisions, or something like that? But what about when we're just contemplating a philosophical problem and not trying to make any specific decisions?

I mostly offer this in the spirit of "here's the only way I can see to reconcile subjective anticipation with UDT at all", not "here's something which makes any sense mechanistically or which I can justify on intuitive grounds".

Ah I see. I think this is incomplete even for that purpose, because "subjective anticipation" to me also includes "I currently see X, what should I expect to see in the future?" and not just "What should I expect to see, unconditionally?" (See the link earlier about UDASSA not dealing with subjective anticipation.)

ETA: Currently I'm basically thinking: use UDT for making decisions, use UDASSA for unconditional subjective anticipation, am confused about conditional subjective anticipation as well as how UDT and UDASSA are disconnected from each other (i.e., the subjective anticipation from UDASSA not feeding into decision making). Would love to improve upon this, but your idea currently feels worse than this...

Comment by Wei Dai (Wei_Dai) on Changes in College Admissions · 2024-04-25T00:51:05.008Z · LW · GW

As you would expect, I strongly favor (1) over (2) over (3), with (3) being far, far worse for ‘eating your whole childhood’ reasons.

Is this actually true? China has (1) (affirmative action via "Express and objective (i.e., points and quotas)") for its minorities and different regions and FWICT the college admissions "eating your whole childhood" problem over there is way worse. Of course that could be despite (1) not because of it, but does make me question whether (3) ("Implied and subjective ('we look at the whole person').") is actually far worse than (1) for this.

Comment by Wei Dai (Wei_Dai) on Richard Ngo's Shortform · 2024-04-25T00:27:58.228Z · LW · GW

Intuitively this feels super weird and unjustified, but it does make the "prediction" that we'd find ourselves in a place with high marginal utility of money, as we currently do.

This is particularly weird because your indexical probability then depends on what kind of bet you're offered. In other words, our marginal utility of money differs from our marginal utility of other things, and which one do you use to set your indexical probability? So this seems like a non-starter to me... (ETA: Maybe it changes moment by moment as we consider different decisions, or something like that? But what about when we're just contemplating a philosophical problem and not trying to make any specific decisions?)
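A toy calculation of what I mean (illustrative formula and numbers of my own, not anything from the thread): if we read an "indexical probability" off the betting odds an updateless agent accepts, those odds depend on the marginal utility of the stake in each world, so different stakes imply different probabilities.

```python
def implied_probability(prior_a, mu_a, mu_b):
    """Probability of being in world A implied by the betting odds an
    updateless agent accepts, when the stake has marginal utility mu_a
    in world A and mu_b in world B."""
    return prior_a * mu_a / (prior_a * mu_a + (1 - prior_a) * mu_b)

# Same 50/50 prior, two different stakes: money (worth 10x more at the
# margin in world A) vs. leisure time (equally valuable in both worlds).
p_money = implied_probability(0.5, mu_a=10.0, mu_b=1.0)  # ~0.91
p_time  = implied_probability(0.5, mu_a=1.0, mu_b=1.0)   # 0.5
```

The same agent with the same prior "believes" it is in world A with probability ~0.91 when bets are denominated in money and 0.5 when denominated in time, which is why treating marginal utility as indexical probability seems like a non-starter to me.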

By "acausal games" do you mean a generalization of acausal trade?

Yes, didn't want to just say "acausal trade" in case threats/war is also a big thing.

Comment by Wei Dai (Wei_Dai) on Richard Ngo's Shortform · 2024-04-24T23:50:51.372Z · LW · GW

This was all kinda rambly but I think I can summarize it as "Isn't it weird that ADT tells us that we should act as if we'll end up in unusually important places, and also we do seem to be in an incredibly unusually important place in the universe? I don't have a story for why these things are related but it does seem like a suspicious coincidence."

I'm not sure this is a valid interpretation of ADT. Can you say more about why you interpret ADT this way, maybe with an example? My own interpretation of how UDT deals with anthropics (and I'm assuming ADT is similar) is "Don't think about indexical probabilities or subjective anticipation. Just think about measures of things you (considered as an algorithm with certain inputs) have influence over."

This seems to "work" but anthropics still feels mysterious, i.e., we want an explanation of "why are we who we are / where we're at" and it's unsatisfying to "just don't think about it". UDASSA does give an explanation of that (but is also unsatisfying because it doesn't deal with anticipations, and also is disconnected from decision theory).

I would say that under UDASSA, it's perhaps not super surprising to be when/where we are, because this seems likely to be a highly simulated time/scenario for a number of reasons (curiosity about ancestors, acausal games, getting philosophical ideas from other civilizations).

Comment by Wei Dai (Wei_Dai) on Rejecting Television · 2024-04-24T02:52:16.867Z · LW · GW

It occurs to me that many alternatives you mention are also superstimuli:

  • Reading a book
    • Pretty unlikely or rare to encounter stories or ideas with this much information content or entertainment value in the ancestral environment.
    • Some people do get addicted to books, e.g., romance novels.
  • Extroversion / talking to attractive people
    • We have access to more people, including more attractive people, but talking to anyone is less likely to lead to anything consequential because of birth control and because they also have way more choices.
    • Sex addiction. People who party all the time.
  • Creativity
    • We have the time and opportunity to do a lot more things that feel "creative" or "meaningful" to us, but these activities have less real-world significance than such feelings might suggest because other people have way more creative products/personalities to choose from.
    • Struggling artists/entertainers who refuse to give up their passions. Obscure hobbies.

Not sure if there are exceptions or not, but it seems like everything we could do for fun these days is some kind of supernormal stimulus, or the "fun" isn't much related to the original evolutionary purpose anymore. This includes e.g. forum participation. So far I haven't tried to make great efforts to quit anything, and instead have just eventually gotten bored of certain things I used to be "addicted" to (e.g., CRPGs, micro-optimizing crypto code). (This is not meant to be advice for other people. Also the overall issue of superstimuli/addiction is perhaps more worrying to me than this comment might suggest.)

Comment by Wei Dai (Wei_Dai) on Security amplification · 2024-04-22T03:10:15.019Z · LW · GW

Does anyone know why security amplification and meta-execution are rarely talked about these days? I did a search on LW and found just 1 passing reference to either phrase in the last 3 years. Is the problem not considered an important problem anymore? The problem is too hard and no one has new ideas? There are too many alignment problems/approaches to work on and not enough researchers?

Comment by Wei Dai (Wei_Dai) on When is a mind me? · 2024-04-18T11:18:21.104Z · LW · GW

If you think there’s something mysterious or unknown about what happens when you make two copies of yourself

Eliezer talked about some puzzles related to copying and anticipation in The Anthropic Trilemma that still seem quite mysterious to me. See also my comment on that post.