Posts

UDT shows that decision theory is more puzzling than ever 2023-09-13T12:26:09.739Z
Meta Questions about Metaphilosophy 2023-09-01T01:17:57.578Z
Why doesn't China (or didn't anyone) encourage/mandate elastomeric respirators to control COVID? 2022-09-17T03:07:39.080Z
How to bet against civilizational adequacy? 2022-08-12T23:33:56.173Z
AI ethics vs AI alignment 2022-07-26T13:08:48.609Z
A broad basin of attraction around human values? 2022-04-12T05:15:14.664Z
Morality is Scary 2021-12-02T06:35:06.736Z
(USA) N95 masks are available on Amazon 2021-01-18T10:37:40.296Z
Anti-EMH Evidence (and a plea for help) 2020-12-05T18:29:31.772Z
A tale from Communist China 2020-10-18T17:37:42.228Z
Everything I Know About Elite America I Learned From ‘Fresh Prince’ and ‘West Wing’ 2020-10-11T18:07:52.623Z
Tips/tricks/notes on optimizing investments 2020-05-06T23:21:53.153Z
Have epistemic conditions always been this bad? 2020-01-25T04:42:52.190Z
Against Premature Abstraction of Political Issues 2019-12-18T20:19:53.909Z
What determines the balance between intelligence signaling and virtue signaling? 2019-12-09T00:11:37.662Z
Ways that China is surpassing the US 2019-11-04T09:45:53.881Z
List of resolved confusions about IDA 2019-09-30T20:03:10.506Z
Don't depend on others to ask for explanations 2019-09-18T19:12:56.145Z
Counterfactual Oracles = online supervised learning with random selection of training episodes 2019-09-10T08:29:08.143Z
AI Safety "Success Stories" 2019-09-07T02:54:15.003Z
Six AI Risk/Strategy Ideas 2019-08-27T00:40:38.672Z
Problems in AI Alignment that philosophers could potentially contribute to 2019-08-17T17:38:31.757Z
Forum participation as a research strategy 2019-07-30T18:09:48.524Z
On the purposes of decision theory research 2019-07-25T07:18:06.552Z
AGI will drastically increase economies of scale 2019-06-07T23:17:38.694Z
How to find a lost phone with dead battery, using Google Location History Takeout 2019-05-30T04:56:28.666Z
Where are people thinking and talking about global coordination for AI safety? 2019-05-22T06:24:02.425Z
"UDT2" and "against UD+ASSA" 2019-05-12T04:18:37.158Z
Disincentives for participating on LW/AF 2019-05-10T19:46:36.010Z
Strategic implications of AIs' ability to coordinate at low cost, for example by merging 2019-04-25T05:08:21.736Z
Please use real names, especially for Alignment Forum? 2019-03-29T02:54:20.812Z
The Main Sources of AI Risk? 2019-03-21T18:28:33.068Z
What's wrong with these analogies for understanding Informed Oversight and IDA? 2019-03-20T09:11:33.613Z
Three ways that "Sufficiently optimized agents appear coherent" can be false 2019-03-05T21:52:35.462Z
Why didn't Agoric Computing become popular? 2019-02-16T06:19:56.121Z
Some disjunctive reasons for urgency on AI risk 2019-02-15T20:43:17.340Z
Some Thoughts on Metaphilosophy 2019-02-10T00:28:29.482Z
The Argument from Philosophical Difficulty 2019-02-10T00:28:07.472Z
Why is so much discussion happening in private Google Docs? 2019-01-12T02:19:19.332Z
Two More Decision Theory Problems for Humans 2019-01-04T09:00:33.436Z
Two Neglected Problems in Human-AI Safety 2018-12-16T22:13:29.196Z
Three AI Safety Related Ideas 2018-12-13T21:32:25.415Z
Counterintuitive Comparative Advantage 2018-11-28T20:33:30.023Z
A general model of safety-oriented AI development 2018-06-11T21:00:02.670Z
Beyond Astronomical Waste 2018-06-07T21:04:44.630Z
Can corrigibility be learned safely? 2018-04-01T23:07:46.625Z
Multiplicity of "enlightenment" states and contemplative practices 2018-03-12T08:15:48.709Z
Online discussion is better than pre-publication peer review 2017-09-05T13:25:15.331Z
Examples of Superintelligence Risk (by Jeff Kaufman) 2017-07-15T16:03:58.336Z
Combining Prediction Technologies to Help Moderate Discussions 2016-12-08T00:19:35.854Z

Comments

Comment by Wei Dai (Wei_Dai) on Book Review: 1948 by Benny Morris · 2023-12-04T17:23:48.382Z · LW · GW

Israel was prepared to accept a peaceful partition of the land

There are sources saying that Israeli leaders were publicly willing to accept partition, but privately not. I'm uncertain how true this is, but haven't been able to find any push back on it. (I brought it up in a FB discussion and the pro-Israeli participants just ignored it.)

https://www.salon.com/2015/11/30/u_n_voted_to_partition_palestine_68_years_ago_in_an_unfair_plan_made_even_worse_by_israels_ethnic_cleansing/

Leading Israeli historian Benny Morris documented the intention of Israel's founding fathers to occupy all of the land. Ben-Gurion and Israel's first president, Chaim Weizmann, lobbied strongly for the Partition Plan at the time of the vote 68 years ago. Both leaders, Morris wrote, "saw partition as a stepping stone to further expansion and the eventual takeover of the whole of Palestine."

https://en.wikipedia.org/wiki/1937_Ben-Gurion_letter

Does the establishment of a Jewish state [in only part of Palestine] advance or retard the conversion of this country into a Jewish country? My assumption (which is why I am a fervent proponent of a state, even though it is now linked to partition) is that a Jewish state on only part of the land is not the end but the beginning.... This is because this increase in possession is of consequence not only in itself, but because through it we increase our strength, and every increase in strength helps in the possession of the land as a whole. The establishment of a state, even if only on a portion of the land, is the maximal reinforcement of our strength at the present time and a powerful boost to our historical endeavors to liberate the entire country.

Comment by Wei Dai (Wei_Dai) on OpenAI: Altman Returns · 2023-12-01T23:04:46.682Z · LW · GW

Just saw a poll result that's consistent with this.

Comment by Wei Dai (Wei_Dai) on My techno-optimism [By Vitalik Buterin] · 2023-11-29T17:12:27.047Z · LW · GW

I didn't downvote/disagree-vote Ben's comment, but it doesn't unite the people who think that accelerating development of certain technologies isn't enough to (sufficiently) prevent doom, and that we also need to slow down or pause development of certain other technologies.

Comment by Wei Dai (Wei_Dai) on My techno-optimism [By Vitalik Buterin] · 2023-11-28T20:34:33.319Z · LW · GW

Crossposted from X (I'm experimenting with participating more there.)

This is speaking my language, but I worry that AI may inherently disfavor defense (in at least one area), decentralization, and democracy, and may differentially accelerate the wrong intellectual fields, and humans pushing against that may not be enough. Some explanations below.

"There is an apparent asymmetry between attack and defense in this arena, because manipulating a human is a straightforward optimization problem [...] but teaching or programming an AI to help defend against such manipulation seems much harder [...]" https://www.lesswrong.com/posts/HTgakSs6JpnogD6c2/two-neglected-problems-in-human-ai-safety

"another way for AGIs to greatly reduce coordination costs in an economy is by having each AGI or copies of each AGI profitably take over much larger chunks of the economy" https://www.lesswrong.com/posts/Sn5NiiD5WBi4dLzaB/agi-will-drastically-increase-economies-of-scale

Lastly, I worry that AI will slow down progress in philosophy/wisdom relative to science and technology, because we have easy access to ground truths in the latter fields, which we can use to train AI, but not the former, making it harder to deal with new social/ethical problems.

Comment by Wei Dai (Wei_Dai) on Sam Altman fired from OpenAI · 2023-11-18T01:42:16.750Z · LW · GW

Came across this account via a random lawyer I'm following on Twitter (for investment purposes), who commented, "Huge L for the e/acc nerds tonight". Crazy times...

Comment by Wei Dai (Wei_Dai) on Sam Altman fired from OpenAI · 2023-11-18T01:35:53.883Z · LW · GW

https://twitter.com/karaswisher/status/1725678898388553901 Kara Swisher @karaswisher

Sources tell me that the profit direction of the company under Altman and the speed of development, which could be seen as too risky, and the nonprofit side dedicated to more safety and caution were at odds. One person on the Sam side called it a “coup,” while another said it was the right move.

Comment by Wei Dai (Wei_Dai) on Value systematization: how values become coherent (and misaligned) · 2023-10-28T21:24:10.605Z · LW · GW

Similarly, suppose you have two deontological values which trade off against each other. Before systematization, the question of “what’s the right way to handle cases where they conflict” is not really well-defined; you have no procedure for doing so.

Why is this a problem that calls out to be fixed (hence leading to systematization)? Why not just stick with the default of "go with whichever value/preference/intuition feels stronger in the moment"? People do that unthinkingly all the time, right? (I have my own thoughts on this, but curious if you agree with me or what your own thinking is.)

And that’s why the “mind itself wants to do this” does make sense, because it’s reasonable to assume that highly capable cognitive architectures will have ways of identifying aspects of their thinking that “don’t make sense” and correcting them.

How would you cash out "don't make sense" here?

Comment by Wei Dai (Wei_Dai) on Value systematization: how values become coherent (and misaligned) · 2023-10-28T04:12:22.444Z · LW · GW

which makes sense, since in some sense systematizing from our own welfare to others’ welfare is the whole foundation of morality

This seems wrong to me. I think concern for others' welfare comes from being directly taught/trained as a child to have concern for others, and then later reinforced by social rewards/punishments as one succeeds or fails at various social games. This situation could have come about without anyone "systematizing from our own welfare", just by cultural (and/or genetic) variation and selection. I think value systematizing more plausibly comes into play with things like widening one's circle of concern beyond one's family/tribe/community.

What you're trying to explain with this statement, i.e., "Morality seems like the domain where humans have the strongest instinct to systematize our preferences" seems better explained by what I wrote in this comment.

Comment by Wei Dai (Wei_Dai) on Value systematization: how values become coherent (and misaligned) · 2023-10-28T02:34:55.988Z · LW · GW

This reminds me that I have an old post asking Why Do We Engage in Moral Simplification? (What I called "moral simplification" seems very similar to what you call "value systematization".) I guess my post didn't really fully answer this question, and you don't seem to talk much about the "why" either.

Here are some ideas after thinking about it for a while. (Morality is Scary is useful background here, if you haven't read it already.)

  1. Wanting to use explicit reasoning with our values (e.g., to make decisions), which requires making our values explicit, i.e., defining them symbolically, which necessitates simplification given limitations of human symbolic reasoning.
  2. Moral philosophy as a status game, where moral philosophers are implicitly scored on the moral theories they come up with by simplicity and by how many human moral intuitions they are consistent with.
  3. Everyday signaling games, where people (in part) compete to show that they have community-approved or locally popular values. Making values legible and not too complex facilitates playing these games.
  4. Instinctively transferring our intuitions/preferences for simplicity from "belief systematization" where they work really well, into a different domain (values) where they may or may not still make sense.

(Not sure how any of this applies to AI. Will have to think more about that.)

Comment by Wei Dai (Wei_Dai) on Thoughts on responsible scaling policies and regulation · 2023-10-26T21:15:49.250Z · LW · GW

if the biggest impact is the very slow thinking from a very small group of people who care about them, then I think that’s a very small impact

I guess from my perspective, the biggest impact is the possibility that the idea of better preparing for these risks becomes a lot more popular. An analogy with Bitcoin comes to mind, where the idea of cryptography-based distributed money languished for many years, known only to a tiny community, and then was suddenly everywhere. An AI pause would provide more time for something like that to happen. And if the idea of better preparing for these risks was actually a good one (as you seem to think), there's no reason why it couldn't (or would be very unlikely to) spread beyond a very small group, do you agree?

Comment by Wei Dai (Wei_Dai) on Thoughts on responsible scaling policies and regulation · 2023-10-26T06:33:44.588Z · LW · GW

In My views on “doom” you wrote:

Probability of messing it up in some other way during a period of accelerated technological change (e.g. driving ourselves crazy, creating a permanent dystopia, making unwise commitments…): 15%

Do you think these risks can also be reduced by 10x by a "very good RSP"? If yes, how or by what kinds of policies? If not, isn't "cut risk dramatically [...] perhaps a 10x reduction" kind of misleading?

It concerns me that none of the RSP documents or discussions I've seen talked about these particular risks, or "unknown unknowns" (other risks that we haven't thought of yet).

I'm also bummed that "AI pause" people don't talk about these risks either, but at least an AI pause would implicitly address these risks by default, whereas RSPs would not.

Comment by Wei Dai (Wei_Dai) on Fertility Roundup #2 · 2023-10-17T21:17:35.737Z · LW · GW

Tangentially, I just saw a Chinese comedy-drama film (Xueba 学爸) about Chinese parents trying to get their kids (kindergarteners) into "good" schools, and all the sacrifices/difficulties/frustrations they go through. While watching it, I was thinking "Who is going to want to become a parent after seeing this? Whoever approved this film (as every film in China has to be pre-approved by government censors), don't they know that the Chinese government is trying to raise birth rates?" I guess from their perspective, the lesson of the movie is that it's not worth it to push your kids that hard to try to get into good schools (which is another thing the Chinese government is trying to persuade parents of).

I agree with Robin Hanson that signaling must be a big part of the problem, and think it's striking how bad we are at understanding and manipulating signaling dynamics. Even Robin doesn't seem to explain why there's now so much more signaling via parenting, to the extent of causing a lot of people to not want to become parents at all. And I don't see any proposals for raising fertility that are explicitly trying to defuse the signaling spiral. (Are there any?)

Comment by Wei Dai (Wei_Dai) on RSPs are pauses done right · 2023-10-14T19:26:53.226Z · LW · GW

To be clear, I definitely agree with this. My position is not “RSPs are all we need”, “pauses are bad”, “pause advocacy is bad”, etc.—my position is that getting good RSPs is an effective way to implement a pause: i.e. “RSPs are pauses done right.”

Some feedback on this: my expectation upon seeing your title was that you would argue, or that you implicitly believe, that RSPs are better than other current "pause" attempts/policies/ideas. I think this expectation came from the common usage of the phrase "done right" to mean that other people are doing it wrong or at least doing it suboptimally.

Comment by Wei Dai (Wei_Dai) on Anti-EMH Evidence (and a plea for help) · 2023-09-27T21:58:26.518Z · LW · GW

Another piece of evidence against EMH: Coal commodity spot and futures prices have been moving up for several months, with coal stock prices naturally following, but today one analyst raised his price targets on several met coal stocks based on higher expected coal prices, and almost every US coal stock rose another 3-10%. (See BTU, CEIX, ARCH, AMR, HCC.) But there was no new private information released or any change in fundamentals compared to yesterday (futures markets are essentially flat). It's just a pure change in valuation.

Even more damningly, CEIX is up 3.5% (was up 5% intraday) even though it was not upgraded by this analyst, due to the fact that it mines thermal coal and the upgrades were based on higher met coal prices.

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-16T16:41:34.362Z · LW · GW

Thanks, I've set a reminder to attend your talk. In case I miss it, can you please record it and post a link here?

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-15T20:42:27.273Z · LW · GW

But, the gist of your post seems to be: "Since coming up with UDT, we ran into these problems, made no progress, and are apparently at a dead end. Therefore, UDT might have been the wrong turn entirely."

This is a bit stronger than how I would phrase it, but basically yes.

On the other hand, my view is: Since coming up with those problems, we made a lot of progress on agent theory within the LTA

I tend to be pretty skeptical of new ideas. (This backfired spectacularly once, when I didn't pay much attention to Satoshi when he contacted me about Bitcoin, but I think in general it has served me well.) My experience with philosophical questions is that even when some approach looks a stone's throw away from a final solution to some problem, a bunch of new problems pop up and show that we're still quite far away. With an approach that is still as early as yours, I just think there's quite a good chance it doesn't work out in the end, or gets stuck somewhere on a hard problem. (Also some people who have dug into the details don't seem as optimistic that it is the right approach.) So I'm reluctant to decrease my probability of "UDT was a wrong turn" too much based on it.

The rest of your discussion about 2TDT-1CDT seems plausible to me, although of course depends on whether the math works out, doing something about monotonicity, and also a solution to the problem of how to choose one's IBH prior. (If the solution was something like "it's subjective/arbitrary" that would be pretty unsatisfying from my perspective.)

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-15T20:10:43.431Z · LW · GW

See this comment and the post that it's replying to.

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-14T23:06:25.079Z · LW · GW

Do you think part of it might be that even people with graduate philosophy educations are too prone to being wedded to their own ideas, or don't like to poke holes in them as much as they should? Because part of what contributes to my wanting to go more meta is being dissatisfied with my own object-level solutions and finding more and more open problems that I don't know how to solve. I haven't read much academic philosophy literature, but did read some anthropic reasoning and decision theory literature earlier, and the impression I got is that most of the authors weren't trying that hard to poke holes in their own ideas.

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T22:23:13.337Z · LW · GW

I don't understand your ideas in detail (am interested but don't have the time/ability/inclination to dig into the mathematical details), but from the informal writeups/reviews/critiques I've seen of your overall approach, as well as my sense from reading this comment of how far away you are from a full solution to the problems I listed in the OP, I'm still comfortable sticking with "most are wide open". :)

On the object level, maybe we can just focus on Problem 4 for now. What do you think actually happens in a 2IBH-1CDT game? Presumably CDT still plays D, and what do the IBH agents do? And how does that imply that the puzzle is resolved?

As a reminder, the puzzle I see is that this problem shows that a CDT agent doesn't necessarily want to become more UDT-like, and for seemingly good reason, so on what basis can we say that UDT is a clear advancement in decision theory? If CDT agents similarly don't want to become more IBH-like, isn't there the same puzzle? (Or do they?) This seems different from the playing chicken with a rock example, because a rock is not a decision theory so that example doesn't seem to offer the same puzzle.

ETA: Oh, I think you're saying that the CDT agent could turn into an IBH agent but with a different prior from the other IBH agents, that ends up allowing it to still play D while the other two still play C, so it's not made worse off by switching to IBH. Can you walk this through in more detail? How does the CDT agent choose what prior to use when switching to IBH, and how do the different priors actually imply a CCD outcome in the end?

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T21:46:12.774Z · LW · GW

I think I kind of get what you're saying, but it doesn't seem right to model TDT as caring about all other TDT agents, as they would exploit other TDT agents if they could do so without negative consequences to themselves, e.g., if a TDT AI were in a one-shot game where it could unilaterally decide whether or not to attack and take over another TDT AI.

Maybe you could argue that the TDT agent would refrain from doing this because of considerations like its decision to attack being correlated with other AIs' decisions to potentially attack it in other situations/universes, but that's still not the same as caring for other TDT agents. I mean the chain of reasoning/computation you would go through in the two cases seem very different.

Also it's not clear to me what implications your idea has even if it was correct, like what does it suggest about what the right decision theory is?

BTW do you have any thoughts on Vanessa Kosoy's decision theory ideas?

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T13:22:49.384Z · LW · GW

I'm not aware of good reasons to think that it's wrong; it's more that I'm just not sure it's the right approach. I mean we can say that it's a matter of preferences, problem solved, but unless we can also show that we should be anti-realist about these preferences, or show what the right preferences are, the problem isn't really solved. Until we do have a definitive full solution, it seems hard to be confident that any particular approach is the right one.

It seems plausible that treating anthropic reasoning as a matter of preferences makes it harder to fully solve the problem. I wrote "In general, Updateless Decision Theory converts anthropic reasoning problems into ethical problems." in the linked post, but we don't have a great track record of solving ethical problems...

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T12:08:52.159Z · LW · GW

Even items 1, 3, 4, and 6 are covered by your research agenda? If so, can you quickly sketch what you expect the solutions to look like?

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T04:08:09.837Z · LW · GW

The general hope is that slight differences in source code (or even large differences, as long as they're all using UDT or something close to it) wouldn't be enough to make a UDT agent defect against another UDT agent (i.e. the logical correlation between their decisions would be high enough), otherwise "UDT agents cooperate with each other in one-shot PD" would be false or not have many practical implications, since why would all UDT agents have the exact same source code?
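
To make the "high enough logical correlation" condition concrete, here is a toy expected-utility calculation of the kind a UDT-ish agent might run, treating logical correlation as a single probability p that the other agent ends up making the same choice (a minimal sketch; the standard PD payoff numbers and the reduction of logical correlation to one probability are simplifying assumptions on my part):

```python
# One-shot Prisoner's Dilemma with "logical correlation" modeled crudely as a
# probability p that the other agent's choice mirrors mine.
# Standard PD payoffs (assumed): T=5 (I defect, they cooperate), R=3 (both
# cooperate), P=1 (both defect), S=0 (I cooperate, they defect).
T, R, P, S = 5, 3, 1, 0

def eu_cooperate(p):
    # With probability p the other agent mirrors me (C,C); otherwise they defect.
    return p * R + (1 - p) * S

def eu_defect(p):
    # With probability p the other agent mirrors me (D,D); otherwise they cooperate.
    return p * P + (1 - p) * T

def crossover():
    # Cooperation has higher EU when p*R + (1-p)*S > p*P + (1-p)*T,
    # i.e. when p > (T - S) / ((T - S) + (R - P)).
    return (T - S) / ((T - S) + (R - P))

print(crossover())                         # ~0.714: above this, cooperating wins
print(eu_cooperate(0.9), eu_defect(0.9))   # roughly 2.7 vs 1.4
```

So on this toy model, differences in source code only matter to the extent that they push the correlation below that threshold.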

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T03:11:57.625Z · LW · GW

I'm not sure why you say "if the UDT agents could change their own code (silently) cooperation would immediately break down", because in my view a UDT agent would reason that if it changed its code (to something like CDT for example), that logically implies other UDT agents also changing their code to do the same thing, so the expected utility of changing its code would be evaluated as lower than not changing its code. So it would remain a UDT agent and still cooperate with other UDT agents, or whenever the probability of the other agent being UDT is high enough.

To me this example is about a CDT agent not wanting to become UDT-like if it found itself in a situation with many other UDT agents, which just seems puzzling if your previous perspective was that UDT is a clear advancement in decision theory and everyone should adopt UDT or become more UDT-like.

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T01:58:48.918Z · LW · GW

Nobody is being tricked though. Everyone knows there's a CDT agent among the population, just not who, and we can assume they have a correct amount of uncertainty about what the other agent's decision theory / source code is. The CDT agent still has an advantage in that case. And it is a problem because it means CDT agents don't always want to become more UDT-like (it seems like there are natural or at least not completely contrived situations, like Omega punishing UDT agents just for using UDT, where they don't), which takes away a major argument in its favor.

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T01:51:55.338Z · LW · GW

But the situation isn't symmetrical, meaning if you reversed the setup to have 2 CDT agents and 1 TDT agent, the TDT agent doesn't do better than the CDT agents, so it does seem like the puzzle has something to do with decision theory, and is not just about smaller vs larger groups? (Sorry, I may be missing your point.)

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T01:15:41.741Z · LW · GW

I feel like MIRI perhaps mispositioned FDT (their variant of UDT) as a clear advancement in decision theory

On second thought this is probably not fair to MIRI, since I don't think I objected to such positioning when they sent paper drafts for me to review. I guess in the early days UDT did look more like a clear advancement, because it seemed to elegantly solve several problems at once, namely anthropic reasoning (my original reason to start thinking in the "updateless" direction), counterfactual mugging, cooperation with psychological twin / same decision theory, Newcomb's problem, and it wasn't yet known that the open problems would remain open for so long.

Comment by Wei Dai (Wei_Dai) on Contra Heighn Contra Me Contra Functional Decision Theory · 2023-09-13T01:51:22.959Z · LW · GW

(Will be using "UDT" below but I think the same issue applies to all subsequent variants such as FDT that kept the "updateless" feature.)

I think this is a fair point. It's not the only difference between CDT and UDT but does seem to account for why many people find UDT counterintuitive. I made a similar point in this comment. I do disagree with "As such the debate over which is more “rational” mostly comes down to a semantic dispute." though. There are definitely some substantial issues here.

(A nit first: it's not that UDT must value all copies of oneself equally, but that it is incompatible with indexical values. You can have a UDT utility function that values different copies differently; it just has to be fixed for all time instead of changing based on what you observe.)

I think humans do seem to have indexical values, but what to do about it is a big open problem in decision theory. "Just use CDT" is unsatisfactory because as soon as someone could self-modify, they would have incentive to modify themselves to no longer use CDT (and no longer have indexical values). I'm not sure what further implications that has though. (See above linked post where I talked about this puzzle in a bit more detail.)

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-03T18:05:57.968Z · LW · GW

Thanks for this clear explanation of conceptual analysis. I've been wanting to ask some questions about this line of thought:

  1. Where do semantic intuitions come from?
  2. What should we do when different people have different such intuitions? For example you must know that Newcomb's problem is famously divisive, with roughly half of philosophers preferring one-boxing and half preferring two-boxing. Similarly for trolley thought experiments, intuitions about the nature of morality (metaethics), etc.
  3. How do we make sure that AI has the right intuitions? Maybe in some cases we can just have it learn from humans, but what about:
    1. Cases where humans disagree.
    2. Cases where all/most humans are wrong. (In other words, can we build AIs that have better intuitions than humans?) Or is that not a thing in conceptual analysis, i.e., semantic intuitions can't be wrong?
    3. Completely novel philosophical questions or situations where AI can't learn from humans (because humans don't have intuitions about it either, or AI has to make time sensitive decisions and humans are too slow).

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-02T19:21:27.671Z · LW · GW

I’m not sure I understand why it would be bad if it actually is a solution. If we do, great, p(doom) drops because now we are much closer to making aligned systems that can help us grow the economy, do science, stabilize society etc. Though of course this moves us into a “misuse risk” paradigm, which is also extremely dangerous.

I prefer to frame it as human-AI safety problems instead of "misuse risk", but the point is that if we're trying to buy time in part to have more time to solve misuse/human-safety (e.g. by improving coordination/epistemology or solving metaphilosophy), but the strategy for buying time only achieves a pause until alignment is solved, then the earlier alignment is solved, the less time we have to work on misuse/human-safety.

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-02T18:54:37.545Z · LW · GW

which is bottlenecked by us running out of time, hence why I think the pragmatic strategic choice is to try to buy us more time.

What are you proposing or planning to do to achieve this? I observe that most current attempts to "buy time" seem organized around convincing people that AI deception/takeover is a big risk and that we should pause or slow down AI development or deployment until that problem is solved, for example via intent alignment. But what happens if AI deception then gets solved relatively quickly (or someone comes up with a proposed solution that looks good enough to decision makers)? And this is another way that working on alignment could be harmful from my perspective...

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-02T14:29:04.549Z · LW · GW

@jessicata @Connor Leahy @Domenic @Daniel Kokotajlo @romeostevensit @Vanessa Kosoy @cousin_it @ShardPhoenix @Mitchell_Porter @Lukas_Gloor (and others, apparently I can only notify 10 people by mentioning them in a comment)

Sorry if I'm late in responding to your comments. This post has gotten more attention and replies than I expected, in many different directions, and it will probably take a while for me to process and reply to them all. (In the meantime, I'd love to see more people discuss each other's ideas here.)

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-02T06:50:53.113Z · LW · GW

Do you have any examples that could illustrate your theory?

It doesn't seem to fit my own experience. I became interested in Bayesian probability, universal prior, Tegmark multiverse, and anthropic reasoning during college, and started thinking about decision theory and ideas that ultimately led to UDT, but what heuristics could I have been applying, learned from what "domains with feedback"?

Maybe I used a heuristic like "computer science is cool, lets try to apply it to philosophical problems" but if the heuristics are this coarse grained, it doesn't seem like the idea can explain how detailed philosophical reasoning happens, or be used to ensure AI philosophical competence?

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-02T06:26:48.976Z · LW · GW

I expect at this moment in time me building a company is going to help me deconfuse a lot of things about philosophy more than me thinking about it really hard in isolation would

Hard for me to make sense of this. What philosophical questions do you think you'll get clarity on by doing this? What are some examples of people successfully doing this in the past?

It seems plausible that there is no such thing as “correct” metaphilosophy, and humans are just making up random stuff based on our priors and environment and that’s it and there is no “right way” to do philosophy, similar to how there are no “right preferences”.

Definitely a possibility (I've entertained it myself and maybe wrote some past comments along these lines). I wish there were more people studying this possibility.

I have short timelines and think we will be dead if we don’t make very rapid progress on extremely urgent practical problems like government regulation and AI safety. Metaphilosophy falls into the unfortunate bucket of “important, but not (as) urgent” in my view.

Everyone dying isn't the worst thing that could happen. I think from a selfish perspective, I'm personally a bit more scared of surviving into a dystopia powered by ASI that is aligned in some narrow technical sense. Less sure from an altruistic/impartial perspective, but it seems at least plausible that building an aligned AI without making sure that the future human-AI civilization is "safe" is not a good thing to do.

I would say that better philosophy/arguments around questions like this is a bottleneck. One reason for my interest in metaphilosophy that I didn't mention in the OP is that studying it seems least likely to cause harm or make things worse, compared to any other AI related topics I can work on. (I started thinking this as early as 2012.) Given how much harm people have done in the name of good, maybe we should all take "first do no harm" much more seriously?

There are no good institutions, norms, groups, funding etc to do this kind of work.

Which also represents an opportunity...

It’s weird. I happen to have a very deep interest in the topic, but it costs you weirdness points to push an idea like this when you could instead be advocating more efficiently for more pragmatic work.

Is it actually that weird? Do you have any stories of trying to talk about it with someone and having that backfire on you?

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-02T02:52:28.020Z · LW · GW

Philosophy is a social/intellectual process taking place in the world. If you understand the world, you understand how philosophy proceeds.

What if I'm mainly interested in how philosophical reasoning ideally ought to work? (Similar to how decision theory studies how decision making normatively should work, not how it actually works in people.) Of course if we have little idea how real-world philosophical reasoning works, understanding that first would probably help a lot, but that's not the ultimate goal, at least not for me, for both intellectual and AI reasons.

The latter because humans do a lot of bad philosophy and often can’t recognize good philosophy. (See popularity of two-boxing among professional philosophers.) I want a theory of ideal/normative philosophical reasoning so we can build AI that improves upon human philosophy, and in a way that convinces many people (because they believe the theory is right) to trust the AI's philosophical reasoning.

This leads to a view where philosophy is one of many types of discourse/understanding that each shape each other (a non-foundationalist view). This is perhaps disappointing if you wanted ultimate foundations in some simple framework.

Sure, ultimate foundations in some simple framework would be nice, but I'll take whatever I can get. How would you flesh out the non-foundationalist view?

Most thought is currently not foundationalist, but perhaps a foundational re-orientation could be found by understanding the current state of non-foundational thought.

I don't understand this sentence at all. Please explain more?

Comment by Wei Dai (Wei_Dai) on Forum participation as a research strategy · 2023-09-01T15:15:52.674Z · LW · GW

On a forum you can judge other people's opinions of your contributions by the karma (or the equivalent) of your posts, and by their comments. Of course there's a risk that people on some forum liking your posts might represent groupthink instead of genuine intellectual progress, but the same risk exists with academic peer review, and one simply has to keep this risk/uncertainty in mind.

Comment by Wei Dai (Wei_Dai) on Eliezer Yudkowsky Is Frequently, Confidently, Egregiously Wrong · 2023-09-01T10:01:00.356Z · LW · GW

FDT is a valuable idea in that it’s a stepping stone towards / approximation of UDT.

You might be thinking of TDT, which was invented prior to UDT. FDT actually came out after UDT. My understanding is that the OP disagrees with the entire TDT/UDT/FDT line of thinking, since they all one-box in Newcomb's problem and the OP thinks one should two-box.

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-01T09:42:24.902Z · LW · GW

If we keep stumbling into LLM type things which are competent at a surprisingly wide range of tasks, do you expect that they’ll be worse at philosophy than at other tasks?

I'm not sure but I do think it's very risky to depend on LLMs to be good at philosophy by default. Some of my thoughts on this:

  • Humans do a lot of bad philosophy and often can't recognize good philosophy. (See popularity of two-boxing among professional philosophers.) Even if an LLM has learned how to do good philosophy, how will users or AI developers know how to prompt it to elicit that capability (e.g., which philosophers to emulate)? (It's possible that even solving metaphilosophy doesn't help enough with this, if many people can't recognize the solution as correct, but there's at least a chance that the solution does look obviously correct to many people, especially if there aren't already wrong solutions to compete with).
  • What if it learns how to do good philosophy during pre-training, but RLHF trains that away in favor of optimizing arguments to look good to the user?
  • What if philosophy is just intrinsically hard for ML in general (I gave an argument for why ML might have trouble learning philosophy from humans in the section Replicate the trajectory with ML? of Some Thoughts on Metaphilosophy, but I'm not sure how strong it is) or maybe it's just some specific LLM architecture that has trouble with this, and we never figure this out because the AI is good at finding arguments that look good to humans?
  • Or maybe we do figure out that AI is worse at philosophy than other tasks, after it has been built, but it's too late to do anything with that knowledge (because who is going to tell the investors that they've lost their money because we don't want to differentially decelerate philosophical progress by deploying the AI).

Comment by Wei Dai (Wei_Dai) on A list of core AI safety problems and how I hope to solve them · 2023-08-30T22:28:53.454Z · LW · GW

If it’s not misuse, the provisions in 5.1.4-5 will steer the search process away from policies that attempt to propagandize to humans.

Ok I'll quote 5.1.4-5 to make it easier for others to follow this discussion:

5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).

5.1.5. The incentives under 5.1.4 by default are to take control over irrelevant variables so as to ensure that they proceed as in the anticipated “status quo”. Infrabayesian uncertainty about the dynamics is the final component that removes this incentive. In particular, the infrabayesian prior can (and should) have a high degree of Knightian uncertainty about human decisions and behaviour. This makes the most effective way to limit the maximum divergence (of human trajectories from the status quo) actually not interfering.

I'm not sure how these are intended to work. How do you intend to define/implement "divergence"? How does that definition/implementation, combined with a "high degree of Knightian uncertainty about human decisions and behaviour", actually cause the AI to "not interfere" but also still accomplish the goals that we give it?

In order to accomplish its goals, the AI has to do lots of things that will have butterfly effects on the future, so the system has to allow it to do those things, but also not allow it to "propagandize to humans". It's just unclear to me how you intend to achieve this.
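
To make these questions concrete, here is the kind of naive formalization of 5.1.4-5 that I can imagine (a minimal sketch; the function names, the form of the penalty, and the use of a plain max over a hypothesis set as a stand-in for infrabayesian/Knightian uncertainty are all my guesses, not something specified in the OAA post):

```python
# Toy score function: task reward minus a penalty on the worst-case divergence
# of the predicted world trajectory from the predicted status-quo trajectory,
# with the worst case taken over a set of hypotheses about human behaviour.

def score(plan, task_reward, world_models, divergence, status_quo_plan, lam=1.0):
    """All arguments are hypothetical stand-ins:
    - task_reward: maps a plan to how well it accomplishes the stated goal
    - world_models: hypotheses, each mapping a plan to a predicted trajectory
    - divergence: some metric between two trajectories
    - status_quo_plan: the null plan in which no powerful AI system acts
    """
    worst_divergence = max(
        divergence(model(plan), model(status_quo_plan)) for model in world_models
    )
    return task_reward(plan) - lam * worst_divergence
```

The questions above are then, concretely: what plays the role of divergence here, and why would taking the worst case over a rich enough set of world_models make "don't interfere with humans" the best way to keep the penalty small, without also making the task reward unattainable given the butterfly effects mentioned above?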

Comment by Wei Dai (Wei_Dai) on Anyone want to debate publicly about FDT? · 2023-08-30T20:38:03.309Z · LW · GW

Why the focus on repeated games? It seems like one of the central motivations for people to be interested in logical decision theories (TDT/UDT/FDT) is that they recommend one-boxing and playing C in PD (against someone with similar decision theory), even in non-repeated games, and you're not addressing that?

Comment by Wei Dai (Wei_Dai) on Anyone want to debate publicly about FDT? · 2023-08-29T12:57:55.167Z · LW · GW

You should take a look at this list of UDT open problems that Vladimir Slepnev wrote 13 years ago, where 2 and 3 are problems in which UDT/FDT seemingly make incorrect decisions, and 1 and 5 are definitely also serious open problems.

Comment by Wei Dai (Wei_Dai) on Another attempt to explain UDT · 2023-08-29T12:42:33.332Z · LW · GW

Answering your questions 13 years later because I want to cite cousin_it's list of open problems, and others may see your comment and wonder what the answers are. I'm not sure about 4 on his list but I think 1, 2, 3, and 5 are definitely still open problems.

How is this [1. 2TDT-1CDT] not resolved?

Consider the evolutionary version of this. Suppose there's a group of TDT (or UDT or FDT) agents and one of them got a random mutation that changed it into a CDT agent, and this was known to everyone (but not the identity of the CDT agent). If two randomly selected agents paired off to play true PD against each other, a TDT agent would play C (since they're still likely facing another TDT agent) and the CDT agent would play D. So the CDT agent would be better off, and it would want to be really careful not to become a TDT agent or delegate to a TDT AI or become accidentally correlated with TDT agents. This doesn't necessarily mean that TDT/UDT/FDT is wrong, but seems like a weird outcome, plus how do we know that we're not in a situation like the one that the CDT agent is in (i.e., should be very careful not to become/delegate to a TDT-like agent)?

Eliezer also ended up thinking this might be a real issue.
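
A toy expected-payoff calculation for this setup (a minimal sketch; the standard PD payoff numbers, and the simplification that the TDT agents just play C while the CDT mutant plays D, are assumptions layered on top of the argument above):

```python
# Population of n_tdt TDT agents plus one CDT mutant; two agents are drawn at
# random to play one-shot true PD. Per the argument above, each TDT agent plays
# C (its opponent is probably another TDT agent) and the CDT mutant plays D.
# Standard PD payoffs (assumed): T=5, R=3, P=1, S=0.
T, R, P, S = 5, 3, 1, 0

def expected_payoffs(n_tdt):
    # From a TDT agent's point of view, its opponent is one of the other
    # n_tdt - 1 TDT agents or the single CDT agent, uniformly at random.
    p_opp_tdt = (n_tdt - 1) / n_tdt
    tdt_payoff = p_opp_tdt * R + (1 - p_opp_tdt) * S  # C vs C, or C vs D
    cdt_payoff = T                                    # D vs C: opponent is always TDT
    return tdt_payoff, cdt_payoff

print(expected_payoffs(10))  # roughly (2.7, 5): the lone CDT agent does strictly better
```

With the roles reversed (many CDT agents, one TDT agent), the TDT agent faces known defectors and simply defects too, so it gains no analogous edge, which is the asymmetry discussed in the 2023-09-14T01:51 comment above.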

This [2. Agent simulates predictor] basically says that the predictor is a rock, doesn't depend on agent's decision, which makes the agent lose because of the way problem statement argues into stipulating (outside of predictor's own decision process) that this must be a two-boxing rock rather than a one-boxing rock.

No, we're not saying the predictor is a rock. We're assuming that the predictor is using some kind of reasoning process to make the prediction. Specifically, the predictor could reason as follows: The agent is using UDT1.1 (for example). UDT1.1 is not updateless with regard to logical facts. Given enough computing power (which the agent has), it will inevitably simulate me and then update on my prediction, after which it will view two-boxing as having higher expected utility (no matter what my prediction actually is). Therefore I should predict that it will two-box.
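
A worked version of that reasoning, using the standard Newcomb payoffs (the $1,000,000 / $1,000 box values are the usual assumed numbers, not specified above): once the agent has simulated the predictor and treats the prediction as a fixed fact, two-boxing wins by exactly the contents of the small box, whatever the prediction was.

```python
# Newcomb payoffs (assumed standard values): the opaque box contains $1,000,000
# iff the predictor predicted one-boxing; the transparent box always has $1,000.

def payoff(action, prediction):
    big = 1_000_000 if prediction == "one-box" else 0
    small = 1_000
    return big + small if action == "two-box" else big

# Conditioning on the prediction as a known fact, two-boxing wins by $1,000
# either way, which is why the (non-logically-updateless) agent two-boxes and
# why the predictor can confidently predict that it will.
for prediction in ("one-box", "two-box"):
    print(prediction, payoff("two-box", prediction) - payoff("one-box", prediction))
```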

[3. The stupid winner paradox] Same as (2). We stipulate the weak player to be a $9 rock. Nothing to be surprised about.

No, again the weak player is applying reasoning to decide to demand $9, similar to the reasoning of the predictor above. To spell it out: My opponent is not logically updateless. Whatever I decide, it will simulate me and update on my decision, after which it will play the best response against that. Therefore I should demand $9.

("Be logically updateless" is the seemingly obvious implication here, but how to do that without running into other issues is also an open problem.)

Comment by Wei Dai (Wei_Dai) on A list of core AI safety problems and how I hope to solve them · 2023-08-28T02:10:36.914Z · LW · GW

"AI-powered memetic warfare makes all humans effectively insane" a catastrophe that I listed in an earlier comment, which seems one of the hardest to formally specify. It seems values-complete or metaphilosophy-complete to me, since without having specified human values or having solved metaphilosophy, how can we check whether an AI-generated argument is trying to convince us of something that is wrong according to actual human values, or wrong according to normative philosophical reasoning?

I don't see anything in this post or the linked OAA post that addresses or tries to bypass this difficulty?

Comment by Wei Dai (Wei_Dai) on Update on Ought's experiments on factored evaluation of arguments · 2023-08-17T23:38:08.376Z · LW · GW

Is this the final update from Ought about their factored cognition experiments? (I can't seem to find anything more recent.) The reason I ask is that the experiments they reported on here do not seem very conclusive, and they talked about doing further experiments but then did not seem to give any more updates. Does anyone know the story of what happened, and what that implies about the viability of factored-cognition style alignment schemes?

Comment by Wei Dai (Wei_Dai) on AI #25: Inflection Point · 2023-08-17T23:30:12.678Z · LW · GW

Washington Post’s Parmy Olsen complains There’s Too Much Money Going to AI Doomers, opening with the argument that in the Industrial Revolution we shouldn’t have spent a lot of money ensuring the machines did not rise up against us, because in hindsight they did not do that.

I wonder what would happen if we "amplified" reasoning like this, as in HCH, IDA, Debate, etc.

Do we understand reasoning well enough to ensure that this class of errors can be avoided in AI alignment schemes that depend on human reasoning, or to ensure that this class of errors will be reliably self-corrected as the AI scales up?

Comment by Wei Dai (Wei_Dai) on If I Was An Eccentric Trillionaire · 2023-08-09T20:35:50.692Z · LW · GW

Metaphilosophy

I appreciate you sharing many of the same philosophical interests as me (and giving them a signal boost here), but for the sake of clarity / good terminology, I think all the topics you list under this section actually belong to object-level philosophy, not metaphilosophy.

I happen to think metaphilosophy is also extremely interesting/important, and you can see my latest thoughts on it at Some Thoughts on Metaphilosophy (which also links to earlier posts on the topic) if you're interested.

Comment by Wei Dai (Wei_Dai) on [Linkpost] Introducing Superalignment · 2023-08-02T10:08:19.188Z · LW · GW

goodness of HCH

What is the latest thinking/discussion about this? I tried to search LW/AF but haven't found a lot of discussions, especially positive arguments for HCH being good. Do you have any links or docs you can share?

How do you think about the general unreliability of human reasoning (for example, the majority of professional decision theorists apparently being two-boxers and favoring CDT, and general overconfidence of almost everyone on all kinds of topics, including morality and meta-ethics and other topics relevant for AI alignment) in relation to HCH? What are your guesses for how future historians would complete the following sentence? Despite human reasoning being apparently very unreliable, HCH was a good approximation target for AI because ...

instead relies on some claims about offense-defense between teams of weak agents and strong agents

I'm curious if you have an opinion on where the burden of proof lies when it comes to claims like these. I feel like in practice it's up to people like me to offer sufficiently convincing skeptical arguments if we want to stop AI labs from pursuing their plans (since we have little power to do anything else) but morally shouldn't the AI labs have much stronger theoretical foundations for their alignment approaches before e.g. trying to build a human-level alignment researcher in 4 years? (Because if the alignment approach doesn't work, we would either end up with an unaligned AGI or be very close to being able to build AGI but with no way to align it.)

Comment by Wei Dai (Wei_Dai) on A Hill of Validity in Defense of Meaning · 2023-07-17T11:03:40.071Z · LW · GW

Yudkowsky couldn’t be bothered to either live up to his own stated standards

"his own stated standards" could use a link/citation.

regardless of the initial intent, scrupulous rationalists were paying rent to something claiming moral authority, which had no concrete specific plan to do anything other than run out the clock, maintaining a facsimile of dialogue in ways well-calibrated to continue to generate revenue.

The original Kolmogorov complicity was an instance of lying to protect one's intellectual endeavors. But here you/Ben seem to be accusing Eliezer of doing something much worse, which seems like a big leap from what came before it in the post. How did you/Ben rule out the Kolmogorov complicity hypothesis (i.e., that Eliezer still had genuine intellectual or altruistic aims that he wanted to protect)?

Of what you wrote specifically, "no concrete specific plan" is in my view actually a point in Eliezer's favor, as it's a natural consequence of high alignment difficulty and intellectual honesty. "Run out the clock" hardly seems fair, and by "maintaining a facsimile of dialogue" what are you referring to? Are you including things like the 2021 MIRI Conversations and if so are you suggesting that all the other (non-MIRI) participants are being fooled or in on the scam?

But since I did spend my entire adult life in Yudkowsky’s robot cult, trusting him the way a Catholic trusts the Pope

I would be interested to read an account of how this happened, and what might have prevented the error.

Comment by Wei Dai (Wei_Dai) on The Commitment Races problem · 2023-07-14T23:27:52.772Z · LW · GW

But yeah also I think that AGIs will be by default way better than humans at this sort of stuff.

What's your reasons for thinking this? (Sorry if you already explained this and I missed your point, but it doesn't seem like you directly addressed my point that if AGIs learn from or defer to humans, they'll be roughly human-level at this stuff?)

When you say “the top tier of rational superintelligences exploits everyone else” I say that is analogous to “the most rational/clever/capable humans form an elite class which rules over and exploits the masses.” So I’m like yeah, kinda sorta I expect that to happen, but it’s typically not that bad?

I think it could be much worse than current exploitation, because technological constraints prevent current exploiters from extracting full value from the exploited (have to keep them alive for labor, can't make them too unhappy or they'll rebel, monitoring for and repressing rebellions is costly). But with superintelligence and future/acausal threats, an exploiter can bypass all these problems by demanding that the exploited build an AGI aligned to itself and let it take over directly.

Comment by Wei Dai (Wei_Dai) on The Commitment Races problem · 2023-07-14T20:58:14.268Z · LW · GW

I think that agents worthy of being called “rational” will probably handle all this stuff more gracefully/competently than humans do

Humans are kind of terrible at this, right? Many give in even to threats (bluffs) conjured up by dumb memeplexes and backed up by nothing (i.e., heaven/hell), popular films are full of heroes giving in to threats, an apparent majority of philosophers have 2-boxing intuitions (hence the popularity of CDT, which IIUC was invented specifically because some philosophers were unhappy with EDT choosing to 1-box), governments negotiate with terrorists pretty often, etc.

The sort of society AGIs construct will be at least as cooperatively-competent / good-at-coordinating-diverse-agents-with-diverse-agendas-and-beliefs as Dath Ilan.

If we build AGIs that learn from humans or defer to humans on this stuff, do we not get human-like (in)competence?[1][2] If humans are not atypical, couldn't large parts of the acausal society/economy be similarly incompetent? I imagine there could be a top tier of "rational" superintelligences, built by civilizations that were especially clever or wise or lucky, that cooperate with each other (and exploit everyone else who can be exploited), but I disagree with this second quoted statement, which seems overly optimistic to me. (At least for now; maybe your unstated reasons to be optimistic will end up convincing me.)


  1. I can see two ways to improve upon this: 1) AI safety people seem to have better intuitions (cf popularity of 1-boxing among alignment researchers) and maybe can influence the development of AGI in a better direction, e.g., to learn from / defer to humans with intuitions more like themselves. 2) We figure out metaphilosophy, which lets AGI figure out how to improve upon humans. (ETA: However, conditioning on there not being a simple and elegant solution to decision theory also seems to make it much less likely that metaphilosophy is simple and elegant. So what would "figure out metaphilosophy" mean in that case?) ↩︎

  2. I can also see the situation potentially being even worse, since many future threats will be very "out of distribution" for human evolution/history/intuitions/reasoning, so maybe we end up handling them even worse than current threats. ↩︎