Posts

UDT shows that decision theory is more puzzling than ever 2023-09-13T12:26:09.739Z
Meta Questions about Metaphilosophy 2023-09-01T01:17:57.578Z
Why doesn't China (or didn't anyone) encourage/mandate elastomeric respirators to control COVID? 2022-09-17T03:07:39.080Z
How to bet against civilizational adequacy? 2022-08-12T23:33:56.173Z
AI ethics vs AI alignment 2022-07-26T13:08:48.609Z
A broad basin of attraction around human values? 2022-04-12T05:15:14.664Z
Morality is Scary 2021-12-02T06:35:06.736Z
(USA) N95 masks are available on Amazon 2021-01-18T10:37:40.296Z
Anti-EMH Evidence (and a plea for help) 2020-12-05T18:29:31.772Z
A tale from Communist China 2020-10-18T17:37:42.228Z
Everything I Know About Elite America I Learned From ‘Fresh Prince’ and ‘West Wing’ 2020-10-11T18:07:52.623Z
Tips/tricks/notes on optimizing investments 2020-05-06T23:21:53.153Z
Have epistemic conditions always been this bad? 2020-01-25T04:42:52.190Z
Against Premature Abstraction of Political Issues 2019-12-18T20:19:53.909Z
What determines the balance between intelligence signaling and virtue signaling? 2019-12-09T00:11:37.662Z
Ways that China is surpassing the US 2019-11-04T09:45:53.881Z
List of resolved confusions about IDA 2019-09-30T20:03:10.506Z
Don't depend on others to ask for explanations 2019-09-18T19:12:56.145Z
Counterfactual Oracles = online supervised learning with random selection of training episodes 2019-09-10T08:29:08.143Z
AI Safety "Success Stories" 2019-09-07T02:54:15.003Z
Six AI Risk/Strategy Ideas 2019-08-27T00:40:38.672Z
Problems in AI Alignment that philosophers could potentially contribute to 2019-08-17T17:38:31.757Z
Forum participation as a research strategy 2019-07-30T18:09:48.524Z
On the purposes of decision theory research 2019-07-25T07:18:06.552Z
AGI will drastically increase economies of scale 2019-06-07T23:17:38.694Z
How to find a lost phone with dead battery, using Google Location History Takeout 2019-05-30T04:56:28.666Z
Where are people thinking and talking about global coordination for AI safety? 2019-05-22T06:24:02.425Z
"UDT2" and "against UD+ASSA" 2019-05-12T04:18:37.158Z
Disincentives for participating on LW/AF 2019-05-10T19:46:36.010Z
Strategic implications of AIs' ability to coordinate at low cost, for example by merging 2019-04-25T05:08:21.736Z
Please use real names, especially for Alignment Forum? 2019-03-29T02:54:20.812Z
The Main Sources of AI Risk? 2019-03-21T18:28:33.068Z
What's wrong with these analogies for understanding Informed Oversight and IDA? 2019-03-20T09:11:33.613Z
Three ways that "Sufficiently optimized agents appear coherent" can be false 2019-03-05T21:52:35.462Z
Why didn't Agoric Computing become popular? 2019-02-16T06:19:56.121Z
Some disjunctive reasons for urgency on AI risk 2019-02-15T20:43:17.340Z
Some Thoughts on Metaphilosophy 2019-02-10T00:28:29.482Z
The Argument from Philosophical Difficulty 2019-02-10T00:28:07.472Z
Why is so much discussion happening in private Google Docs? 2019-01-12T02:19:19.332Z
Two More Decision Theory Problems for Humans 2019-01-04T09:00:33.436Z
Two Neglected Problems in Human-AI Safety 2018-12-16T22:13:29.196Z
Three AI Safety Related Ideas 2018-12-13T21:32:25.415Z
Counterintuitive Comparative Advantage 2018-11-28T20:33:30.023Z
A general model of safety-oriented AI development 2018-06-11T21:00:02.670Z
Beyond Astronomical Waste 2018-06-07T21:04:44.630Z
Can corrigibility be learned safely? 2018-04-01T23:07:46.625Z
Multiplicity of "enlightenment" states and contemplative practices 2018-03-12T08:15:48.709Z
Online discussion is better than pre-publication peer review 2017-09-05T13:25:15.331Z
Examples of Superintelligence Risk (by Jeff Kaufman) 2017-07-15T16:03:58.336Z
Combining Prediction Technologies to Help Moderate Discussions 2016-12-08T00:19:35.854Z

Comments

Comment by Wei Dai (Wei_Dai) on Book Review: 1948 by Benny Morris · 2023-12-04T17:23:48.382Z · LW · GW

Israel was prepared to accept a peaceful partition of the land

There are sources saying that Israeli leaders were publicly willing to accept partition, but privately not. I'm uncertain how true this is, but haven't been able to find any push back on it. (I brought it up in a FB discussion and the pro-Israeli participants just ignored it.)

https://www.salon.com/2015/11/30/u_n_voted_to_partition_palestine_68_years_ago_in_an_unfair_plan_made_even_worse_by_israels_ethnic_cleansing/

Leading Israeli historian Benny Morris documented the intention of Israel's founding fathers to occupy all of the land. Ben-Gurion and Israel's first president, Chaim Weizmann, lobbied strongly for the Partition Plan at the time of the vote 68 years ago. Both leaders, Morris wrote, "saw partition as a stepping stone to further expansion and the eventual takeover of the whole of Palestine."

https://en.wikipedia.org/wiki/1937_Ben-Gurion_letter

Does the establishment of a Jewish state [in only part of Palestine] advance or retard the conversion of this country into a Jewish country? My assumption (which is why I am a fervent proponent of a state, even though it is now linked to partition) is that a Jewish state on only part of the land is not the end but the beginning.... This is because this increase in possession is of consequence not only in itself, but because through it we increase our strength, and every increase in strength helps in the possession of the land as a whole. The establishment of a state, even if only on a portion of the land, is the maximal reinforcement of our strength at the present time and a powerful boost to our historical endeavors to liberate the entire country.

Comment by Wei Dai (Wei_Dai) on OpenAI: Altman Returns · 2023-12-01T23:04:46.682Z · LW · GW

Just saw a poll result that's consistent with this.

Comment by Wei Dai (Wei_Dai) on My techno-optimism [By Vitalik Buterin] · 2023-11-29T17:12:27.047Z · LW · GW

I didn't downvote/disagree-vote Ben's comment, but it doesn't unite the people who think that accelerating development of certain technologies isn't enough to (sufficiently) prevent doom, and that we also need to slow down or pause development of certain other technologies.

Comment by Wei Dai (Wei_Dai) on My techno-optimism [By Vitalik Buterin] · 2023-11-28T20:34:33.319Z · LW · GW

Crossposted from X (I'm experimenting with participating more there.)

This is speaking my language, but I worry that AI may inherently disfavor defense (in at least one area), decentralization, and democracy, and may differentially accelerate the wrong intellectual fields, and humans pushing against that may not be enough. Some explanations below.

"There is an apparent asymmetry between attack and defense in this arena, because manipulating a human is a straightforward optimization problem [...] but teaching or programming an AI to help defend against such manipulation seems much harder [...]" https://www.lesswrong.com/posts/HTgakSs6JpnogD6c2/two-neglected-problems-in-human-ai-safety

"another way for AGIs to greatly reduce coordination costs in an economy is by having each AGI or copies of each AGI profitably take over much larger chunks of the economy" https://www.lesswrong.com/posts/Sn5NiiD5WBi4dLzaB/agi-will-drastically-increase-economies-of-scale

Lastly, I worry that AI will slow down progress in philosophy/wisdom relative to science and technology, because we have easy access to ground truths in the latter fields, which we can use to train AI, but not the former, making it harder to deal with new social/ethical problems.

Comment by Wei Dai (Wei_Dai) on Sam Altman fired from OpenAI · 2023-11-18T01:42:16.750Z · LW · GW

Came across this account via a random lawyer I'm following on Twitter (for investment purposes), who commented, "Huge L for the e/acc nerds tonight". Crazy times...

Comment by Wei Dai (Wei_Dai) on Sam Altman fired from OpenAI · 2023-11-18T01:35:53.883Z · LW · GW

https://twitter.com/karaswisher/status/1725678898388553901 Kara Swisher @karaswisher

Sources tell me that the profit direction of the company under Altman and the speed of development, which could be seen as too risky, and the nonprofit side dedicated to more safety and caution were at odds. One person on the Sam side called it a “coup,” while another said it was the right move.

Comment by Wei Dai (Wei_Dai) on Value systematization: how values become coherent (and misaligned) · 2023-10-28T21:24:10.605Z · LW · GW

Similarly, suppose you have two deontological values which trade off against each other. Before systematization, the question of “what’s the right way to handle cases where they conflict” is not really well-defined; you have no procedure for doing so.

Why is this a problem that calls out to be fixed (hence leading to systematization)? Why not just stick with the default of "go with whichever value/preference/intuition feels stronger in the moment"? People do that unthinkingly all the time, right? (I have my own thoughts on this, but curious if you agree with me or what your own thinking is.)

And that’s why the “mind itself wants to do this” does make sense, because it’s reasonable to assume that highly capable cognitive architectures will have ways of identifying aspects of their thinking that “don’t make sense” and correcting them.

How would you cash out "don't make sense" here?

Comment by Wei Dai (Wei_Dai) on Value systematization: how values become coherent (and misaligned) · 2023-10-28T04:12:22.444Z · LW · GW

which makes sense, since in some sense systematizing from our own welfare to others’ welfare is the whole foundation of morality

This seems wrong to me. I think concern for others' welfare comes from being directly taught/trained as a child to have concern for others, and then later reinforced by social rewards/punishments as one succeeds or fails at various social games. This situation could have come about without anyone "systematizing from our own welfare", just by cultural (and/or genetic) variation and selection. I think value systematizing more plausibly comes into play with things like widening one's circle of concern beyond one's family/tribe/community.

What you're trying to explain with this statement, i.e., "Morality seems like the domain where humans have the strongest instinct to systematize our preferences" seems better explained by what I wrote in this comment.

Comment by Wei Dai (Wei_Dai) on Value systematization: how values become coherent (and misaligned) · 2023-10-28T02:34:55.988Z · LW · GW

This reminds me that I have an old post asking Why Do We Engage in Moral Simplification? (What I called "moral simplification" seems very similar to what you call "value systematization".) I guess my post didn't really fully answer this question, and you don't seem to talk much about the "why" either.

Here are some ideas after thinking about it for a while. (Morality is Scary is useful background here, if you haven't read it already.)

  1. Wanting to use explicit reasoning with our values (e.g., to make decisions), which requires making our values explicit, i.e., defining them symbolically, which necessitates simplification given limitations of human symbolic reasoning.
  2. Moral philosophy as a status game, where moral philosophers are implicitly scored on the moral theories they come up with by simplicity and by how many human moral intuitions they are consistent with.
  3. Everyday signaling games, where people (in part) compete to show that they have community-approved or locally popular values. Making values legible and not too complex facilitates playing these games.
  4. Instinctively transferring our intuitions/preferences for simplicity from "belief systematization" where they work really well, into a different domain (values) where they may or may not still make sense.

(Not sure how any of this applies to AI. Will have to think more about that.)

Comment by Wei Dai (Wei_Dai) on Thoughts on responsible scaling policies and regulation · 2023-10-26T21:15:49.250Z · LW · GW

if the biggest impact is the very slow thinking from a very small group of people who care about them, then I think that’s a very small impact

I guess from my perspective, the biggest impact is the possibility that the idea of better preparing for these risks becomes a lot more popular. An analogy with Bitcoin comes to mind, where the idea of cryptography-based distributed money languished for many years, known only to a tiny community, and then was suddenly everywhere. An AI pause would provide more time for something like that to happen. And if the idea of better preparing for these risks was actually a good one (as you seem to think), there's no reason why it couldn't (or would be very unlikely to) spread beyond a very small group, do you agree?

Comment by Wei Dai (Wei_Dai) on Thoughts on responsible scaling policies and regulation · 2023-10-26T06:33:44.588Z · LW · GW

In My views on “doom” you wrote:

Probability of messing it up in some other way during a period of accelerated technological change (e.g. driving ourselves crazy, creating a permanent dystopia, making unwise commitments…): 15%

Do you think these risks can also be reduced by 10x by a "very good RSP"? If yes, how or by what kinds of policies? If not, isn't "cut risk dramatically [...] perhaps a 10x reduction" kind of misleading?

It concerns me that none of the RSP documents or discussions I've seen talked about these particular risks, or "unknown unknowns" (other risks that we haven't thought of yet).

I'm also bummed that "AI pause" people don't talk about these risks either, but at least an AI pause would implicitly address these risks by default, whereas RSPs would not.

Comment by Wei Dai (Wei_Dai) on Fertility Roundup #2 · 2023-10-17T21:17:35.737Z · LW · GW

Tangentially, I just saw a Chinese comedy-drama film (Xueba 学爸) about Chinese parents trying to get their kids (kindergarteners) into "good" schools, and all the sacrifices/difficulties/frustrations they go through. While watching it, I was thinking "Who is going to want to become a parent after seeing this? Whoever approved this film (as every film in China has to be pre-approved by government censors), don't they know that the Chinese government is trying to raise birth rates?" I guess from their perspective, the lesson of the movie is that it's not worth it to push your kids that hard to try to get into good schools (which is another thing the Chinese government is trying to persuade parents of).

I agree with Robin Hanson that signaling must be a big part of the problem, and think it's striking how bad we are at understanding and manipulating signaling dynamics. Even Robin doesn't seem to explain why there's now so much more signaling via parenting, to the extent of causing a lot of people to not want to become parents at all. And I don't see any proposals for raising fertility that are explicitly trying to defuse the signaling spiral. (Are there any?)

Comment by Wei Dai (Wei_Dai) on RSPs are pauses done right · 2023-10-14T19:26:53.226Z · LW · GW

To be clear, I definitely agree with this. My position is not “RSPs are all we need”, “pauses are bad”, “pause advocacy is bad”, etc.—my position is that getting good RSPs is an effective way to implement a pause: i.e. “RSPs are pauses done right.”

Some feedback on this: my expectation upon seeing your title was that you would argue, or that you implicitly believe, that RSPs are better than other current "pause" attempts/policies/ideas. I think this expectation came from the common usage of the phrase "done right" to mean that other people are doing it wrong or at least doing it suboptimally.

Comment by Wei Dai (Wei_Dai) on Anti-EMH Evidence (and a plea for help) · 2023-09-27T21:58:26.518Z · LW · GW

Another piece of evidence against EMH: Coal commodity spot and futures prices have been moving up for several months, with coal stock prices naturally following, but today one analyst raised his price targets on several met coal stocks based on higher expected coal prices, and almost every US coal stock rose another 3-10%. (See BTU, CEIX, ARCH, AMR, HCC.) But there was no new private information released or any change in fundamentals compared to yesterday (futures markets are essentially flat). It's just a pure change in valuation.

Even more damningly, CEIX is up 3.5% (was up 5% intraday) even though it was not upgraded by this analyst, due to the fact that it mines thermal coal and the upgrades were based on higher met coal prices.

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-16T16:41:34.362Z · LW · GW

Thanks, I've set a reminder to attend your talk. In case I miss it, can you please record it and post a link here?

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-15T20:42:27.273Z · LW · GW

But, the gist of your post seems to be: "Since coming up with UDT, we ran into these problems, made no progress, and are apparently at a dead end. Therefore, UDT might have been the wrong turn entirely."

This is a bit stronger than how I would phrase it, but basically yes.

On the other hand, my view is: Since coming up with those problems, we made a lot of progress on agent theory within the LTA

I tend to be pretty skeptical of new ideas. (This backfired spectacularly once, when I didn't pay much attention to Satoshi when he contacted me about Bitcoin, but I think in general it has served me well.) My experience with philosophical questions is that even when some approach looks a stone's throw away from a final solution to some problem, a bunch of new problems pop up and show that we're still quite far away. With an approach that is still as early as yours, I just think there's quite a good chance it doesn't work out in the end, or gets stuck somewhere on a hard problem. (Also some people who have dug into the details don't seem as optimistic that it is the right approach.) So I'm reluctant to decrease my probability of "UDT was a wrong turn" too much based on it.

The rest of your discussion about 2TDT-1CDT seems plausible to me, although of course depends on whether the math works out, doing something about monotonicity, and also a solution to the problem of how to choose one's IBH prior. (If the solution was something like "it's subjective/arbitrary" that would be pretty unsatisfying from my perspective.)

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-15T20:10:43.431Z · LW · GW

See this comment and the post that it's replying to.

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-14T23:06:25.079Z · LW · GW

Do you think part of it might be that even people with graduate philosophy educations are too prone to being wedded to their own ideas, or don't like to poke holes in them as much as they should? Because part of what contributes to my wanting to go more meta is being dissatisfied with my own object-level solutions and finding more and more open problems that I don't know how to solve. I haven't read much academic philosophy literature, but did read some anthropic reasoning and decision theory literature earlier, and the impression I got is that most of the authors weren't trying that hard to poke holes in their own ideas.

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T22:23:13.337Z · LW · GW

I don't understand your ideas in detail (am interested but don't have the time/ability/inclination to dig into the mathematical details), but from the informal writeups/reviews/critiques I've seen of your overall approach, as well as my sense from reading this comment of how far away you are from a full solution to the problems I listed in the OP, I'm still comfortable sticking with "most are wide open". :)

On the object level, maybe we can just focus on Problem 4 for now. What do you think actually happens in a 2IBH-1CDT game? Presumably CDT still plays D, and what do the IBH agents do? And how does that imply that the puzzle is resolved?

As a reminder, the puzzle I see is that this problem shows that a CDT agent doesn't necessarily want to become more UDT-like, and for seemingly good reason, so on what basis can we say that UDT is a clear advancement in decision theory? If CDT agents similarly don't want to become more IBH-like, isn't there the same puzzle? (Or do they?) This seems different from the playing chicken with a rock example, because a rock is not a decision theory so that example doesn't seem to offer the same puzzle.

ETA: Oh, I think you're saying that the CDT agent could turn into an IBH agent but with a different prior from the other IBH agents, that ends up allowing it to still play D while the other two still play C, so it's not made worse off by switching to IBH. Can you walk this through in more detail? How does the CDT agent choose what prior to use when switching to IBH, and how do the different priors actually imply a CCD outcome in the end?

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T21:46:12.774Z · LW · GW

I think I kind of get what you're saying, but it doesn't seem right to model TDT as caring about all other TDT agents, as they would exploit other TDT agents if they could do so without negative consequences to themselves, e.g., if a TDT AI were in a one-shot game where it could unilaterally decide whether or not to attack and take over another TDT AI.

Maybe you could argue that the TDT agent would refrain from doing this because of considerations like its decision to attack being correlated with other AIs' decisions to potentially attack it in other situations/universes, but that's still not the same as caring for other TDT agents. I mean the chain of reasoning/computation you would go through in the two cases seem very different.

Also it's not clear to me what implications your idea has even if it was correct, like what does it suggest about what the right decision theory is?

BTW do you have any thoughts on Vanessa Kosoy's decision theory ideas?

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T13:22:49.384Z · LW · GW

I'm not aware of good reasons to think that it's wrong; it's more that I'm just not sure it's the right approach. I mean we can say that it's a matter of preferences, problem solved, but unless we can also show that we should be anti-realist about these preferences, or show what the right preferences are, the problem isn't really solved. Until we do have a definitive full solution, it seems hard to be confident that any particular approach is the right one.

It seems plausible that treating anthropic reasoning as a matter of preferences makes it harder to fully solve the problem. I wrote "In general, Updateless Decision Theory converts anthropic reasoning problems into ethical problems." in the linked post, but we don't have a great track record of solving ethical problems...

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T12:08:52.159Z · LW · GW

Even items 1, 3, 4, and 6 are covered by your research agenda? If so, can you quickly sketch what you expect the solutions to look like?

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T04:08:09.837Z · LW · GW

The general hope is that slight differences in source code (or even large differences, as long as they're all using UDT or something close to it) wouldn't be enough to make a UDT agent defect against another UDT agent (i.e. the logical correlation between their decisions would be high enough), otherwise "UDT agents cooperate with each other in one-shot PD" would be false or not have many practical implications, since why would all UDT agents have the exact same source code?
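
To make the "high enough logical correlation" condition concrete, here is a toy expected-utility calculation of the kind a UDT-ish agent might run, treating logical correlation as a single probability p that the other agent ends up making the same choice (a minimal sketch; the standard PD payoff numbers and the reduction of logical correlation to one probability are simplifying assumptions on my part):

```python
# One-shot Prisoner's Dilemma with "logical correlation" modeled crudely as a
# probability p that the other agent's choice mirrors mine.
# Standard PD payoffs (assumed): T=5 (I defect, they cooperate), R=3 (both
# cooperate), P=1 (both defect), S=0 (I cooperate, they defect).
T, R, P, S = 5, 3, 1, 0

def eu_cooperate(p):
    # With probability p the other agent mirrors me (C,C); otherwise they defect.
    return p * R + (1 - p) * S

def eu_defect(p):
    # With probability p the other agent mirrors me (D,D); otherwise they cooperate.
    return p * P + (1 - p) * T

def crossover():
    # Cooperation has higher EU when p*R + (1-p)*S > p*P + (1-p)*T,
    # i.e. when p > (T - S) / ((T - S) + (R - P)).
    return (T - S) / ((T - S) + (R - P))

print(crossover())                         # ~0.714: above this, cooperating wins
print(eu_cooperate(0.9), eu_defect(0.9))   # roughly 2.7 vs 1.4
```

So on this toy model, differences in source code only matter to the extent that they push the correlation below that threshold.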

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T03:11:57.625Z · LW · GW

I'm not sure why you say "if the UDT agents could change their own code (silently) cooperation would immediately break down", because in my view a UDT agent would reason that if it changed its code (to something like CDT for example), that logically implies other UDT agents also changing their code to do the same thing, so the expected utility of changing its code would be evaluated as lower than not changing its code. So it would remain a UDT agent and still cooperate with other UDT agents, or whenever the probability of the other agent being UDT is high enough.

To me this example is about a CDT agent not wanting to become UDT-like if it found itself in a situation with many other UDT agents, which just seems puzzling if your previous perspective was that UDT is a clear advancement in decision theory and everyone should adopt UDT or become more UDT-like.

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T01:58:48.918Z · LW · GW

Nobody is being tricked though. Everyone knows there's a CDT agent among the population, just not who, and we can assume they have a correct amount of uncertainty about what the other agent's decision theory / source code is. The CDT agent still has an advantage in that case. And it is a problem because it means CDT agents don't always want to become more UDT-like (it seems like there are natural or at least not completely contrived situations, like Omega punishing UDT agents just for using UDT, where they don't), which takes away a major argument in its favor.

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T01:51:55.338Z · LW · GW

But the situation isn't symmetrical, meaning if you reversed the setup to have 2 CDT agents and 1 TDT agent, the TDT agent doesn't do better than the CDT agents, so it does seem like the puzzle has something to do with decision theory, and is not just about smaller vs larger groups? (Sorry, I may be missing your point.)

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2023-09-14T01:15:41.741Z · LW · GW

I feel like MIRI perhaps mispositioned FDT (their variant of UDT) as a clear advancement in decision theory

On second thought this is probably not fair to MIRI, since I don't think I objected to such positioning when they sent paper drafts for me to review. I guess in the early days UDT did look more like a clear advancement, because it seemed to elegantly solve several problems at once, namely anthropic reasoning (my original reason to start thinking in the "updateless" direction), counterfactual mugging, cooperation with psychological twin / same decision theory, Newcomb's problem, and it wasn't yet known that the open problems would remain open for so long.

Comment by Wei Dai (Wei_Dai) on Contra Heighn Contra Me Contra Functional Decision Theory · 2023-09-13T01:51:22.959Z · LW · GW

(Will be using "UDT" below but I think the same issue applies to all subsequent variants such as FDT that kept the "updateless" feature.)

I think this is a fair point. It's not the only difference between CDT and UDT but does seem to account for why many people find UDT counterintuitive. I made a similar point in this comment. I do disagree with "As such the debate over which is more “rational” mostly comes down to a semantic dispute." though. There are definitely some substantial issues here.

(A nit first: it's not that UDT must value all copies of oneself equally, but that it is incompatible with indexical values. You can have a UDT utility function that values different copies differently; it just has to be fixed for all time instead of changing based on what you observe.)

I think humans do seem to have indexical values, but what to do about it is a big open problem in decision theory. "Just use CDT" is unsatisfactory because as soon as someone could self-modify, they would have incentive to modify themselves to no longer use CDT (and no longer have indexical values). I'm not sure what further implications that has though. (See above linked post where I talked about this puzzle in a bit more detail.)

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-03T18:05:57.968Z · LW · GW

Thanks for this clear explanation of conceptual analysis. I've been wanting to ask some questions about this line of thought:

  1. Where do semantic intuitions come from?
  2. What should we do when different people have different such intuitions? For example you must know that Newcomb's problem is famously divisive, with roughly half of philosophers preferring one-boxing and half preferring two-boxing. Similarly for trolley thought experiments, intuitions about the nature of morality (metaethics), etc.
  3. How do we make sure that AI has the right intuitions? Maybe in some cases we can just have it learn from humans, but what about:
    1. Cases where humans disagree.
    2. Cases where all/most humans are wrong. (In other words, can we build AIs that have better intuitions than humans?) Or is that not a thing in conceptual analysis, i.e., semantic intuitions can't be wrong?
    3. Completely novel philosophical questions or situations where AI can't learn from humans (because humans don't have intuitions about it either, or AI has to make time sensitive decisions and humans are too slow).

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-02T19:21:27.671Z · LW · GW

I’m not sure I understand why it would be bad if it actually is a solution. If we do, great, p(doom) drops because now we are much closer to making aligned systems that can help us grow the economy, do science, stabilize society etc. Though of course this moves us into a “misuse risk” paradigm, which is also extremely dangerous.

I prefer to frame it as human-AI safety problems instead of "misuse risk", but the point is that if we're trying to buy time in part to have more time to solve misuse/human-safety (e.g. by improving coordination/epistemology or solving metaphilosophy), but the strategy for buying time only achieves a pause until alignment is solved, then the earlier alignment is solved, the less time we have to work on misuse/human-safety.

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-02T18:54:37.545Z · LW · GW

which is bottlenecked by us running out of time, hence why I think the pragmatic strategic choice is to try to buy us more time.

What are you proposing or planning to do to achieve this? I observe that most current attempts to "buy time" seem organized around convincing people that AI deception/takeover is a big risk and that we should pause or slow down AI development or deployment until that problem is solved, for example via intent alignment. But what happens if AI deception then gets solved relatively quickly (or someone comes up with a proposed solution that looks good enough to decision makers)? And this is another way that working on alignment could be harmful from my perspective...

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-02T14:29:04.549Z · LW · GW

@jessicata @Connor Leahy @Domenic @Daniel Kokotajlo @romeostevensit @Vanessa Kosoy @cousin_it @ShardPhoenix @Mitchell_Porter @Lukas_Gloor (and others, apparently I can only notify 10 people by mentioning them in a comment)

Sorry if I'm late in responding to your comments. This post has gotten more attention and replies than I expected, in many different directions, and it will probably take a while for me to process and reply to them all. (In the meantime, I'd love to see more people discuss each other's ideas here.)

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-02T06:50:53.113Z · LW · GW

Do you have any examples that could illustrate your theory?

It doesn't seem to fit my own experience. I became interested in Bayesian probability, universal prior, Tegmark multiverse, and anthropic reasoning during college, and started thinking about decision theory and ideas that ultimately led to UDT, but what heuristics could I have been applying, learned from what "domains with feedback"?

Maybe I used a heuristic like "computer science is cool, lets try to apply it to philosophical problems" but if the heuristics are this coarse grained, it doesn't seem like the idea can explain how detailed philosophical reasoning happens, or be used to ensure AI philosophical competence?

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-02T06:26:48.976Z · LW · GW

I expect at this moment in time me building a company is going to help me deconfuse a lot of things about philosophy more than me thinking about it really hard in isolation would

Hard for me to make sense of this. What philosophical questions do you think you'll get clarity on by doing this? What are some examples of people successfully doing this in the past?

It seems plausible that there is no such thing as “correct” metaphilosophy, and humans are just making up random stuff based on our priors and environment and that’s it and there is no “right way” to do philosophy, similar to how there are no “right preferences”.

Definitely a possibility (I've entertained it myself and maybe wrote some past comments along these lines). I wish there were more people studying this possibility.

I have short timelines and think we will be dead if we don’t make very rapid progress on extremely urgent practical problems like government regulation and AI safety. Metaphilosophy falls into the unfortunate bucket of “important, but not (as) urgent” in my view.

Everyone dying isn't the worst thing that could happen. I think from a selfish perspective, I'm personally a bit more scared of surviving into a dystopia powered by ASI that is aligned in some narrow technical sense. Less sure from an altruistic/impartial perspective, but it seems at least plausible that building an aligned AI without making sure that the future human-AI civilization is "safe" is not a good thing to do.

I would say that better philosophy/arguments around questions like this is a bottleneck. One reason for my interest in metaphilosophy that I didn't mention in the OP is that studying it seems least likely to cause harm or make things worse, compared to any other AI related topics I can work on. (I started thinking this as early as 2012.) Given how much harm people have done in the name of good, maybe we should all take "first do no harm" much more seriously?

There are no good institutions, norms, groups, funding etc to do this kind of work.

Which also represents an opportunity...

It’s weird. I happen to have a very deep interest in the topic, but it costs you weirdness points to push an idea like this when you could instead be advocating more efficiently for more pragmatic work.

Is it actually that weird? Do you have any stories of trying to talk about it with someone and having that backfire on you?

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-02T02:52:28.020Z · LW · GW

Philosophy is a social/intellectual process taking place in the world. If you understand the world, you understand how philosophy proceeds.

What if I'm mainly interested in how philosophical reasoning ideally ought to work? (Similar to how decision theory studies how decision making normatively should work, not how it actually works in people.) Of course if we have little idea how real-world philosophical reasoning works, understanding that first would probably help a lot, but that's not the ultimate goal, at least not for me, for both intellectual and AI reasons.

The latter because humans do a lot of bad philosophy and often can’t recognize good philosophy. (See popularity of two-boxing among professional philosophers.) I want a theory of ideal/normative philosophical reasoning so we can build AI that improves upon human philosophy, and in a way that convinces many people (because they believe the theory is right) to trust the AI's philosophical reasoning.

This leads to a view where philosophy is one of many types of discourse/understanding that each shape each other (a non-foundationalist view). This is perhaps disappointing if you wanted ultimate foundations in some simple framework.

Sure, ultimate foundations in some simple framework would be nice, but I'll take whatever I can get. How would you flesh out the non-foundationalist view?

Most thought is currently not foundationalist, but perhaps a foundational re-orientation could be found by understanding the current state of non-foundational thought.

I don't understand this sentence at all. Please explain more?

Comment by Wei Dai (Wei_Dai) on Forum participation as a research strategy · 2023-09-01T15:15:52.674Z · LW · GW

On a forum you can judge other people's opinions of your contributions by the karma (or the equivalent) of your posts, and by their comments. Of course there's a risk that people on some forum liking your posts might represent groupthink instead of genuine intellectual progress, but the same risk exists with academic peer review, and one simply has to keep this risk/uncertainty in mind.

Comment by Wei Dai (Wei_Dai) on Eliezer Yudkowsky Is Frequently, Confidently, Egregiously Wrong · 2023-09-01T10:01:00.356Z · LW · GW

FDT is a valuable idea in that it’s a stepping stone towards / approximation of UDT.

You might be thinking of TDT, which was invented prior to UDT. FDT actually came out after UDT. My understanding is that the OP disagrees with the entire TDT/UDT/FDT line of thinking, since they all one-box in Newcomb's problem and the OP thinks one should two-box.

Comment by Wei Dai (Wei_Dai) on Meta Questions about Metaphilosophy · 2023-09-01T09:42:24.902Z · LW · GW

If we keep stumbling into LLM type things which are competent at a surprisingly wide range of tasks, do you expect that they’ll be worse at philosophy than at other tasks?

I'm not sure but I do think it's very risky to depend on LLMs to be good at philosophy by default. Some of my thoughts on this:

  • Humans do a lot of bad philosophy and often can't recognize good philosophy. (See popularity of two-boxing among professional philosophers.) Even if an LLM has learned how to do good philosophy, how will users or AI developers know how to prompt it to elicit that capability (e.g., which philosophers to emulate)? (It's possible that even solving metaphilosophy doesn't help enough with this, if many people can't recognize the solution as correct, but there's at least a chance that the solution does look obviously correct to many people, especially if there aren't already wrong solutions to compete with).
  • What if it learns how to do good philosophy during pre-training, but RLHF trains that away in favor of optimizing arguments to look good to the user?
  • What if philosophy is just intrinsically hard for ML in general (I gave an argument for why ML might have trouble learning philosophy from humans in the section Replicate the trajectory with ML? of Some Thoughts on Metaphilosophy, but I'm not sure how strong it is) or maybe it's just some specific LLM architecture that has trouble with this, and we never figure this out because the AI is good at finding arguments that look good to humans?
  • Or maybe we do figure out that AI is worse at philosophy than other tasks, after it has been built, but it's too late to do anything with that knowledge (because who is going to tell the investors that they've lost their money because we don't want to differentially decelerate philosophical progress by deploying the AI).

Comment by Wei Dai (Wei_Dai) on A list of core AI safety problems and how I hope to solve them · 2023-08-30T22:28:53.454Z · LW · GW

If it’s not misuse, the provisions in 5.1.4-5 will steer the search process away from policies that attempt to propagandize to humans.

Ok I'll quote 5.1.4-5 to make it easier for others to follow this discussion:

5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).

5.1.5. The incentives under 5.1.4 by default are to take control over irrelevant variables so as to ensure that they proceed as in the anticipated “status quo”. Infrabayesian uncertainty about the dynamics is the final component that removes this incentive. In particular, the infrabayesian prior can (and should) have a high degree of Knightian uncertainty about human decisions and behaviour. This makes the most effective way to limit the maximum divergence (of human trajectories from the status quo) actually not interfering.

I'm not sure how these are intended to work. How do you intend to define/implement "divergence"? How does that definition/implementation, combined with a "high degree of Knightian uncertainty about human decisions and behaviour", actually cause the AI to "not interfere" but also still accomplish the goals that we give it?

In order to accomplish its goals, the AI has to do lots of things that will have butterfly effects on the future, so the system has to allow it to do those things, but also not allow it to "propagandize to humans". It's just unclear to me how you intend to achieve this.
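
To make these questions concrete, here is the kind of naive formalization of 5.1.4-5 that I can imagine (a minimal sketch; the function names, the form of the penalty, and the use of a plain max over a hypothesis set as a stand-in for infrabayesian/Knightian uncertainty are all my guesses, not something specified in the OAA post):

```python
# Toy score function: task reward minus a penalty on the worst-case divergence
# of the predicted world trajectory from the predicted status-quo trajectory,
# with the worst case taken over a set of hypotheses about human behaviour.

def score(plan, task_reward, world_models, divergence, status_quo_plan, lam=1.0):
    """All arguments are hypothetical stand-ins:
    - task_reward: maps a plan to how well it accomplishes the stated goal
    - world_models: hypotheses, each mapping a plan to a predicted trajectory
    - divergence: some metric between two trajectories
    - status_quo_plan: the null plan in which no powerful AI system acts
    """
    worst_divergence = max(
        divergence(model(plan), model(status_quo_plan)) for model in world_models
    )
    return task_reward(plan) - lam * worst_divergence
```

The questions above are then, concretely: what plays the role of divergence here, and why would taking the worst case over a rich enough set of world_models make "don't interfere with humans" the best way to keep the penalty small, without also making the task reward unattainable given the butterfly effects mentioned above?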

Comment by Wei Dai (Wei_Dai) on Anyone want to debate publicly about FDT? · 2023-08-30T20:38:03.309Z · LW · GW

Why the focus on repeated games? It seems like one of the central motivations for people to be interested in logical decision theories (TDT/UDT/FDT) is that they recommend one-boxing and playing C in PD (against someone with similar decision theory), even in non-repeated games, and you're not addressing that?

Comment by Wei Dai (Wei_Dai) on Anyone want to debate publicly about FDT? · 2023-08-29T12:57:55.167Z · LW · GW

You should take a look at this list of UDT open problems that Vladimir Slepnev wrote 13 years ago, where 2 and 3 are problems in which UDT/FDT seemingly make incorrect decisions, and 1 and 5 are definitely also serious open problems.

Comment by Wei Dai (Wei_Dai) on Another attempt to explain UDT · 2023-08-29T12:42:33.332Z · LW · GW

Answering your questions 13 years later because I want to cite cousin_it's list of open problems, and others may see your comment and wonder what the answers are. I'm not sure about 4 on his list but I think 1, 2, 3, and 5 are definitely still open problems.

How is this [1. 2TDT-1CDT] not resolved?

Consider the evolutionary version of this. Suppose there's a group of TDT (or UDT or FDT) agents and one of them got a random mutation that changed it into a CDT agent, and this was known to everyone (but not the identity of the CDT agent). If two randomly selected agents paired off to play true PD against each other, a TDT agent would play C (since they're still likely facing another TDT agent) and the CDT agent would play D. So the CDT agent would be better off, and it would want to be really careful not to become a TDT agent or delegate to a TDT AI or become accidentally correlated with TDT agents. This doesn't necessarily mean that TDT/UDT/FDT is wrong, but seems like a weird outcome, plus how do we know that we're not in a situation like the one that the CDT agent is in (i.e., should be very careful not to become/delegate to a TDT-like agent)?

Eliezer also ended up thinking this might be a real issue.
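
A toy expected-payoff calculation for this setup (a minimal sketch; the standard PD payoff numbers, and the simplification that the TDT agents just play C while the CDT mutant plays D, are assumptions layered on top of the argument above):

```python
# Population of n_tdt TDT agents plus one CDT mutant; two agents are drawn at
# random to play one-shot true PD. Per the argument above, each TDT agent plays
# C (its opponent is probably another TDT agent) and the CDT mutant plays D.
# Standard PD payoffs (assumed): T=5, R=3, P=1, S=0.
T, R, P, S = 5, 3, 1, 0

def expected_payoffs(n_tdt):
    # From a TDT agent's point of view, its opponent is one of the other
    # n_tdt - 1 TDT agents or the single CDT agent, uniformly at random.
    p_opp_tdt = (n_tdt - 1) / n_tdt
    tdt_payoff = p_opp_tdt * R + (1 - p_opp_tdt) * S  # C vs C, or C vs D
    cdt_payoff = T                                    # D vs C: opponent is always TDT
    return tdt_payoff, cdt_payoff

print(expected_payoffs(10))  # roughly (2.7, 5): the lone CDT agent does strictly better
```

With the roles reversed (many CDT agents, one TDT agent), the TDT agent faces known defectors and simply defects too, so it gains no analogous edge, which is the asymmetry discussed in the 2023-09-14T01:51 comment above.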

This [2. Agent simulates predictor] basically says that the predictor is a rock, doesn't depend on agent's decision, which makes the agent lose because of the way problem statement argues into stipulating (outside of predictor's own decision process) that this must be a two-boxing rock rather than a one-boxing rock.

No, we're not saying the predictor is a rock. We're assuming that the predictor is using some kind of reasoning process to make the prediction. Specifically, the predictor could reason as follows: The agent is using UDT1.1 (for example). UDT1.1 is not updateless with regard to logical facts. Given enough computing power (which the agent has), it will inevitably simulate me and then update on my prediction, after which it will view two-boxing as having higher expected utility (no matter what my prediction actually is). Therefore I should predict that it will two-box.
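
A worked version of that reasoning, using the standard Newcomb payoffs (the $1,000,000 / $1,000 box values are the usual assumed numbers, not specified above): once the agent has simulated the predictor and treats the prediction as a fixed fact, two-boxing wins by exactly the contents of the small box, whatever the prediction was.

```python
# Newcomb payoffs (assumed standard values): the opaque box contains $1,000,000
# iff the predictor predicted one-boxing; the transparent box always has $1,000.

def payoff(action, prediction):
    big = 1_000_000 if prediction == "one-box" else 0
    small = 1_000
    return big + small if action == "two-box" else big

# Conditioning on the prediction as a known fact, two-boxing wins by $1,000
# either way, which is why the (non-logically-updateless) agent two-boxes and
# why the predictor can confidently predict that it will.
for prediction in ("one-box", "two-box"):
    print(prediction, payoff("two-box", prediction) - payoff("one-box", prediction))
```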

[3. The stupid winner paradox] Same as (2). We stipulate the weak player to be a $9 rock. Nothing to be surprised about.

No, again the weak player is applying reasoning to decide to demand $9, similar to the reasoning of the predictor above. To spell it out: My opponent is not logically updateless. Whatever I decide, it will simulate me and update on my decision, after which it will play the best response against that. Therefore I should demand $9.

("Be logically updateless" is the seemingly obvious implication here, but how to do that without running into other issues is also an open problem.)

Comment by Wei Dai (Wei_Dai) on A list of core AI safety problems and how I hope to solve them · 2023-08-28T02:10:36.914Z · LW · GW

"AI-powered memetic warfare makes all humans effectively insane" a catastrophe that I listed in an earlier comment, which seems one of the hardest to formally specify. It seems values-complete or metaphilosophy-complete to me, since without having specified human values or having solved metaphilosophy, how can we check whether an AI-generated argument is trying to convince us of something that is wrong according to actual human values, or wrong according to normative philosophical reasoning?

I don't see anything in this post or the linked OAA post that addresses or tries to bypass this difficulty?

Comment by Wei Dai (Wei_Dai) on Update on Ought's experiments on factored evaluation of arguments · 2023-08-17T23:38:08.376Z · LW · GW

Is this the final update from Ought about their factored cognition experiments? (I can't seem to find anything more recent.) The reason I ask is that the experiments they reported on here do not seem very conclusive, and they talked about doing further experiments but then did not seem to give any more updates. Does anyone know the story of what happened, and what that implies about the viability of factored-cognition style alignment schemes?

Comment by Wei Dai (Wei_Dai) on AI #25: Inflection Point · 2023-08-17T23:30:12.678Z · LW · GW

Washington Post’s Parmy Olsen complains There’s Too Much Money Going to AI Doomers, opening with the argument that in the Industrial Revolution we shouldn’t have spent a lot of money ensuring the machines did not rise up against us, because in hindsight they did not do that.

I wonder what would happen if we "amplified" reasoning like this, as in HCH, IDA, Debate, etc.

Do we understand reasoning well enough to ensure that this class of errors can be avoided in AI alignment schemes that depend on human reasoning, or to ensure that this class of errors will be reliably self-corrected as the AI scales up?

Comment by Wei Dai (Wei_Dai) on If I Was An Eccentric Trillionaire · 2023-08-09T20:35:50.692Z · LW · GW

Metaphilosophy

I appreciate you sharing many of the same philosophical interests as me (and giving them a signal boost here), but for the sake of clarity / good terminology, I think all the topics you list under this section actually belong to object-level philosophy, not metaphilosophy.

I happen to think metaphilosophy is also extremely interesting/important, and you can see my latest thoughts on it at Some Thoughts on Metaphilosophy (which also links to earlier posts on the topic) if you're interested.

Comment by Wei Dai (Wei_Dai) on [Linkpost] Introducing Superalignment · 2023-08-02T10:08:19.188Z · LW · GW

goodness of HCH

What is the latest thinking/discussion about this? I tried to search LW/AF but haven't found a lot of discussions, especially positive arguments for HCH being good. Do you have any links or docs you can share?

How do you think about the general unreliability of human reasoning (for example, the majority of professional decision theorists apparently being two-boxers and favoring CDT, and general overconfidence of almost everyone on all kinds of topics, including morality and meta-ethics and other topics relevant for AI alignment) in relation to HCH? What are your guesses for how future historians would complete the following sentence? Despite human reasoning being apparently very unreliable, HCH was a good approximation target for AI because ...

instead relies on some claims about offense-defense between teams of weak agents and strong agents

I'm curious if you have an opinion on where the burden of proof lies when it comes to claims like these. I feel like in practice it's up to people like me to offer sufficiently convincing skeptical arguments if we want to stop AI labs from pursuing their plans (since we have little power to do anything else) but morally shouldn't the AI labs have much stronger theoretical foundations for their alignment approaches before e.g. trying to build a human-level alignment researcher in 4 years? (Because if the alignment approach doesn't work, we would either end up with an unaligned AGI or be very close to being able to build AGI but with no way to align it.)

Comment by Wei Dai (Wei_Dai) on A Hill of Validity in Defense of Meaning · 2023-07-17T11:03:40.071Z · LW · GW

Yudkowsky couldn’t be bothered to either live up to his own stated standards

"his own stated standards" could use a link/citation.

regardless of the initial intent, scrupulous rationalists were paying rent to something claiming moral authority, which had no concrete specific plan to do anything other than run out the clock, maintaining a facsimile of dialogue in ways well-calibrated to continue to generate revenue.

The original Kolmogorov complicity was an instance of lying to protect one's intellectual endeavors. But here you/Ben seem to be accusing Eliezer of doing something much worse, which seems like a big leap from what came before it in the post. How did you/Ben rule out the Kolmogorov complicity hypothesis (i.e., that Eliezer still had genuine intellectual or altruistic aims that he wanted to protect)?

Of what you wrote specifically, "no concrete specific plan" is in my view actually a point in Eliezer's favor, as it's a natural consequence of high alignment difficulty and intellectual honesty. "Run out the clock" hardly seems fair, and by "maintaining a facsimile of dialogue" what are you referring to? Are you including things like the 2021 MIRI Conversations and if so are you suggesting that all the other (non-MIRI) participants are being fooled or in on the scam?

But since I did spend my entire adult life in Yudkowsky’s robot cult, trusting him the way a Catholic trusts the Pope

I would be interested to read an account of how this happened, and what might have prevented the error.

Comment by Wei Dai (Wei_Dai) on The Commitment Races problem · 2023-07-14T23:27:52.772Z · LW · GW

But yeah also I think that AGIs will be by default way better than humans at this sort of stuff.

What's your reasons for thinking this? (Sorry if you already explained this and I missed your point, but it doesn't seem like you directly addressed my point that if AGIs learn from or defer to humans, they'll be roughly human-level at this stuff?)

When you say “the top tier of rational superintelligences exploits everyone else” I say that is analogous to “the most rational/clever/capable humans form an elite class which rules over and exploits the masses.” So I’m like yeah, kinda sorta I expect that to happen, but it’s typically not that bad?

I think it could be much worse than current exploitation, because technological constraints prevent current exploiters from extracting full value from the exploited (have to keep them alive for labor, can't make them too unhappy or they'll rebel, monitoring for and repressing rebellions is costly). But with superintelligence and future/acausal threats, an exploiter can bypass all these problems by demanding that the exploited build an AGI aligned to itself and let it take over directly.

Comment by Wei Dai (Wei_Dai) on The Commitment Races problem · 2023-07-14T20:58:14.268Z · LW · GW

I think that agents worthy of being called “rational” will probably handle all this stuff more gracefully/competently than humans do

Humans are kind of terrible at this, right? Many give in even to threats (bluffs) conjured up by dumb memeplexes and backed up by nothing (i.e., heaven/hell), popular films are full of heroes giving in to threats, an apparent majority of philosophers have 2-boxing intuitions (hence the popularity of CDT, which IIUC was invented specifically because some philosophers were unhappy with EDT choosing to 1-box), governments negotiate with terrorists pretty often, etc.

The sort of society AGIs construct will be at least as cooperatively-competent / good-at-coordinating-diverse-agents-with-diverse-agendas-and-beliefs as Dath Ilan.

If we build AGIs that learn from humans or defer to humans on this stuff, do we not get human-like (in)competence?[1][2] If humans are not atypical, couldn't large parts of the acausal society/economy be similarly incompetent? I imagine there could be a top tier of "rational" superintelligences, built by civilizations that were especially clever or wise or lucky, that cooperate with each other (and exploit everyone else who can be exploited), but I disagree with this second quoted statement, which seems overly optimistic to me. (At least for now; maybe your unstated reasons to be optimistic will end up convincing me.)


  1. I can see two ways to improve upon this: 1) AI safety people seem to have better intuitions (cf popularity of 1-boxing among alignment researchers) and maybe can influence the development of AGI in a better direction, e.g., to learn from / defer to humans with intuitions more like themselves. 2) We figure out metaphilosophy, which lets AGI figure out how to improve upon humans. (ETA: However, conditioning on there not being a simple and elegant solution to decision theory also seems to make it much less likely that metaphilosophy is simple and elegant. So what would "figure out metaphilosophy" mean in that case?) ↩︎

  2. I can also see the situation potentially being even worse, since many future threats will be very "out of distribution" for human evolution/history/intuitions/reasoning, so maybe we end up handling them even worse than current threats. ↩︎