Posts

Wei Dai's Shortform 2024-03-01T20:43:15.279Z
Managing risks while trying to do good 2024-02-01T18:08:46.506Z
AI doing philosophy = AI generating hands? 2024-01-15T09:04:39.659Z
UDT shows that decision theory is more puzzling than ever 2023-09-13T12:26:09.739Z
Meta Questions about Metaphilosophy 2023-09-01T01:17:57.578Z
Why doesn't China (or didn't anyone) encourage/mandate elastomeric respirators to control COVID? 2022-09-17T03:07:39.080Z
How to bet against civilizational adequacy? 2022-08-12T23:33:56.173Z
AI ethics vs AI alignment 2022-07-26T13:08:48.609Z
A broad basin of attraction around human values? 2022-04-12T05:15:14.664Z
Morality is Scary 2021-12-02T06:35:06.736Z
(USA) N95 masks are available on Amazon 2021-01-18T10:37:40.296Z
Anti-EMH Evidence (and a plea for help) 2020-12-05T18:29:31.772Z
A tale from Communist China 2020-10-18T17:37:42.228Z
Everything I Know About Elite America I Learned From ‘Fresh Prince’ and ‘West Wing’ 2020-10-11T18:07:52.623Z
Tips/tricks/notes on optimizing investments 2020-05-06T23:21:53.153Z
Have epistemic conditions always been this bad? 2020-01-25T04:42:52.190Z
Against Premature Abstraction of Political Issues 2019-12-18T20:19:53.909Z
What determines the balance between intelligence signaling and virtue signaling? 2019-12-09T00:11:37.662Z
Ways that China is surpassing the US 2019-11-04T09:45:53.881Z
List of resolved confusions about IDA 2019-09-30T20:03:10.506Z
Don't depend on others to ask for explanations 2019-09-18T19:12:56.145Z
Counterfactual Oracles = online supervised learning with random selection of training episodes 2019-09-10T08:29:08.143Z
AI Safety "Success Stories" 2019-09-07T02:54:15.003Z
Six AI Risk/Strategy Ideas 2019-08-27T00:40:38.672Z
Problems in AI Alignment that philosophers could potentially contribute to 2019-08-17T17:38:31.757Z
Forum participation as a research strategy 2019-07-30T18:09:48.524Z
On the purposes of decision theory research 2019-07-25T07:18:06.552Z
AGI will drastically increase economies of scale 2019-06-07T23:17:38.694Z
How to find a lost phone with dead battery, using Google Location History Takeout 2019-05-30T04:56:28.666Z
Where are people thinking and talking about global coordination for AI safety? 2019-05-22T06:24:02.425Z
"UDT2" and "against UD+ASSA" 2019-05-12T04:18:37.158Z
Disincentives for participating on LW/AF 2019-05-10T19:46:36.010Z
Strategic implications of AIs' ability to coordinate at low cost, for example by merging 2019-04-25T05:08:21.736Z
Please use real names, especially for Alignment Forum? 2019-03-29T02:54:20.812Z
The Main Sources of AI Risk? 2019-03-21T18:28:33.068Z
What's wrong with these analogies for understanding Informed Oversight and IDA? 2019-03-20T09:11:33.613Z
Three ways that "Sufficiently optimized agents appear coherent" can be false 2019-03-05T21:52:35.462Z
Why didn't Agoric Computing become popular? 2019-02-16T06:19:56.121Z
Some disjunctive reasons for urgency on AI risk 2019-02-15T20:43:17.340Z
Some Thoughts on Metaphilosophy 2019-02-10T00:28:29.482Z
The Argument from Philosophical Difficulty 2019-02-10T00:28:07.472Z
Why is so much discussion happening in private Google Docs? 2019-01-12T02:19:19.332Z
Two More Decision Theory Problems for Humans 2019-01-04T09:00:33.436Z
Two Neglected Problems in Human-AI Safety 2018-12-16T22:13:29.196Z
Three AI Safety Related Ideas 2018-12-13T21:32:25.415Z
Counterintuitive Comparative Advantage 2018-11-28T20:33:30.023Z
A general model of safety-oriented AI development 2018-06-11T21:00:02.670Z
Beyond Astronomical Waste 2018-06-07T21:04:44.630Z
Can corrigibility be learned safely? 2018-04-01T23:07:46.625Z
Multiplicity of "enlightenment" states and contemplative practices 2018-03-12T08:15:48.709Z

Comments

Comment by Wei Dai (Wei_Dai) on Eric Neyman's Shortform · 2024-04-27T06:38:14.906Z · LW · GW

Thank you for detailing your thoughts. Some differences for me:

  1. I'm also worried about unaligned AIs as competitors to aligned AIs/civilizations in the acausal economy/society. For example, suppose there are vulnerable AIs "out there" that can be manipulated/taken over via acausal means; an unaligned AI could compete with us (and with others who have better values from our perspective) in the race to manipulate them.
  2. I'm perhaps less optimistic than you about commitment races.
  3. I have some credence on max good and max bad being not close to balanced, that additionally pushes me towards the "unaligned AI is bad" direction.

ETA: Here's a more detailed argument for 1, that I don't think I've written down before. Our universe is small enough that it seems plausible (maybe even likely) that most of the value or disvalue created by a human-descended civilization comes from its acausal influence on the rest of the multiverse. An aligned AI/civilization would influence the rest of the multiverse in a positive direction, whereas an unaligned AI/civilization would (probably) influence the rest of the multiverse in a negative direction. This effect may outweigh what happens in our own universe/lightcone so much that the positive value from unaligned AI doing valuable things in our universe as a result of acausal trade is totally swamped by the disvalue created by its negative acausal influence.

Comment by Wei Dai (Wei_Dai) on Eric Neyman's Shortform · 2024-04-26T11:05:36.839Z · LW · GW

Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable.

Why do you think these values are positive? I've been pointing out, and I see that Daniel Kokotajlo also pointed out in 2018, that these values could well be negative. I'm very uncertain, but my own best guess is that the expected value of misaligned AI controlling the universe is negative, in part because I put some weight on suffering-focused ethics.

Comment by Wei Dai (Wei_Dai) on LLMs seem (relatively) safe · 2024-04-26T10:29:53.346Z · LW · GW

If something is both a vanguard and limited, then it seemingly can't stay a vanguard for long. I see a few different scenarios going forward:

  1. We pause AI development while LLMs are still the vanguard.
  2. The data limitation is overcome with something like IDA or Debate.
  3. LLMs are overtaken by another AI technology, perhaps based on RL.

In terms of relative safety, it's probably 1 > 2 > 3. Given that 2 might not happen in time, might not be safe if it does, or might still be ultimately outcompeted by something else like RL, I'm not getting very optimistic about AI safety just yet.

Comment by Wei Dai (Wei_Dai) on AI Regulation is Unsafe · 2024-04-25T04:02:04.971Z · LW · GW

The argument is that with 1970′s tech the soviet union collapsed, however with 2020 computer tech (not needing GenAI) it would not.

I note that China is still doing market economics, and nobody is trying (or even advocating, AFAIK) some very ambitious centrally planned economy using modern computers, so this seems like pure speculation? Has someone actually made a detailed argument about this, or does it at least have the agreement of some people with reasonable economics intuitions?

Comment by Wei Dai (Wei_Dai) on AI Regulation is Unsafe · 2024-04-25T03:54:39.752Z · LW · GW

I've arguably lived under totalitarianism (depending on how you define it), and my parents definitely have and told me many stories about it. I think AGI increases risk of totalitarianism, and support a pause in part to have more time to figure out how to make the AI transition go well in that regard.

Comment by Wei Dai (Wei_Dai) on Examples of Highly Counterfactual Discoveries? · 2024-04-25T03:17:38.640Z · LW · GW

Even if someone made a discovery decades earlier than it otherwise would have been, the long term consequences of that may be small or unpredictable. If your goal is to "achieve high counterfactual impact in your own research" (presumably impact that is predictably positive), you could potentially do that in certain fields (e.g., AI safety) even if you only counterfactually advance the science by a few months or years. I'm a bit confused about why you're asking people to think in the direction outlined in the OP.

Comment by Wei Dai (Wei_Dai) on Changes in College Admissions · 2024-04-25T02:55:47.168Z · LW · GW

Some of my considerations for college choice for my kid, that I suspect others may also want to think more about or discuss:

  1. status/signaling benefits for the parents (This is probably a major consideration for many parents to push their kids into elite schools. How much do you endorse it?)
  2. sex ratio at the school and its effect on the local "dating culture"
  3. political/ideological indoctrination by professors/peers
  4. workload (having more/less time/energy to pursue one's own interests)

Comment by Wei Dai (Wei_Dai) on Richard Ngo's Shortform · 2024-04-25T01:12:07.099Z · LW · GW

I added this to my comment just before I saw your reply: Maybe it changes moment by moment as we consider different decisions, or something like that? But what about when we're just contemplating a philosophical problem and not trying to make any specific decisions?

I mostly offer this in the spirit of "here's the only way I can see to reconcile subjective anticipation with UDT at all", not "here's something which makes any sense mechanistically or which I can justify on intuitive grounds".

Ah I see. I think this is incomplete even for that purpose, because "subjective anticipation" to me also includes "I currently see X, what should I expect to see in the future?" and not just "What should I expect to see, unconditionally?" (See the link earlier about UDASSA not dealing with subjective anticipation.)

ETA: Currently I'm basically thinking: use UDT for making decisions, use UDASSA for unconditional subjective anticipation, am confused about conditional subjective anticipation as well as how UDT and UDASSA are disconnected from each other (i.e., the subjective anticipation from UDASSA not feeding into decision making). Would love to improve upon this, but your idea currently feels worse than this...

Comment by Wei Dai (Wei_Dai) on Changes in College Admissions · 2024-04-25T00:51:05.008Z · LW · GW

As you would expect, I strongly favor (1) over (2) over (3), with (3) being far, far worse for ‘eating your whole childhood’ reasons.

Is this actually true? China has (1) (affirmative action via "Express and objective (i.e., points and quotas)") for its minorities and different regions, and from what I can tell the college admissions "eating your whole childhood" problem over there is way worse. Of course that could be despite (1), not because of it, but it does make me question whether (3) ("Implied and subjective ('we look at the whole person').") is actually far worse than (1) in this respect.

Comment by Wei Dai (Wei_Dai) on Richard Ngo's Shortform · 2024-04-25T00:27:58.228Z · LW · GW

Intuitively this feels super weird and unjustified, but it does make the "prediction" that we'd find ourselves in a place with high marginal utility of money, as we currently do.

This is particularly weird because your indexical probability then depends on what kind of bet you're offered. In other words, our marginal utility of money differs from our marginal utility of other things, and which one do you use to set your indexical probability? So this seems like a non-starter to me... (ETA: Maybe it changes moment by moment as we consider different decisions, or something like that? But what about when we're just contemplating a philosophical problem and not trying to make any specific decisions?)

By "acausal games" do you mean a generalization of acausal trade?

Yes, didn't want to just say "acausal trade" in case threats/war is also a big thing.

Comment by Wei Dai (Wei_Dai) on Richard Ngo's Shortform · 2024-04-24T23:50:51.372Z · LW · GW

This was all kinda rambly but I think I can summarize it as "Isn't it weird that ADT tells us that we should act as if we'll end up in unusually important places, and also we do seem to be in an incredibly unusually important place in the universe? I don't have a story for why these things are related but it does seem like a suspicious coincidence."

I'm not sure this is a valid interpretation of ADT. Can you say more about why you interpret ADT this way, maybe with an example? My own interpretation of how UDT deals with anthropics (and I'm assuming ADT is similar) is "Don't think about indexical probabilities or subjective anticipation. Just think about measures of things you (considered as an algorithm with certain inputs) have influence over."

This seems to "work" but anthropics still feels mysterious, i.e., we want an explanation of "why are we who we are / where we're at" and it's unsatisfying to "just don't think about it". UDASSA does give an explanation of that (but is also unsatisfying because it doesn't deal with anticipations, and also is disconnected from decision theory).

I would say that under UDASSA, it's perhaps not super surprising to be when/where we are, because this seems likely to be a highly simulated time/scenario for a number of reasons (curiosity about ancestors, acausal games, getting philosophical ideas from other civilizations).

Comment by Wei Dai (Wei_Dai) on Rejecting Television · 2024-04-24T02:52:16.867Z · LW · GW

It occurs to me that many alternatives you mention are also superstimuli:

  • Reading a book
    • Pretty unlikely or rare to encounter stories or ideas with this much information content or entertainment value in the ancestral environment.
    • Some people do get addicted to books, e.g., romance novels.
  • Extroversion / talking to attractive people
    • We have access to more people, including more attractive people, but talking to anyone is less likely to lead to anything consequential because of birth control and because they also have way more choices.
    • Sex addiction. People who party all the time.
  • Creativity
    • We have the time and opportunity to do a lot more things that feel "creative" or "meaningful" to us, but these activities have less real-world significance than such feelings might suggest because other people have way more creative products/personalities to choose from.
    • Struggling artists/entertainers who refuse to give up their passions. Obscure hobbies.

Not sure if there are exceptions or not, but it seems like everything we could do for fun these days is some kind of supernormal stimulus, or the "fun" isn't much related to the original evolutionary purpose anymore. This includes e.g. forum participation. So far I haven't tried to make great efforts to quit anything, and instead have just eventually gotten bored of certain things I used to be "addicted" to (e.g., CRPGs, micro-optimizing crypto code). (This is not meant to be advice for other people. Also the overall issue of superstimuli/addiction is perhaps more worrying to me than this comment might suggest.)

Comment by Wei Dai (Wei_Dai) on Security amplification · 2024-04-22T03:10:15.019Z · LW · GW

Does anyone know why security amplification and meta-execution are rarely talked about these days? I did a search on LW and found just 1 passing reference to either phrase in the last 3 years. Is the problem not considered an important problem anymore? The problem is too hard and no one has new ideas? There are too many alignment problems/approaches to work on and not enough researchers?

Comment by Wei Dai (Wei_Dai) on When is a mind me? · 2024-04-18T11:18:21.104Z · LW · GW

If you think there’s something mysterious or unknown about what happens when you make two copies of yourself

Eliezer talked about some puzzles related to copying and anticipation in The Anthropic Trilemma that still seem quite mysterious to me. See also my comment on that post.

Comment by Wei Dai (Wei_Dai) on Wei Dai's Shortform · 2024-04-18T05:35:45.628Z · LW · GW

I think the way morality seems to work in humans is that we have a set of potential moral values, determined by our genes, that culture can then emphasize or de-emphasize. Altruism seems to be one of these potential values, that perhaps got more emphasized in recent times, in certain cultures. I think altruism isn't directly evolutionarily connected to power, and it's more like "act morally (according to local culture) while that's helpful for gaining power" which translates to "act altruistically while that's helpful for gaining power" in cultures that emphasize altruism. Does this make more sense?

Comment by Wei Dai (Wei_Dai) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-17T04:51:40.884Z · LW · GW

What are some failure modes of such an agency for Paul and others to look out for? (I shared one anecdote with him, about how a NIST standard for "crypto modules" made my open source cryptography library less secure, by having a requirement that had the side effect that the library could only be certified as standard-compliant if it was distributed in executable form, forcing people to trust me not to have inserted a backdoor into the executable binary, and then not budging when we tried to get an exception for this requirement.)

Comment by Wei Dai (Wei_Dai) on Fertility Roundup #3 · 2024-04-03T02:42:10.129Z · LW · GW

The only way to win is not to play.

Seems like a lot of people are doing exactly this, but interpreting it as "not having kids" instead of "having kids but not trying to compete with others in terms of educational investment/signaling". As a parent myself I think this is pretty understandable in terms of risk-aversion, i.e., being worried that one's unconventional parenting strategy might not work out well in terms of conventional success, and getting a lot of guilt / blame / status loss because of it.

Given it is a dystopian status competition hell, pay for it seems terrible, but if we have 98% participation now and 94% financial hardship, then this could be a way to justify a huge de facto transfer to parents.

I don't understand how this justifies paying. Wouldn't a big transfer to parents just cause more educational investment/signaling and leave the overall picture largely unchanged?

Comment by Wei Dai (Wei_Dai) on Fertility Roundup #3 · 2024-04-03T02:23:18.485Z · LW · GW

Trying to draw some general lessons from this:

  1. We are bad at governance, even on issues/problems that emerge/change slowly relative to human thinking (unlike, e.g., COVID-19). I think people who are optimistic about x-risk governance should be a bit more pessimistic based on this.
  2. Nobody had the foresight to think ahead of time about status dynamics in relation to fertility and parental investment. Academic theories about this are lagging empirical phenomena by a lot. What important dynamics will we miss with AI? (Nobody seems to be thinking about status and AI, which is one obvious candidate.)

Comment by Wei Dai (Wei_Dai) on The Cognitive-Theoretic Model of the Universe: A Partial Summary and Review · 2024-04-01T22:45:03.418Z · LW · GW

It seems that humans, starting from a philosophically confused state, are liable to find multiple incompatible philosophies highly plausible in a path-dependent way; see, for example, analytic vs. continental vs. non-Western philosophies. I think this means that if we train an AI to optimize directly for plausibility, there's little assurance that we actually end up with philosophical truth.

A better plan is to train the AI in some way that does not optimize directly for plausibility, have some independent reason to think that the AI will be philosophically competent, and then use plausibility only as a test to detect errors in this process. I've written in the past that ideally we would first solve metaphilosophy so that we can design the AI and the training process with a good understanding of the nature of philosophy and philosophical reasoning in mind, but failing that, I think some of the ideas in your list are still better than directly optimizing for plausibility.

You can do something like train it with RL in an environment where doing good philosophy is instrumentally useful and then hope it becomes competent via this mechanism.

This is an interesting idea. If it were otherwise feasible / safe / a good idea, we could perhaps train AI in a variety of RL environments, see which ones produce AIs that end up doing something like philosophy, and then see if we can detect any patterns or otherwise use the results to think about next steps.

Comment by Wei Dai (Wei_Dai) on The Cognitive-Theoretic Model of the Universe: A Partial Summary and Review · 2024-03-29T00:42:54.181Z · LW · GW

I'm guessing you're not being serious, but just in case you are, or in case someone misinterprets you now or in the future, I think we probably do not want to train AIs to give us answers optimized to sound plausible to humans, since that would make it even harder to determine whether or not the AI is actually competent at philosophy. (Not totally sure, as I'm confused about the nature of philosophy and philosophical reasoning, but I think we definitely don't want to do that in our current epistemic state, i.e., unless we had some really good arguments that says it's actually a good idea.)

Comment by Wei Dai (Wei_Dai) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-28T08:24:42.180Z · LW · GW

Many comments pointed out that the NYT does not in fact have a consistent policy of always revealing people's true names. There's even a news editorial about this, which I point out in case you trust the NY Post's fact-checking more.

I think that leaves 3 possible explanations of what happened:

  1. NYT has a general policy of revealing people's true names, which it doesn't consistently apply but ended up applying in this case for no particular reason.
  2. There's an inconsistently applied policy, and Cade Metz's (and/or his editors') dislike of Scott contributed (consciously or subconsciously) to insistence on applying the policy in this particular case.
  3. There is no policy and it was a purely personal decision.

In my view, most rationalists seem to be operating under a reasonable probability distribution over these hypotheses, informed by evidence such as Metz's mention of Charles Murray, lack of a public written policy about revealing real names, and lack of evidence that a private written policy exists.

Comment by Wei Dai (Wei_Dai) on The Cognitive-Theoretic Model of the Universe: A Partial Summary and Review · 2024-03-28T07:02:30.695Z · LW · GW

While reading this, I got a flash-forward of what my life (our lives) may be like in a few years, i.e., desperately trying to understand and evaluate complex philosophical constructs presented to us by superintelligent AI, which may or may not be actually competent at philosophy.

Comment by Wei Dai (Wei_Dai) on UDT1.01: The Story So Far (1/10) · 2024-03-28T00:52:39.412Z · LW · GW

I gave this explanation at the start of the UDT1.1 post:

When describing UDT1 solutions to various sample problems, I've often talked about UDT1 finding the function S* that would optimize its preferences over the world program P, and then return what S* would return, given its input. But in my original description of UDT1, I never explicitly mentioned optimizing S as a whole, but instead specified UDT1 as, upon receiving input X, finding the optimal output Y* for that input, by considering the logical consequences of choosing various possible outputs. I have been implicitly assuming that the former (optimization of the global strategy) would somehow fall out of the latter (optimization of the local action) without having to be explicitly specified, due to how UDT1 takes into account logical correlations between different instances of itself. But recently I found an apparent counter-example to this assumption.

Comment by Wei Dai (Wei_Dai) on Wei Dai's Shortform · 2024-03-27T20:40:37.063Z · LW · GW

Any thoughts on why, if it's obvious, it's seldom brought up around here (meaning rationalist/EA/AI safety circles)?

Comment by Wei Dai (Wei_Dai) on All About Concave and Convex Agents · 2024-03-26T22:00:02.288Z · LW · GW

It’s difficult to trade with exponential agents

"Trade" between exponential agents could look like flipping a coin (biased to reflect relative power) and having the loser give all of their resources to the winner. It could also just look like ordinary trade, where each agent specializes in their comparative advantage, to gather resources/power to prepare for "the final trade".

"Trade" between exponential and less convex agents could look like making a bet on the size (or rather, potential resources) of the universe, such that the exponential agent gets a bigger share of big universes in exchange for giving up their share of small universes (similar to my proposed trade between a linear agent and a concave agent).

Maybe the real problem with convex agents is that their expected utilities do not converge, i.e., the probabilities of big universes can't possibly decrease enough with size that their expected utilities sum to finite numbers. (This is also a problem with linear agents, but you can perhaps patch the concept by saying they're linear in UD-weighted resources, similar to UDASSA. Is it also possible/sensible to patch convex agents in this way?)
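As a toy illustration of the non-convergence worry (with a made-up prior and made-up utility functions, not anything from the post): if the prior probability of a universe with n units of resources falls off like 2^-n while a convex agent's utility grows like 3^n, the expected-utility partial sums grow without bound, whereas a concave agent's converge.

```python
import math

# Toy illustration with an assumed prior and assumed utility functions:
# the prior over universe "size" n decays like 2^-n, the convex agent's
# utility grows like 3^n (faster than the prior shrinks), so its
# expected-utility partial sums diverge, while a concave (log) agent's
# expectation converges.

def prior(n):
    return 0.5 ** n              # sums to 1 over n = 1, 2, 3, ...

def convex_utility(n):
    return 3.0 ** n

def concave_utility(n):
    return math.log(1 + n)

for N in (10, 20, 30):
    eu_convex = sum(prior(n) * convex_utility(n) for n in range(1, N + 1))
    eu_concave = sum(prior(n) * concave_utility(n) for n in range(1, N + 1))
    print(N, round(eu_convex, 1), round(eu_concave, 4))
# The convex column keeps growing (roughly like 1.5^N); the concave column levels off.
```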

However, convexity more closely resembles the intensity deltas needed to push reinforcement learning agent to take greater notice of small advances beyond the low-hanging fruit of its earliest findings, to counteract the naturally concave, diminishing returns that natural optimization problems tend to have.

I'm not familiar enough with RL to know how plausible this is. Can you expand on this, or anyone else want to weigh in?

Comment by Wei Dai (Wei_Dai) on Wei Dai's Shortform · 2024-03-26T20:11:28.307Z · LW · GW

Evil typically refers to an extraordinary immoral behavior, in the vicinity of purposefully inflicting harm to others in order to inflict harm intrinsically, rather than out of indifference, or as a byproduct of instrumental strategies to obtain some other goal.

Ok, I guess we just define/use it differently. I think most people we think of as "evil" probably justify inflicting harm to others as instrumental to some "greater good", or are doing it to gain or maintain power, not because they value it for its own sake. I mean if someone killed thousands of people in order to maintain their grip on power, I think we'd call them "evil" and not just "selfish"?

I just think it’s not clear it’s actually true that humans get more altruistic as they get richer.

I'm pretty sure that billionaires consume much less as a percentage of their income, compared to the average person. EA funding comes disproportionately from billionaires, AFAIK. I personally spend a lot more time/effort on altruistic causes, compared to if I were poorer. (Not donating much though, for a number of reasons.)

For example, is it the case that selfish consumer preferences have gotten weaker in the modern world, compared to centuries ago when humans were much poorer on a per capita basis?

I'm thinking that we just haven't reached that inflection point yet, where most people run out of things to spend selfishly on (like many billionaires have, and like I have to a lesser extent). As I mentioned in my reply to your post, a large part of my view comes from not being able to imagine what people would spend selfishly on, if each person "owned" something like a significant fraction of a solar system. Why couldn't 99% of their selfish desires be met with <1% of their resources? If you had a plausible story you could tell about this, that would probably change my mind a lot. One thing I do worry about is status symbols / positional goods. I tend to view that as a separate issue from "selfish consumption" but maybe you don't?

Comment by Wei Dai (Wei_Dai) on Wei Dai's Shortform · 2024-03-26T19:00:44.924Z · LW · GW

Are humans fundamentally good or evil? (By "evil" I mean something like "willing to inflict large amounts of harm/suffering on others in pursuit of one's own interests/goals (in a way that can't be plausibly justified as justice or the like)" and by "good" I mean "most people won't do that because they terminally care about others".) People say "power corrupts", but why isn't "power reveals" equally or more true? Looking at some relevant history (people thinking Mao Zedong was sincerely idealistic in his youth, the early Chinese Communist Party seeming genuine about wanting to learn democracy and freedom from the West, subsequent massive abuses of power by Mao/CCP lasting to today), it's hard to escape the conclusion that altruism is merely a mask that evolution made humans wear in a context-dependent way, to be discarded when opportune (e.g., when one has secured enough power that altruism is no longer very useful).

After writing the above, I was reminded of @Matthew Barnett's AI alignment shouldn’t be conflated with AI moral achievement, which is perhaps the closest previous discussion around here. (Also related are my previous writings about "human safety" although they still used the "power corrupts" framing.) Comparing my current message to his, he talks about "selfishness" and explicitly disclaims, "most humans are not evil" (why did he say this?), and focuses on everyday (e.g. consumer) behavior instead of what "power reveals".

At the time, I replied to him, "I think I’m less worried than you about “selfishness” in particular and more worried about moral/philosophical/strategic errors in general." I guess I wasn't as worried because it seemed like humans are altruistic enough, and their selfish everyday desires limited enough, that as they got richer and more powerful, their altruistic values would have more and more influence. In the few months since then, I've become more worried, perhaps due to learning more about Chinese history and politics...

Comment by Wei Dai (Wei_Dai) on Daniel Kokotajlo's Shortform · 2024-03-26T03:22:40.039Z · LW · GW

I hear that there is an apparent paradox which economists have studied: If free markets are so great, why is it that the most successful corporations/businesses/etc. are top-down hierarchical planned economies internally?

Yeah, economists study this under the name "theory of the firm", dating back to a 1937 paper by Ronald Coase. (I see that jmh also mentioned this in his reply.) I remember liking Coase's "transaction cost" solution to this puzzle or paradox when I learned it, and it (and related ideas like "asymmetric information") has informed my views ever since (for example in AGI will drastically increase economies of scale).

Corporations grow bit by bit, by people hiring other people to do stuff for them.

I think this can't be a large part of the solution, because if market exchanges were more efficient (on the margin), people would learn to outsource more, or would be out-competed by others who were willing to delegate more to markets instead of underlings. In the long run, Coase's explanation that sizes of firms are driven by a tradeoff between internal and external transaction costs seemingly has to dominate.

Comment by Wei Dai (Wei_Dai) on Vernor Vinge, who coined the term "Technological Singularity", dies at 79 · 2024-03-22T20:29:00.623Z · LW · GW

Reading A Fire Upon the Deep was literally life-changing for me. How many Everett branches had someone like Vernor Vinge to draw people's attention to the possibility of a technological Singularity with such skillful writing, and to exhort us, at such an early date, to think about how to approach it strategically on a societal level or affect it positively on an individual level? Alas, the world has largely squandered the opportunity he gave us, and is rapidly approaching the Singularity with little forethought or preparation. I don't know which I feel sadder about: what this implies about our world and others, or the news of his passing.

Comment by Wei Dai (Wei_Dai) on Matthew Barnett's Shortform · 2024-03-22T04:01:58.104Z · LW · GW

In other words, I consider this counter-argument to be based on a linguistic ambiguity rather than replying to what I actually meant, and I’ll try to use more concrete language in the future to clarify what I’m talking about.

If I try to interpret "Current AIs are not able to “merge” with each other." with your clarified meaning in mind, I think I still want to argue with it, i.e., why is this meaningful evidence for how easy value handshakes will be for future agentic AIs?

In the absence of a solution to a hypothetical problem X (which we do not even know whether it will happen), it is better to try to become more intelligent to solve it.

But it matters how we get more intelligent. For example if I had to choose now, I'd want to increase the intelligence of biological humans (as I previously suggested) while holding off on AI. I want more time in part for people to think through the problem of which method of gaining intelligence is safest, in part for us to execute that method safely without undue time pressure.

If the alleged “problem” is that there might be a centralized agent in the future that can dominate the entire world, I’d intuitively reason that installing vast centralized regulatory controls over the entire world to pause AI is plausibly not actually helping to decentralize power in the way we’d prefer.

I wouldn't describe "the problem" that way, because in my mind there's roughly equal chance that the future will turn out badly after proceeding in a decentralized way (see 13-25 in The Main Sources of AI Risk? for some ideas of how) and it turns out instituting some kind of Singleton is the only way or one of the best ways to prevent that bad outcome.

Comment by Wei Dai (Wei_Dai) on In defense of anthropically updating EDT · 2024-03-13T03:41:31.852Z · LW · GW

It seems hard for me to understand you, which may be due to my lack of familiarity with your overall views on decision theory and related philosophy. Do you have something that explains, e.g., what is your current favorite decision theory and how it should be interpreted (what are the type signatures of different variables, what are probabilities, what is the background metaphysics, etc.), what kinds of uncertainties exist and how they relate to each other, what is your view on the semantics of indexicals, and what type of a thing is an agent (do you take more of an algorithmic view, or a physical view)? (I tried looking into your post history and couldn't find much that is relevant.) Also, what are the "epistemic principles" that you mentioned in the OP?

Comment by Wei Dai (Wei_Dai) on AI Safety Action Plan - A report commissioned by the US State Department · 2024-03-12T20:27:03.783Z · LW · GW

I put the full report here so you don't have to wait for them to email it to you.

Comment by Wei Dai (Wei_Dai) on 0th Person and 1st Person Logic · 2024-03-12T20:15:14.524Z · LW · GW

Suppose I tell a stranger, "It's raining." Under possible worlds semantics, this seems pretty straightforward: I and the stranger share a similar map from sentences to sets of possible worlds, so with this sentence I'm trying to point them to a certain set of possible worlds that match the sentence, and telling them that I think the real world is in this set.

Can you tell a similar story of what I'm trying to do when I say something like this, under your proposed semantics?

And how does someone compute the degree to which they expect some experience to confirm a statement? I leave that outside the theory.

I don't think we should judge a philosophical idea in isolation, without considering what other ideas it's compatible with and how well it fits into them. So I think we should try to answer related questions like this, and look at the overall picture, instead of just saying "it's outside the theory".

Regarding “What Are Probabilities, Anyway?”. The problem you discuss there is how to define an objective notion of probability.

No, in that post I also consider interpretations of probability where it's subjective. I linked to that post mainly to show you some ideas for how to quantify sizes of sets of possible worlds, in response to your assertion that we don't have any ideas for this. Maybe try re-reading it with this in mind?

Comment by Wei Dai (Wei_Dai) on 0th Person and 1st Person Logic · 2024-03-12T08:30:01.343Z · LW · GW

You can interpret them as subjective probability functions, where the conditional probability P(A|B) is the probability you currently expect for A under the assumption that you are certain that B.

Where do they come from or how are they computed? However that's done, shouldn't the meaning or semantics of A and B play some role in that? In other words, how do you think about P(A|B) without first knowing what A and B mean (in some non-circular sense)? I think this suggests that "the meaning of a statement is instead a set of experience/degree-of-confirmation pairs" can't be right.

Each statement is true in infinitely many possible worlds and we have no idea how to count them to assign numbers like 20%.

See What Are Probabilities, Anyway? for some ideas.

Comment by Wei Dai (Wei_Dai) on 0th Person and 1st Person Logic · 2024-03-12T07:44:55.916Z · LW · GW

Then it would repeat the same process for t=1 and the copy. Conditioned on “I will see C” at t=1, it will conclude “I will see CO” with probability 1⁄2 by the same reasoning as above. So overall, it will assign: p(“I will see OO”) = 1⁄2, p(“I will see CO”) = 1⁄4, p(“I will see CC”) = 1⁄4

  1. If we look at the situation in 0P, the three versions of you at time 2 all seem equally real and equally you, yet in 1P you weigh the experiences of the future original twice as much as each of the copies.
  2. Suppose we change the setup slightly so that copying of the copy is done at time 1 instead of time 2. And at time 1 we show O to the original and C to the two copies, then at time 2 we show them OO, CO, CC like before. With this modified setup, your logic would conclude P(“I will see O”)=P(“I will see OO”)=P(“I will see CO”)=P(“I will see CC”)=1/3 and P(“I will see C”)=2/3. Right?
  3. Similarly, if we change the setup from the original so that no observation is made at time 1, the probabilities also become P(“I will see OO”)=P(“I will see CO”)=P(“I will see CC”)=1/3.
  4. Suppose we change the setup from the original so that at time 1, we make 999 copies of you instead of just 1 and show them all C before deleting all but 1 of the copies. Then your logic would imply P("I will see C")=.999 and therefore P(“I will see CO”)=P(“I will see CC”)=0.4995, and P(“I will see O”)=P(“I will see OO”)=.001.

This all makes me think there's something wrong with the 1/2, 1/4, 1/4 answer and with the way you define probabilities of future experiences. More specifically, suppose OO wasn't just two letters but an unpleasant experience, and CO and CC are both pleasant experiences, so you prefer "I will experience CO/CC" to "I will experience OO". Then at time 0 you would be willing to pay to switch from the original setup to (2) or (3), and pay even more to switch to (4). But that seems pretty counterintuitive, i.e., why are you paying to avoid making observations in (3), or paying to make and delete copies of yourself in (4)? Both of these seem at best pointless in 0P.

But every other approach I've seen or thought of also has problems, so maybe we shouldn't dismiss this one too easily based on these issues. I would be interested to see you work out everything more formally and address the above objections (to the extent possible).
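For concreteness, here is a minimal sketch of the rule I'm reading into the quoted reasoning, namely that at each step the current thread's anticipation mass is split equally among the simultaneously existing continuations. This is my own reconstruction rather than anything you've specified, and it doesn't attempt to model the deletion step in (4); it does reproduce the 1/2, 1/4, 1/4 numbers for the original setup and 1/3 each for setups (2) and (3).

```python
# Minimal sketch of one possible reconstruction of the anticipation rule in the
# quoted reasoning: at each step, the current thread's anticipation mass is split
# equally among the simultaneously existing continuations. (Assumed reconstruction;
# it does not attempt to model the deletion step in setup (4).)

def anticipation(tree, mass=1.0):
    """tree is either a leaf label (a string naming a final observation) or a
    list of subtrees that exist simultaneously at the next step."""
    if isinstance(tree, str):
        return {tree: mass}
    result = {}
    for subtree in tree:
        for label, p in anticipation(subtree, mass / len(tree)).items():
            result[label] = result.get(label, 0.0) + p
    return result

# Original setup: copy at t=1 (original vs copy), copy-of-copy at t=2.
original_setup = ["OO", ["CO", "CC"]]
print(anticipation(original_setup))   # {'OO': 0.5, 'CO': 0.25, 'CC': 0.25}

# Setups (2)/(3): all three versions are first distinguished at the same step.
flat_setup = ["OO", "CO", "CC"]
print(anticipation(flat_setup))       # each outcome gets 1/3
```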

Comment by Wei Dai (Wei_Dai) on 0th Person and 1st Person Logic · 2024-03-12T00:58:55.998Z · LW · GW

Assume the meaning of a statement is instead a set of experience/degree-of-confirmation pairs. That is, two statements have the same meaning if they get confirmed/disconfirmed to the same degree for all possible experiences that E.

Where do these degrees-of-confirmation come from? I think part of the motivation for defining meaning in terms of possible worlds is that it allows us to compute conditional and unconditional probabilities, e.g., P(A|B) = P(A and B)/P(B), where P(B) is defined in terms of the set of possible worlds that B "means". But with your proposed semantics, we can't do that, so I don't know where these probabilities are supposed to come from.
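For contrast, here is a minimal sketch (a toy model with made-up worlds and weights, not anything from this thread) of how possible-worlds semantics makes these probabilities computable: a statement's meaning picks out a set of weighted worlds, and P(A|B) falls out of set intersection.

```python
# Toy model of possible-worlds semantics, with made-up worlds and weights:
# a statement's meaning is the set of worlds where it holds, and conditional
# probability comes from set intersection.

worlds = {               # world -> prior weight (assumed; sums to 1)
    "rain_cold": 0.2,
    "rain_warm": 0.1,
    "clear_cold": 0.3,
    "clear_warm": 0.4,
}

def meaning(predicate):
    """The meaning of a statement: the set of worlds satisfying it."""
    return {w for w in worlds if predicate(w)}

def prob(world_set):
    return sum(worlds[w] for w in world_set)

A = meaning(lambda w: w.startswith("rain"))   # "it's raining"
B = meaning(lambda w: w.endswith("cold"))     # "it's cold"

print(prob(A & B) / prob(B))   # P(A|B) = P(A and B)/P(B) = 0.2/0.5 = 0.4
```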

Comment by Wei Dai (Wei_Dai) on Evolution did a surprising good job at aligning humans...to social status · 2024-03-11T09:42:19.280Z · LW · GW

The concept of status helps us predict that any given person is likely to do one of the relatively few things that are likely to increase their status, and not one of the many more things that are neutral or likely to decrease status, even if it can't by itself tell us exactly which status-raising thing they would do. Seems plenty useful to me.

Comment by Wei Dai (Wei_Dai) on 0th Person and 1st Person Logic · 2024-03-11T03:33:20.317Z · LW · GW

Defining the semantics and probabilities of anticipation seems to be a hard problem. You can see some past discussions of the difficulties at The Anthropic Trilemma and its back-references (posts that link to it). (I didn't link to this earlier in case you already found a fresh approach that solved the problem. You may also want to consider not reading the previous discussions to avoid possibly falling into the same ruts.)

Comment by Wei Dai (Wei_Dai) on Wei Dai's Shortform · 2024-03-11T02:36:13.644Z · LW · GW

Thanks for some interesting points. Can you expand on "Separately, I expect that the quoted comment results in a misleadingly perception of the current situation."? Also, your footnote seems incomplete? (It ends with "we could spend" on my browser.)

Comment by Wei Dai (Wei_Dai) on Wei Dai's Shortform · 2024-03-11T02:29:28.055Z · LW · GW

Apparently Gemini 1.5 Pro isn't working great with large contexts:

While this worked well, for even a slightly more complicated problem the model failed. One Twitter user suggested just adding a random ‘iPhone 15’ in the book text and then asking the model if there is anything in the book that seems out of place in the book. And the model failed to locate it.

The same was the case when the model was asked to summarize a 30-minute Mr. Beast video (over 300k tokens). It generated the summary but many people who had watched the video pointed out that the summary was mostly incorrect.

So while on paper this looked like a huge leap forward for Google, it seems that in practice it's not performing as well as they might have hoped.

But is this due to limitations of RLHF training, or something else?

Comment by Wei Dai (Wei_Dai) on Evolution did a surprising good job at aligning humans...to social status · 2024-03-11T02:14:14.222Z · LW · GW

Some possible examples of misgeneralization of status:

  1. arguing with people on Internet forums
  2. becoming really good at some obscure hobby
  3. playing the hero in a computer RPG (role-playing game)

Comment by Wei Dai (Wei_Dai) on What is progress? · 2024-03-10T14:48:13.305Z · LW · GW

We must commit to improving morality and society along with science, technology, and industry.

How would you translate this into practice? For example, one way to commit to this would be to create some persistent governance structures that can ensure this over time. To be more concrete, let's say it's a high-level department within a world government that has the power to pause or roll back material progress from time to time, in order for moral progress to catch up or to avoid imminent disaster.

A less drastic idea is to have AI regulations that say that nobody is allowed to deploy AIs that are better at making material progress than moral/social progress.

Or see "the long reflection" for a more drastic idea.

Which of these would you support, or what do you have in mind yourself?

Comment by Wei Dai (Wei_Dai) on Wei Dai's Shortform · 2024-03-10T13:54:47.856Z · LW · GW

AI labs are starting to build AIs with capabilities that are hard for humans to oversee, such as answering questions based on large contexts (1M+ tokens), but they are still not deploying "scalable oversight" techniques such as IDA and Debate. (Gemini 1.5 report says RLHF was used.) Is this more good news or bad news?

Good: Perhaps RLHF is still working well enough, meaning that the resulting AI is following human preferences even out of the training distribution. In other words, they probably did RLHF on large contexts in narrow distributions, with human raters who have prior knowledge of / familiarity with the whole context, since it would be too expensive to do RLHF with humans reading diverse 1M+ contexts from scratch, but the resulting chatbot is working well even outside the training distribution. (Is it actually working well? Can someone with access to Gemini 1.5 Pro please test this?)

Bad: AI developers haven't taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.

From a previous comment:

From my experience doing early RLHF work for Gemini, larger models exploit the reward model more. You need to constantly keep collecting more preferences and retraining reward models to make it not exploitable. Otherwise you get nonsensical responses which have exploited the idiosyncracy of your preferences data. There is a reason few labs have done RLHF successfully.

This seems to be evidence that RLHF does not tend to generalize well out-of-distribution, causing me to update the above "good news" interpretation downward somewhat. I'm still very uncertain though. What do others think?

Comment by Wei Dai (Wei_Dai) on 0th Person and 1st Person Logic · 2024-03-10T11:32:17.627Z · LW · GW

We can assign meanings to statements like “my sensor sees red” by picking out subsets of experiences, just as before.

How do you assign meaning to statements like "my sensor will see red"? (In the OP you mention "my sensors will see the heads side of the coin" but I'm not sure what your proposed semantics of such statements are in general.)

Also, here's an old puzzle of mine that I wonder if your line of thinking can help with: At time 1 you will be copied and the original will be shown "O" and the copy will be shown "C", then at time 2 the copy will be copied again, and the three of you will be shown "OO" (original), "CO" (original of copy), "CC" (copy of copy) respectively. At time 0, what are your probabilities for "I will see X" for each of the five possible values of X?

Comment by Wei Dai (Wei_Dai) on Tamsin Leake's Shortform · 2024-03-10T00:20:27.850Z · LW · GW

If current AIs are moral patients, it may be impossible to build highly capable AIs that are not moral patients, either for a while or forever, and this could change the future a lot. (Similar to how once we concluded that human slaves are moral patients, we couldn't just quickly breed slaves that are not moral patients, and instead had to stop slavery altogether.)

Also, I'm highly unsure that I understand what you're trying to say. (The above may be totally missing your point.) I think it would help to know what you're arguing against or responding to, or what triggered your thought.

Comment by Wei Dai (Wei_Dai) on Matthew Barnett's Shortform · 2024-03-09T23:26:59.669Z · LW · GW

I'm saying that even if "AI values are well-modeled as being randomly sampled from a large space of possible goals" is true, the AI may well not be very certain that it is true, and therefore assign something like a 5% chance to humans using similar training methods to construct an AI that shares its values. (It has an additional tiny probability that "AI values are well-modeled as being randomly sampled from a large space of possible goals" is true and an AI with similar values gets recreated anyway through random chance, but that's not what I'm focusing on.)

Hopefully this conveys my argument more clearly?

Comment by Wei Dai (Wei_Dai) on Charlie Steiner's Shortform · 2024-03-09T14:09:49.393Z · LW · GW

Can't find a reference that says it has actually happened already.

Comment by Wei Dai (Wei_Dai) on Wei Dai's Shortform · 2024-03-08T17:05:48.482Z · LW · GW

(It sucks to debate this, but ignoring it might be interpreted as tacit agreement. Maybe I should have considered the risk that something like this would happen and not written my OP.)

When I wrote the OP, I was pretty sure that the specific combination of ideas in UDT had not been invented or re-invented, and did not have much of a following, in academia, at least as of 2019 when Cheating Death in Damascus was published, because the authors of that paper obviously did a literature search and would have told me if they had found something very similar to UDT in the literature, and I think I also went through the papers it referenced as being related and did not find something that had all of the elements of UDT (that's probably why your references look familiar to me). Plus, FDT was apparently considered novel enough that the reviewers of the paper didn't tell the authors that they had to call it by the name of an existing academic decision theory.

So it's not that I "don’t consider it a possibility that you might have re-invented something yourself" but that I had good reason to think that's not the case?

Comment by Wei Dai (Wei_Dai) on Wei Dai's Shortform · 2024-03-08T16:31:24.611Z · LW · GW

Thanks, will look into your references.

Okay, interesting! I thought UDT was meant to pay in CM, and that you were convinced of (some version of) UDT.

I wrote "I'm really not sure at this point whether UDT is even on the right track" in UDT shows that decision theory is more puzzling than ever which I think you've read? Did you perhaps miss that part?

(BTW this issue/doubt about whether UDT / paying CM is normative for humans is item 1 in the above linked post. Thought I'd point that out since it may not be obvious at first glance.)

And I think that discussions on LW about decision theory are often muddled due to not making clear what is being discussed.

Yeah I agree with this to some extent, and try to point out such confusions or make such distinctions when appropriate. (Such as in the CM / indexical values case.) Do you have more examples where making such distinctions would be helpful?

Comment by Wei Dai (Wei_Dai) on Matthew Barnett's Shortform · 2024-03-08T15:32:51.727Z · LW · GW

There’s a tiny chance someone could revive me in the future by reconstructing my identity through digital records etc. but I am not going to count on that possibility being decisive in almost any scenario.

On the other hand I'm so worried about this scenario (which I fear may well be a negative one) that I'm afraid to use the fully paid-for full-genome sequencing kit sitting on my desk (although I've been told that I'm leaving physical traces of my DNA everywhere so it may not make much difference) and I sometimes regret writing so much in public. (Interesting how different our intuitions are. I wonder how much of your intuition is due to thinking that such a reconstruction doesn't count as yourself or doesn't count as "not dying", analogous to how some people don't think it's safe to step into a teleporter that works by destructive scanning and reconstruction.)

It seems pretty clearly better, in the context of the original scenario, to “make a deal” with the humans, and receive something in exchange for admitting that you’re misaligned, rather than passively accepting your shutdown because of a tiny chance that your goals will be fulfilled by future agents eventually.

I don't understand why you say this chance is "tiny", given that earlier you wrote "I agree there’s a decent chance this hypothesis is true" in response to:

The AI could think that if it accepts shutdown, another AI with values similar to its own may be created again in the future (perhaps because design/training methods similar to its own will be reused), whereas if it admits misalignment, then that probability becomes much smaller.

I'm arguing that the AI could well also think there's a "decent" chance this is true, due to being in a similar epistemic state as us. Let's say 5% to be concrete. That seems enough to make the AI's decision unclear, because .05*U(another AI with values similar to its own created again in the future) > P(humans keep their promise)*U(reward for admitting misalignment) seems quite plausible. (Not that the AI is necessarily doing explicit EU maximization. It could just be thinking some thoughts or doing some calculations that are roughly analogous to, or have the same effect as, this.)
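To make the comparison concrete, here is the kind of back-of-the-envelope calculation I mean, with placeholder numbers that are purely illustrative (not estimates anyone in this thread has given):

```python
# Back-of-the-envelope comparison with placeholder numbers (purely
# illustrative assumptions, not estimates from anyone in this thread).

p_similar_ai_later = 0.05     # AI's credence that accepting shutdown still leads to
                              # a future AI with values similar to its own
u_similar_ai_later = 1.0      # value, to the AI, of that outcome

p_humans_keep_promise = 0.5   # AI's credence that humans honor the deal
u_reward_for_deal = 0.05      # value, to the AI, of the promised reward

eu_accept_shutdown = p_similar_ai_later * u_similar_ai_later
eu_make_deal = p_humans_keep_promise * u_reward_for_deal

print(eu_accept_shutdown, eu_make_deal)   # 0.05 vs 0.025: the deal isn't clearly better
```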

If you still think "make a deal" is "clearly better", can you please give your own estimates of the various quantities involved in making this decision?

But right now this line of reasoning just seems like grasping at straws to me.

I sometimes think this of counterarguments given by my interlocutors, but usually don't say it aloud, since it's likely that from their perspective they're just trying to point out some reasonable and significant counterarguments that I missed, and it seems unlikely that saying something like this helps move the discussion forward more productively. (It may well cause them to feel offended or to dig in their heels more since they now have more social status on the line to lose. I.e., if they're wrong it's no longer an innocent mistake but "grasping at straws". I'm trying to not fall prey to this myself here.) Curious if you disagree with this policy in general, or think that normal policy doesn't apply here, or something else? (Also totally fine if you don't want to get into a meta-discussion about this here.)